Methodology
Nomenclature of the substrate specificity of proteases
The web interface is straightforward to use: the user only needs to input the sequence of the query substrate protein in one-letter FASTA format. A typical task for a query substrate sequence of ~500 residues normally takes 8-12 minutes. Once the prediction task is completed, the user will receive an email containing a link to a webpage with all the prediction results for the query sequence.
Proteases specifically cleave protein substrates from the N- or C-terminus, or in the middle of the substrate, through the binding of the protease active site to the substrate residues flanking the cleavage site. As defined by Schechter and Berger, the active site of a protease is composed of contiguous pockets termed subsites (Schechter and Berger, 1967). Each subsite pocket binds to a corresponding residue in the substrate sequence, referred to here as the sequence position. According to this definition, amino acid residues in the substrate sequence are consecutively numbered outward from the cleavage site as ...-P4-P3-P2-P1-P1'-P2'-P3'-P4'-... (the scissile bond is located between the P1 and P1' positions), while the subsites in the active site are correspondingly labelled as ...-S4-S3-S2-S1-S1'-S2'-S3'-S4'-... (Figure 1).
Figure 1. The nomenclature of the substrate specificity of proteases.
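The numbering scheme can be illustrated with a minimal Python sketch. The function name label_positions, the 0-based bond_index convention and the example sequence are assumptions made for illustration only; they are not part of PROSPER.

```python
def label_positions(sequence, bond_index, n_flank=4):
    """Label residues around a scissile bond with Schechter-Berger positions.

    bond_index is the 0-based index of the P1' residue, so the scissile bond
    lies between sequence[bond_index - 1] (P1) and sequence[bond_index] (P1').
    Positions that would fall beyond either terminus are simply omitted.
    """
    if not 0 < bond_index < len(sequence):
        raise ValueError("the scissile bond must lie between two residues")
    labels = []
    # Non-primed side, numbered outward from the bond: P4, P3, P2, P1.
    for k in range(n_flank, 0, -1):
        i = bond_index - k
        if i >= 0:
            labels.append((f"P{k}", sequence[i]))
    # Primed side, numbered outward from the bond: P1', P2', P3', P4'.
    for k in range(1, n_flank + 1):
        i = bond_index + k - 1
        if i < len(sequence):
            labels.append((f"P{k}'", sequence[i]))
    return labels


# Illustrative sequence with the bond placed between D (index 5) and G (index 6).
for label, residue in label_positions("AADEVDGSAA", 6):
    print(label, residue)
```

For this example the loop prints D, E, V, D for P4 to P1 and G, S, A, A for P1' to P4'.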
We built the PROSPER models for predicting cleavage sites of protease substrates by combining a number of complementary sequence and structural features, including local amino acid sequence profiles, predicted secondary structure, solvent accessibility and natively disordered regions, as well as some global sequence features. In particular, local sequence profiles surrounding the cleavage sites were extracted using a bi-profile Bayesian approach (Shao et al., 2009; Song et al., 2010).
For a substrate sequence submitted by the user, the local sequence and structural profiles surrounding the potential cleavage sites will be extracted and input into PROSPER, as shown in Figure 2. In particular, we employed a local sliding window approach (either the P4-P2' or the P8-P8' local window, as shown in Figure 2) to extract and encode the sequence and structural features of potential substrate cleavage sites using the bi-profile Bayesian feature extraction approach. As described above and in our previous work (Song et al., 2010), these features are divided into four different types of sequence/structure profiles: (i) bi-profile Bayesian amino acid profile (BPBAA); (ii) bi-profile Bayesian secondary structure profile (BPBSS); (iii) bi-profile Bayesian solvent accessibility profile (BPBSA); and (iv) bi-profile Bayesian disordered profile (BPBDISO). Given the encoding scheme and a potential cleavage site, the feature vector for input into the SVR model is encoded by concatenating the constitutive features of the corresponding scheme.
Figure 2. A sliding window approach to extract the local sequence and structural features surrounding the potential cleavage sites.
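The sliding window extraction can be sketched as follows. This is a minimal illustration rather than the PROSPER implementation: the function name sliding_windows and the use of the padding symbol "X" for window positions that extend beyond the termini are assumptions.

```python
def sliding_windows(sequence, n_left=4, n_right=2, pad="X"):
    """Yield (bond_index, window) for every peptide bond in the sequence.

    The window spans P{n_left} ... P1 and P1' ... P{n_right}'; the defaults give
    the P4-P2' window, and n_left=8, n_right=8 gives the P8-P8' window.
    bond_index is the 0-based index of the P1' residue.
    Padding with `pad` near the termini is an assumption of this sketch.
    """
    padded = pad * n_left + sequence + pad * n_right
    width = n_left + n_right
    for bond_index in range(1, len(sequence)):
        # Residue P{n_left} sits at original index bond_index - n_left,
        # which is index bond_index in the padded string.
        yield bond_index, padded[bond_index:bond_index + width]


# Illustrative sequence: one P4-P2' window per candidate scissile bond.
for bond_index, window in sliding_windows("MKDEVDGS"):
    print(bond_index, window)
```

Each yielded window is a candidate cleavage site; its residues (and the corresponding predicted structural states) are what the encoding step turns into a numeric feature vector.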
The procedure for generating the local sequence input features of PROSPER is illustrated in Figure 3. PROSPER comprises three modules: the input, the prediction and the output module. In the input module, the user's submitted protein sequence in FASTA format is processed: PSI-BLAST, PSIPRED, SCRATCH and DISOPRED are called, searching this sequence against the NCBI non-redundant (nr) database. In the prediction module, the generated matrix profiles of secondary structure, solvent accessibility and native disorder information, as well as the global sequence features, are used as the input for prediction. Finally, the output module summarizes the prediction results of substrate cleavage sites and sends them to the user's email address.
Figure 3. The process flow diagram of the PROSPER server.
Cleavage scoring of potential cleavage sites by machine learning techniques
The web interface is straightforward to use: the user only needs to input the query sequence in one-letter FASTA format. A typical task for a query sequence of ~500 residues normally takes 8-12 minutes. Once the prediction task is completed, the results will be displayed on the screen.
Substrate cleavage site prediction can be formulated as a binary classification problem, in which each candidate site is classified as either a cleavage site (positive) or a non-cleavage site (negative). In our case, we employed a machine learning technique, the support vector machine (SVM), to address the difficult task of predicting substrate cleavage sites of different proteases. SVM is an efficient classification algorithm suitable for solving binary or multi-class classification problems. As a supervised machine learning technique based on structural risk minimization from statistical learning theory (Vapnik, 2000), SVM distinguishes positive from negative samples by using so-called kernel functions to transform the data into a higher dimensional space, where two linearly non-separable classes of samples can become separable, and by constructing an optimal separating hyperplane in that space (Vapnik, 2000).
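The role of the kernel function can be illustrated with a small, self-contained sketch of the decision rule of a kernel machine. The Gaussian (RBF) kernel, the value of gamma, and the hand-picked support vectors and coefficients below are assumptions chosen for illustration; in a trained SVM they are obtained by solving the training optimization problem, which is not shown here.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, coefficients, bias, kernel=rbf_kernel):
    """Kernel expansion f(x) = sum_i c_i * K(s_i, x) + b.

    Each coefficient c_i stands for the signed weight (alpha_i * y_i)
    of support vector s_i.
    """
    return sum(c * kernel(s, x) for s, c in zip(support_vectors, coefficients)) + bias


def classify(x, support_vectors, coefficients, bias):
    """Return +1 (positive class) or -1 (negative class)."""
    return 1 if decision_value(x, support_vectors, coefficients, bias) >= 0 else -1


# XOR-like toy data: no straight line separates the two classes in the plane.
# Support vectors and coefficients are hand-picked for illustration, not learned.
support_vectors = [(0, 0), (1, 1), (0, 1), (1, 0)]
coefficients = [1.0, 1.0, -1.0, -1.0]
bias = 0.0

for point in support_vectors:
    print(point, classify(point, support_vectors, coefficients, bias))
```

With these values the points (0, 0) and (1, 1) are assigned to the positive class and (0, 1) and (1, 0) to the negative class, although the two classes are not linearly separable in the original two-dimensional space.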
We used support vector regression (SVR) to build the PROSPER models in order to quantitatively predict the substrate cleavage probability of different proteases. SVR is a supervised machine learning technique based on structural risk minimization from statistical learning theory, and it has an outstanding ability to predict the raw values of the tested samples. It is especially effective when the input data are characterized by high dimensionality and non-linear relationships. For the implementation of the SVR approach, we used the SVM_light package, an implementation of Vapnik's SVM for support vector classification, regression and pattern recognition.
Bi-profile Bayesian feature extraction
We exploited and evaluated several different types of sequence profiles based on the bi-profile Bayesian feature approach (Shao et al., 2009). This approach is particularly useful when dealing with an unbalanced dataset comprising a smaller number of positive samples and a greater number of negative samples. More technical details about bi-profile Bayesian signature extraction can be found in Shao et al. (2009). We integrated the bi-profile Bayesian signatures to build the PROSPER models and predict protease cleavage sites using different combinations of the following sequence and structural profiles:
(1) bi-profile Bayesian amino acid profile (BPBAA);
(2) bi-profile Bayesian secondary structure profile (BPBSS);
(3) bi-profile Bayesian solvent accessibility profile (BPBSA);
(4) bi-profile Bayesian disordered profile (BPBDISO).
Based on these profiles, we investigated the predictive performance of different combinations of profiles with increasing complexity of sequence features. This step-wise procedure can reveal the contribution of individual features to the predictive performance. Given a potential cleavage site, the feature vector for input into the SVR model is encoded by concatenating the constitutive features of the corresponding encoding scheme.
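The encoding and concatenation steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the PROSPER code: it assumes that the per-position posterior probabilities are estimated by the occurrence frequency of each symbol at each window position in the positive and in the negative training set, and the function names and toy training windows are hypothetical.

```python
from collections import Counter


def position_frequencies(windows):
    """Per-position relative frequency of each symbol in equal-length windows."""
    n = len(windows)
    width = len(windows[0])
    return [
        {symbol: count / n for symbol, count in Counter(w[i] for w in windows).items()}
        for i in range(width)
    ]


def bpb_encode(window, positive_freqs, negative_freqs):
    """Bi-profile vector: the n values from the positive profile followed by
    the n values from the negative profile (symbols never seen score 0.0)."""
    positive = [positive_freqs[i].get(s, 0.0) for i, s in enumerate(window)]
    negative = [negative_freqs[i].get(s, 0.0) for i, s in enumerate(window)]
    return positive + negative


def concatenate_profiles(*profiles):
    """Build the final input vector for one encoding scheme."""
    return [value for profile in profiles for value in profile]


# Toy amino acid windows (hypothetical training data, three positions wide).
aa_pos = position_frequencies(["DEV", "DEA", "DQV", "LEV"])
aa_neg = position_frequencies(["AKL", "DKL", "AEV", "AKG"])
# Toy secondary structure windows for the same sites (H, E, C states).
ss_pos = position_frequencies(["CCC", "CCE", "HCC", "CCC"])
ss_neg = position_frequencies(["HHH", "HHC", "EHH", "HHH"])

bpbaa = bpb_encode("DEV", aa_pos, aa_neg)
bpbss = bpb_encode("CCC", ss_pos, ss_neg)
print(concatenate_profiles(bpbaa, bpbss))
```

The same encoder is applied to each profile type; only the alphabet of the window changes (amino acids, secondary structure states, solvent accessibility states or disorder states). Adding one profile at a time to concatenate_profiles mirrors the step-wise comparison of encoding schemes.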