571e Single Sequence Secondary Structure Prediction for Globular Proteins

Ashwin Subramani and Christodoulos A. Floudas. Department of Chemical Engineering, Princeton University, Engineering Quadrangle, Olden Street, Princeton, NJ 08544

Secondary structure prediction is an important intermediate step in predicting the three-dimensional structure of a protein. It is particularly important in first-principles approaches to protein structure prediction, which do not use database information in the form of homology [1]. Most secondary structure prediction techniques assign each residue of a protein to one of three secondary structure classes. A number of methods are available in the literature. The most common ones use multiple sequence alignment techniques such as BLAST and PSI-BLAST [2] to derive profile information for the target protein, which is then fed as input to machine learning techniques such as neural networks [3] and SVMs [4].

We have developed an optimization-based method to predict the secondary structure of a target protein without the use of profile information. The method can therefore be applied to proteins for which sequence alignment tools do not produce reliable profile information. It combines two models: an α-helix prediction model, HELIOS (HELical prediction using Integer Optimization approacheS), and a β-strand prediction model, BEST-PRED (BEta STrand PREDiction). For α-helix prediction, a two-stage infeasibility minimization problem has been introduced. The first stage is a linear programming (LP) model for parameter estimation, while the second stage is an integer linear programming (ILP) model for helix prediction. The residues of a target protein are divided into four regions depending on their putative proximity to the helix termini, and the propensity of each residue to be in a helix is compared to a pre-evaluated, residue-dependent threshold propensity, using overlapping nonapeptides surrounding the central residue. BEST-PRED, for β-strand prediction, has been introduced as an ILP model that maximizes a residue's propensity to be in a β-strand. The protein is divided into overlapping pentapeptides. The β-strand propensity weight for the central residue is evaluated by a novel combination of Naïve Bayesian and first-order Markov models, which represent the physical nature of a β-strand. In both models, important mathematical constraints are introduced to ensure that the results are biologically meaningful. These constraints reflect the physical nature of the residues [5], along with the minimum and maximum secondary structure content [6]. A formulation of this kind not only provides the secondary structure prediction corresponding to the global minimum, but can also provide a rank-ordered list of the best solutions. Such a rank-ordered list can help in finding the most frequent prediction in a particular class for a given residue. Further, the formulation allows the user to easily add any form of prior knowledge about the secondary structure.

This method was tested on a set of α, β and mixed α/β proteins, and the preliminary results are very encouraging. A Q_α accuracy of 82% was obtained for purely α-helical proteins using HELIOS, while a Q_β accuracy of 78.9% was obtained for purely β proteins using BEST-PRED. These results compare very favorably with some of the standard secondary structure prediction servers.
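As an illustration of two ideas used above, the short Python sketch below shows how a sequence can be cut into overlapping nonapeptides or pentapeptides around each central residue, and how a per-class accuracy such as Q_α or Q_β might be computed. It is a minimal sketch under stated assumptions, not the HELIOS or BEST-PRED implementation: the function names are hypothetical, terminal residues without a full window are simply skipped, and the per-class accuracy is assumed to be the percentage of residues observed in a class that are also predicted in that class. The abstract does not specify either detail.

```python
def overlapping_windows(sequence, width):
    """Yield (center_index, window) for every full-width window.

    width=9 gives nonapeptides, width=5 gives pentapeptides.
    Assumption: residues too close to a terminus to have a full
    window are skipped; the abstract does not say how ends are handled.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a central residue")
    half = width // 2
    for center in range(half, len(sequence) - half):
        yield center, sequence[center - half:center + half + 1]


def per_class_accuracy(observed, predicted, state):
    """Percentage of residues observed in `state` that are predicted in `state`.

    Assumption: this is one common per-class definition; the abstract
    does not define Q_alpha or Q_beta explicitly.
    Returns None if `state` never occurs in `observed`.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    positions = [i for i, s in enumerate(observed) if s == state]
    if not positions:
        return None
    correct = sum(1 for i in positions if predicted[i] == state)
    return 100.0 * correct / len(positions)


if __name__ == "__main__":
    seq = "ACDEFGHIKLM"  # arbitrary 11-residue example, not a real protein
    for center, window in overlapping_windows(seq, 9):
        print(center, window)
    # Three-class strings: H = helix, E = strand, C = coil (illustrative labels)
    print(per_class_accuracy("CHHHHCC", "CHHHCCC", "H"))
```

In an optimization model of the kind described, each window would supply the propensity terms for its central residue; here the windows are only enumerated.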

[1] Klepeis J.L. and Floudas C.A., 2003, ASTRO-FOLD: a combinatorial and global optimization framework for ab initio prediction of three-dimensional structures of proteins from the amino acid sequence, Biophysical Journal, 85, 2119–2146.

[2] Altschul S.F., Gish W., Miller W., Myers E.W. and Lipman D.J., 1997, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, 25, 3389–3402.

[3] McGuffin L.J., Bryson K. and Jones D.T., 2000, The PSIPRED protein structure prediction server, Bioinformatics, 16, 404–405.

[4] Gassend B., O'Donnell C.W., Thiel W., Lee A., van Dijk M. and Devadas S., 2007, BMC Bioinformatics, 8, S3.

[5] Aurora R. and Rose G.D., 1998, Helix capping, Protein Science, 7, 21–38.

[6] Homaeian L., Kurgan L.A., Ruan J., Cios K.J. and Chen K., 2007, Prediction of protein secondary structure content for twilight zone sequences, Proteins, 69, 486–498.