
==Secondary structure==
{{Main|List of protein secondary structure prediction programs}}

'''Secondary structure prediction''' is a set of techniques in [[bioinformatics]] that aim to predict the local [[secondary structure]]s of [[protein]]s based only on knowledge of their [[amino acid]] sequence. For proteins, a prediction consists of assigning regions of the amino acid sequence as likely [[Alpha helix|alpha helices]], [[beta sheet|beta strand]]s (often termed ''extended'' conformations), or [[Turn (biochemistry)|turns]]. The success of a prediction is determined by comparing it to the results of the [[DSSP (algorithm)|DSSP]] algorithm (or a similar algorithm, such as [[STRIDE (algorithm)|STRIDE]]) applied to the [[X-ray crystallography|crystal structure]] of the protein. Specialized algorithms have been developed for the detection of specific well-defined patterns such as [[Transmembrane helix|transmembrane helices]] and [[coiled coil]]s in proteins.<ref name="Mount"/>

The best modern methods of secondary structure prediction in proteins have been reported to reach 80% accuracy by combining machine learning with [[sequence alignment]]s;<ref>{{cite book |vauthors=Pirovano W, Heringa J |title=Data Mining Techniques for the Life Sciences |chapter=Protein Secondary Structure Prediction |volume=609 |pages=327–48 |year=2010 |pmid=20221928 |doi=10.1007/978-1-60327-241-4_19 |isbn=978-1-60327-240-7 |series=Methods in Molecular Biology}}</ref> this high accuracy allows the predictions to be used as features for improving [[fold recognition]] and [[ab initio]] protein structure prediction, classification of [[structural motif]]s, and refinement of [[sequence alignment]]s. The accuracy of current protein secondary structure prediction methods is assessed in weekly [[Benchmark (computing)|benchmarks]] such as [[LiveBench]] and [[EVA (benchmark)|EVA]].
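
The accuracy figures quoted for these methods are usually per-residue, three-state scores (often called Q3). The following is a minimal sketch of how such a score might be computed. The function names, the example strings, and the particular mapping from DSSP's eight states to three are assumptions made for illustration; published benchmarks do not all use the same reduction.

```python
# One common reduction of DSSP's eight states to three (an assumed convention;
# published benchmarks differ in how they group the rarer states).
DSSP_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_states(dssp: str) -> str:
    """Map a DSSP assignment string to helix (H), strand (E), or coil (C)."""
    return "".join(DSSP_TO_THREE.get(code, "C") for code in dssp)


def q3_accuracy(predicted: str, reference: str) -> float:
    """Fraction of residues whose predicted state equals the reference state."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical three-state strings for a 16-residue protein.
reference = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHHCCCEEECCC"
print(q3_accuracy(predicted, reference))  # 14 of 16 residues agree -> 0.875
```

A single overall score of this kind hides per-state differences, which is why helix and strand accuracies are often also reported separately.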

===Background===
Early methods of secondary structure prediction, introduced in the 1960s and early 1970s,<ref>{{cite journal |vauthors=Guzzo AV |title=The influence of amino-acid sequence on protein structure |journal=Biophysical Journal |volume=5 |issue=6 |pages=809–22 |date=November 1965 |pmid=5884309 |pmc=1367904 |doi=10.1016/S0006-3495(65)86753-4 |bibcode=1965BpJ.....5..809G}}</ref><ref>
{{cite journal |vauthors=Prothero JW |title=Correlation between the distribution of amino acids and alpha helices |journal=Biophysical Journal |volume=6 |issue=3 |pages=367–70 |date=May 1966 |pmid=5962284 |pmc=1367951 |doi=10.1016/S0006-3495(66)86662-6 |bibcode=1966BpJ.....6..367P}}</ref><ref>
{{cite journal |vauthors=Schiffer M, Edmundson AB |title=Use of helical wheels to represent the structures of proteins and to identify segments with helical potential |journal=Biophysical Journal |volume=7 |issue=2 |pages=121–35 |date=March 1967 |pmid=6048867 |pmc=1368002 |doi=10.1016/S0006-3495(67)86579-2 |bibcode=1967BpJ.....7..121S}}</ref><ref>
{{cite journal |vauthors=Kotelchuck D, Scheraga HA |title=The influence of short-range interactions on protein conformation. II. A model for predicting the alpha-helical regions of proteins |journal=Proceedings of the National Academy of Sciences of the United States of America |volume=62 |issue=1 |pages=14–21 |date=January 1969 |pmid=5253650 |pmc=285948 |doi=10.1073/pnas.62.1.14 |bibcode=1969PNAS...62...14K |doi-access=free}}</ref><ref>
{{cite journal |vauthors=Lewis PN, Go N, Go M, Kotelchuck D, Scheraga HA |title=Helix probability profiles of denatured proteins and their correlation with native structures |journal=Proceedings of the National Academy of Sciences of the United States of America |volume=65 |issue=4 |pages=810–5 |date=April 1970 |pmid=5266152 |pmc=282987 |doi=10.1073/pnas.65.4.810 |bibcode=1970PNAS...65..810L |doi-access=free}}</ref> focused on identifying likely alpha helices and were based mainly on [[helix-coil transition model]]s.<ref name="Froimowitz">{{cite journal |vauthors=Froimowitz M, Fasman GD |title=Prediction of the secondary structure of proteins using the helix-coil transition theory |journal=Macromolecules |volume=7 |issue=5 |pages=583–9 |year=1974 |pmid=4371089 |doi=10.1021/ma60041a009 |bibcode=1974MaMol...7..583F}}</ref> Significantly more accurate predictions that included beta sheets were introduced in the 1970s and relied on statistical assessments based on probability parameters derived from known solved structures. These methods, applied to a single sequence, are typically at most about 60–65% accurate, and often underpredict beta sheets.<ref name="Mount"/> Since the 1980s, [[artificial neural networks]] have been applied to the prediction of protein structures.<ref>{{cite journal |last1=Qian |first1=Ning |last2=Sejnowski |first2=Terry J. 
|author2-link=Terry Sejnowski |year=1988 |title=Predicting the secondary structure of globular proteins using neural network models.|url=http://www.columbia.edu/~nq6/publications/protein.pdf|journal=Journal of Molecular Biology|volume=202|issue=4|pages=865–884|doi=10.1016/0022-2836(88)90564-5|pmid=3172241|id=Qian1988}}</ref><ref>{{cite journal|last1=Rost|first1=Burkhard|author-link1=Burkhard Rost|last2=Sander|first2=Chris|year=1993|title=Prediction of protein secondary structure at better than 70% accuracy|url=http://www.cs.albany.edu/~berg/sta650/Assignments/RostSander93.pdf|journal=Journal of Molecular Biology|volume=232|issue=2|pages=584–599|doi=10.1006/jmbi.1993.1413|pmid=8345525|id=Rost1993|access-date=2023-04-20|archive-date=2019-01-31|archive-url=https://web.archive.org/web/20190131040806/http://www.cs.albany.edu/~berg/sta650/Assignments/RostSander93.pdf|url-status=dead}}</ref>
The [[evolution]]ary [[conservation (genetics)|conservation]] of secondary structures can be exploited by simultaneously assessing many [[Sequence homology|homologous sequences]] in a [[multiple sequence alignment]], by calculating the net secondary structure propensity of an aligned column of amino acids. In concert with larger databases of known protein structures and modern [[machine learning]] methods such as [[artificial neural network|neural nets]] and [[support vector machine]]s, these methods can achieve up to 80% overall accuracy in [[globular protein]]s.<ref name="Dor">{{cite journal |vauthors=Dor O, Zhou Y |title=Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training |journal=Proteins |volume=66 |issue=4 |pages=838–45 |date=March 2007 |pmid=17177203 |doi=10.1002/prot.21298 |s2cid=14759081}}</ref> The theoretical upper limit of accuracy is around 90%,<ref name="Dor"/> partly due to idiosyncrasies in DSSP assignment near the ends of secondary structures, where local conformations vary under native conditions but may be forced to assume a single conformation in crystals due to packing constraints. Moreover, the typical secondary structure prediction methods do not account for the influence of [[tertiary structure]] on formation of secondary structure; for example, a sequence predicted as a likely helix may still be able to adopt a beta-strand conformation if it is located within a beta-sheet region of the protein and its side chains pack well with their neighbors. Dramatic conformational changes related to the protein's function or environment can also alter local secondary structure.
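
The idea of a net propensity for an aligned column can be sketched as a simple average over the residues in that column. The propensity values below are hypothetical placeholders and not published parameters, and the choices to ignore gaps and to use an unweighted mean are assumptions of this sketch; actual methods typically weight sequences and feed the resulting profile to a trained classifier.

```python
# Hypothetical helix propensities, for illustration only (not published values).
HELIX_PROPENSITY = {"A": 1.4, "E": 1.5, "L": 1.2, "G": 0.6, "P": 0.6, "V": 1.0}


def column_propensity(column, table, default=1.0):
    """Average propensity of the residues in one alignment column, ignoring gaps."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return default
    return sum(table.get(aa, default) for aa in residues) / len(residues)


def propensity_profile(alignment, table):
    """Return one averaged propensity per column of a multiple sequence alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*alignment) iterates over the alignment column by column.
    return [column_propensity(column, table) for column in zip(*alignment)]


# Toy alignment of three homologous sequences ("-" marks a gap).
alignment = ["AELG", "AEVG", "A-LP"]
print(propensity_profile(alignment, HELIX_PROPENSITY))
```

Averaging over homologs dampens the effect of any single unusual residue, which is one intuition for why alignment-based methods outperform single-sequence ones.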

===Historical perspective===
To date, over 20 different secondary structure prediction methods have been developed. One of the first algorithms was the [[Chou–Fasman method]], which relies predominantly on probability parameters determined from relative frequencies of each amino acid's appearance in each type of secondary structure.<ref name="Chou">{{cite journal |vauthors=Chou PY, Fasman GD |title=Prediction of protein conformation |journal=Biochemistry |volume=13 |issue=2 |pages=222–45 |date=January 1974 |pmid=4358940 |doi=10.1021/bi00699a002}}</ref> The original Chou–Fasman parameters, determined from the small sample of structures solved in the mid-1970s, produce poor results compared to modern methods, though the parameterization has been updated since it was first published. The Chou–Fasman method is roughly 50–60% accurate in predicting secondary structures.<ref name="Mount"/>
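
Propensity parameters of this kind can be illustrated by dividing the fraction of an amino acid's occurrences found in a given state by the fraction of all residues found in that state. The sketch below follows that definition on a toy data set; the function name and the example sequences are hypothetical, and it does not reproduce the published Chou–Fasman parameters or the method's rules for nucleating and extending segments.

```python
from collections import Counter


def propensities(pairs):
    """Compute propensity values from (sequence, structure) string pairs.

    The value for (amino acid, state) is the fraction of that amino acid
    observed in the state, divided by the fraction of all residues in the state.
    A value above 1 means the amino acid is over-represented in that state.
    """
    aa_counts = Counter()
    state_counts = Counter()
    joint_counts = Counter()
    for sequence, structure in pairs:
        if len(sequence) != len(structure):
            raise ValueError("sequence and structure must have the same length")
        for aa, state in zip(sequence, structure):
            aa_counts[aa] += 1
            state_counts[state] += 1
            joint_counts[aa, state] += 1
    total = sum(aa_counts.values())
    return {
        (aa, state): (joint_counts[aa, state] / aa_counts[aa])
        / (state_counts[state] / total)
        for aa in aa_counts
        for state in state_counts
    }


# Toy "solved structures": H = helix, C = coil (hypothetical data).
toy_data = [("AAGA", "HHCH"), ("GAGG", "CHHC")]
table = propensities(toy_data)
print(table[("A", "H")], table[("G", "H")], table[("G", "C")])
```

In this toy set alanine is always helical and glycine mostly is not, so their helix values fall above and below 1 respectively. With so few observations the numbers are not meaningful, which mirrors the small-sample problem of the original parameters noted above.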

The next notable program was the [[GOR method]], an [[information theory]]-based method that uses the more powerful probabilistic technique of [[Bayesian inference]].<ref name="Garnier">{{cite journal |vauthors=Garnier J, Osguthorpe DJ, Robson B |title=Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins |journal=Journal of Molecular Biology |volume=120 |issue=1 |pages=97–120 |date=March 1978 |pmid=642007 |doi=10.1016/0022-2836(78)90297-8}}</ref> The GOR method takes into account not only the probability of each amino acid having a particular secondary structure, but also the [[conditional probability]] of the amino acid assuming each structure given the contributions of its neighbors (it does not assume that the neighbors have that same structure). The approach is both more sensitive and more accurate than that of Chou and Fasman because amino acid structural propensities are only strong for a small number of amino acids such as [[proline]] and [[glycine]], whereas weak contributions from each of many neighbors can add up to strong effects overall. The original GOR method was roughly 65% accurate and was dramatically more successful in predicting alpha helices than beta sheets, which it frequently mispredicted as loops or disorganized regions.<ref name="Mount"/>
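
The way weak neighbor contributions accumulate can be sketched as a sum of per-position information values over a window centered on the residue being predicted. The window of eight residues on each side (17 in total) follows the usual description of the original GOR method but is not stated in the text above. The information values below are hypothetical placeholders, not the published GOR tables, and the function name is likewise an assumption.

```python
WINDOW = 8  # residues considered on each side of the central position


def predict(sequence, info, states=("H", "E", "C")):
    """Assign each residue the state with the largest summed information.

    `info` maps (state, offset, amino_acid) to an information value: the
    contribution that the amino acid at `offset` from the central residue
    makes toward the central residue adopting `state`. Missing entries
    count as zero.
    """
    prediction = []
    for j in range(len(sequence)):
        scores = {}
        for state in states:
            total = 0.0
            for offset in range(-WINDOW, WINDOW + 1):
                k = j + offset
                if 0 <= k < len(sequence):  # skip positions beyond the chain ends
                    total += info.get((state, offset, sequence[k]), 0.0)
            scores[state] = total
        prediction.append(max(states, key=lambda s: scores[s]))
    return "".join(prediction)


# Hypothetical information values (illustrative only).
toy_info = {
    ("H", 0, "A"): 0.5, ("H", -1, "A"): 0.2, ("H", 1, "A"): 0.2,
    ("E", 0, "V"): 0.5, ("E", -1, "V"): 0.2, ("E", 1, "V"): 0.2,
    ("C", 0, "G"): 0.6, ("C", 0, "P"): 0.6,
}
print(predict("AAAGVVV", toy_info))  # HHHCEEE
```

Note that each neighbor contributes according to its own identity and offset, not according to a predicted state, which matches the statement that the method does not assume the neighbors share the central residue's structure.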

Another big step forward was the use of [[machine learning]] methods. [[Artificial neural network]]s were applied first. They use solved structures as training sets to identify common sequence motifs associated with particular arrangements of secondary structures. These methods are over 70% accurate in their predictions, although beta strands are still often underpredicted due to the lack of three-dimensional structural information that would allow assessment of [[hydrogen bonding]] patterns that can promote formation of the extended conformation required for the presence of a complete beta sheet.<ref name="Mount"/> [[Psipred|PSIPRED]] and [[Jpred|JPRED]] are among the best-known neural network-based programs for protein secondary structure prediction. Later, [[support vector machine]]s proved particularly useful for predicting the locations of [[turn (biochemistry)|turns]], which are difficult to identify with statistical methods.<ref name="Pham">{{cite journal |vauthors=Pham TH, Satou K, Ho TB |title=Support vector machines for prediction and analysis of beta and gamma-turns in proteins |journal=Journal of Bioinformatics and Computational Biology |volume=3 |issue=2 |pages=343–58 |date=April 2005 |pmid=15852509 |doi=10.1142/S0219720005001089}}</ref><ref name="Zhang">{{cite journal |vauthors=Zhang Q, Yoon S, Welsh WJ |title=Improved method for predicting beta-turn using support vector machine |journal=Bioinformatics |volume=21 |issue=10 |pages=2370–4 |date=May 2005 |pmid=15797917 |doi=10.1093/bioinformatics/bti358 |doi-access=}}</ref>
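
Training such a network requires turning each residue and its sequence neighbors into a fixed-length numeric input. One widely used encoding, assumed here for illustration and not attributed to any particular program named above, is a sliding window in which every position is represented by a one-hot vector over the 20 standard amino acids plus one extra unit for positions beyond the chain ends. The window size and all names in this sketch are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
SLOT = len(AMINO_ACIDS) + 1  # 20 amino acids + 1 unit for off-chain padding


def encode_window(sequence, center, half_width=6):
    """One-hot encode the window of residues centered on `center`.

    Returns a flat list of (2 * half_width + 1) * SLOT zeros and ones.
    Nonstandard residue letters raise KeyError in this sketch.
    """
    features = []
    for k in range(center - half_width, center + half_width + 1):
        slot = [0] * SLOT
        if 0 <= k < len(sequence):
            slot[INDEX[sequence[k]]] = 1
        else:
            slot[-1] = 1  # position lies before the first or after the last residue
        features.extend(slot)
    return features


def training_examples(sequence, structure, half_width=6):
    """Pair each residue's encoded window with its known secondary structure state."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return [
        (encode_window(sequence, i, half_width), structure[i])
        for i in range(len(sequence))
    ]


examples = training_examples("ACDEFG", "CHHHEC", half_width=1)
print(len(examples), len(examples[0][0]))  # 6 examples, 63 inputs each
```

Each (input, label) pair produced this way is one training example. Because the window covers only nearby residues, an encoding like this cannot capture the long-range hydrogen bonding between strands mentioned above, which is consistent with the underprediction of beta strands.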

Extensions of machine learning techniques attempt to predict more fine-grained local properties of proteins, such as [[protein backbone|backbone]] [[dihedral angle]]s in unassigned regions. Both SVMs<ref name="Zimmermann">{{cite journal |vauthors=Zimmermann O, Hansmann UH |title=Support vector machines for prediction of dihedral angle regions |journal=Bioinformatics |volume=22 |issue=24 |pages=3009–15 |date=December 2006 |pmid=17005536 |doi=10.1093/bioinformatics/btl489 |doi-access=}}</ref> and neural networks<ref name="Kuang">{{cite journal |vauthors=Kuang R, Leslie CS, Yang AS |title=Protein backbone angle prediction with machine learning approaches |journal=Bioinformatics |volume=20 |issue=10 |pages=1612–21 |date=July 2004 |pmid=14988121 |doi=10.1093/bioinformatics/bth136 |doi-access=free}}</ref> have been applied to this problem.<ref name="Pham"/> More recently, real-value torsion angles can be accurately predicted by SPINE-X and successfully employed for ab initio structure prediction.<ref name="torsion">{{cite journal |vauthors=Faraggi E, Yang Y, Zhang S, Zhou Y |title=Predicting continuous local structure and the effect of its substitution for secondary structure in fragment-free protein structure prediction |journal=Structure |volume=17 |issue=11 |pages=1515–27 |date=November 2009 |pmid=19913486 |pmc=2778607 |doi=10.1016/j.str.2009.09.006}}</ref>

===Other improvements===
Secondary structure formation has been reported to depend on factors other than the protein sequence. For example, secondary structure tendencies also depend on the local environment,<ref name="a0">{{cite journal |vauthors=Zhong L, Johnson WC |title=Environment affects amino acid preference for secondary structure |journal=Proceedings of the National Academy of Sciences of the United States of America |volume=89 |issue=10 |pages=4462–5 |date=May 1992 |pmid=1584778 |pmc=49102 |doi=10.1073/pnas.89.10.4462 |bibcode=1992PNAS...89.4462Z |doi-access=free}}</ref> solvent accessibility of residues,<ref name="a1">{{cite journal |vauthors=Macdonald JR, Johnson WC |title=Environmental features are important in determining protein secondary structure |journal=Protein Science |volume=10 |issue=6 |pages=1172–7 |date=June 2001 |pmid=11369855 |pmc=2374018 |doi=10.1110/ps.420101}}</ref> protein structural class,<ref name="a2">{{cite journal |vauthors=Costantini S, Colonna G, Facchiano AM |title=Amino acid propensities for secondary structures are influenced by the protein structural class |journal=Biochemical and Biophysical Research Communications |volume=342 |issue=2 |pages=441–51 |date=April 2006 |pmid=16487481 |doi=10.1016/j.bbrc.2006.01.159}}</ref> and even the organism from which the proteins are obtained.<ref name="a3">{{cite journal |vauthors=Marashi SA, Behrouzi R, Pezeshk H |title=Adaptation of proteins to different environments: a comparison of proteome structural properties in Bacillus subtilis and Escherichia coli |journal=Journal of Theoretical Biology |volume=244 |issue=1 |pages=127–32 |date=January 2007 |pmid=16945389 |doi=10.1016/j.jtbi.2006.07.021 |bibcode=2007JThBi.244..127M}}</ref> Based on such observations, some studies have shown that secondary structure prediction can be improved by adding information about protein structural class,<ref name="m">{{cite journal |vauthors=Costantini S, Colonna G, Facchiano AM |title=PreSSAPro: a 
software for the prediction of secondary structure by amino acid properties |journal=Computational Biology and Chemistry |volume=31 |issue=5–6 |pages=389–92 |date=October 2007 |pmid=17888742 |doi=10.1016/j.compbiolchem.2007.08.010}}</ref> residue accessible surface area<ref name="P">{{cite journal |vauthors=Momen-Roknabadi A, Sadeghi M, Pezeshk H, Marashi SA |title=Impact of residue accessible surface area on the prediction of protein secondary structures |journal=BMC Bioinformatics |volume=9 |pages=357 |date=August 2008 |pmid=18759992 |pmc=2553345 |doi=10.1186/1471-2105-9-357 |doi-access=free}}</ref><ref name="Ph">{{cite journal |vauthors=Adamczak R, Porollo A, Meller J |title=Combining prediction of secondary structure and solvent accessibility in proteins |journal=Proteins |volume=59 |issue=3 |pages=467–75 |date=May 2005 |pmid=15768403 |doi=10.1002/prot.20441 |s2cid=13267624}}</ref> and also [[contact number]] information.<ref name="az">{{cite journal |vauthors=Lakizadeh A, Marashi SA |year=2009 |url=http://www.excli.de/vol8/lakizadeh_03_2009/lakizadeh_250309a_proof.pdf |title=Addition of contact number information can improve protein secondary structure prediction by neural networks |journal=Excli J. |volume=8 |pages=66–73}}</ref>