==Tertiary structure==
{{Main|homology modeling|fold recognition}}
The practical role of protein structure prediction is now more important than ever.<ref>{{Cite journal|last1=Dorn|first1=Márcio|last2=e Silva|first2=Mariel Barbachan|last3=Buriol|first3=Luciana S.|last4=Lamb|first4=Luis C.|date=2014-12-01|title=Three-dimensional protein structure prediction: Methods and computational strategies|url=http://www.sciencedirect.com/science/article/pii/S1476927114001248|journal=Computational Biology and Chemistry|language=en|volume=53|pages=251–276|doi=10.1016/j.compbiolchem.2014.10.001|pmid=25462334|issn=1476-9271|url-access=subscription}}</ref> Massive amounts of protein sequence data are produced by modern large-scale [[DNA]] sequencing efforts such as the [[Human Genome Project]]. Despite community-wide efforts in [[structural genomics]], the output of experimentally determined protein structures—typically by time-consuming and relatively expensive [[X-ray crystallography]] or [[Protein NMR|NMR spectroscopy]]—is lagging far behind the output of protein sequences.

Protein structure prediction remains an extremely difficult and unresolved undertaking. The two main problems are the calculation of [[Gibbs free energy|protein free energy]] and [[energy minimization|finding the global minimum]] of this energy. A protein structure prediction method must explore the space of possible protein structures, which is [[Levinthal's paradox|astronomically large]]. These problems can be partially bypassed in "comparative" or [[homology modeling]] and [[fold recognition]] methods, in which the search space is pruned by the assumption that the protein in question adopts a structure close to the experimentally determined structure of a homologous protein. In contrast, [[de novo protein structure prediction]] methods must explicitly resolve these problems. The progress and challenges in protein structure prediction have been reviewed by Zhang.<ref name="zhang2008"/>
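
The scale of this search space can be illustrated with a back-of-the-envelope estimate in the spirit of Levinthal's paradox. The sketch below is illustrative only: the choice of three conformational states per residue and a sampling rate of 10<sup>13</sup> conformations per second are assumptions made for the example, not values taken from the cited sources.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def conformation_count(n_residues, states_per_residue=3):
    """Count chain conformations if each residue independently adopts one of a few states."""
    return states_per_residue ** n_residues


def exhaustive_search_years(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Years needed to enumerate every conformation at a fixed (assumed) sampling rate."""
    total = conformation_count(n_residues, states_per_residue)
    return total / samples_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(f"{n:>3} residues: {conformation_count(n):.3e} conformations, "
              f"{exhaustive_search_years(n):.3e} years to enumerate")
```

Under these assumptions the count grows exponentially with chain length, which is why practical methods prune or guide the search rather than enumerate it.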

===Before modelling===
Most tertiary structure modelling methods, such as Rosetta, are optimized for modelling the tertiary structure of single protein domains. A step called '''domain parsing''', or '''domain boundary prediction''', is usually done first to split a protein into potential structural domains. As with the rest of tertiary structure prediction, this can be done comparatively from known structures<ref>{{cite journal |vauthors=Ovchinnikov S, Kim DE, Wang RY, Liu Y, DiMaio F, Baker D |title=Improved de novo structure prediction in CASP11 by incorporating coevolution information into Rosetta |journal=Proteins |volume=84 |pages=67–75 |date=September 2016 |issue=Suppl 1 |pmid=26677056 |doi=10.1002/prot.24974 |pmc=5490371}}</ref> or ''ab initio'' with the sequence only (usually by [[machine learning]], assisted by covariation).<ref>{{cite journal |vauthors=Hong SH, Joo K, Lee J |title=ConDo: Protein domain boundary prediction using coevolutionary information |journal=Bioinformatics |volume=35 |issue=14 |pages=2411–2417 |date=November 2018 |pmid=30500873 |doi=10.1093/bioinformatics/bty973}}</ref> The structures for individual domains are docked together in a process called '''domain assembly''' to form the final tertiary structure.<ref>{{cite journal |vauthors=Wollacott AM, Zanghellini A, Murphy P, Baker D |title=Prediction of structures of multidomain proteins from structures of the individual domains |journal=Protein Science |volume=16 |issue=2 |pages=165–75 |date=February 2007 |pmid=17189483 |doi=10.1110/ps.062270707 |pmc=2203296}}</ref><ref>{{cite journal |vauthors=Xu D, Jaroszewski L, Li Z, Godzik A |title=AIDA: ab initio domain assembly for automated multi-domain protein structure prediction and domain-domain interaction prediction |journal=Bioinformatics |volume=31 |issue=13 |pages=2098–105 |date=July 2015 |pmid=25701568 |doi=10.1093/bioinformatics/btv092 |pmc=4481839}}</ref>

===''Ab initio'' protein modelling===
{{Main|De novo protein structure prediction}}

====Energy- and fragment-based methods====
''Ab initio'' or ''de novo'' protein modelling methods seek to build three-dimensional protein models "from scratch", i.e., based on physical principles rather than (directly) on previously solved structures. There are many possible procedures that either attempt to mimic [[protein folding]] or apply some [[stochastic]] method to search possible solutions (i.e., [[global optimization]] of a suitable energy function). These procedures tend to require vast computational resources, and have thus only been carried out for very small proteins. Predicting protein structure ''de novo'' for larger proteins will require better algorithms and larger computational resources, such as those afforded by powerful supercomputers (such as [[Blue Gene]] or [[MDGRAPE-3]]) or distributed computing (such as [[Folding@home]], the [[Human Proteome Folding Project]] and [[Rosetta@Home]]). Although these computational barriers are vast, the potential benefits of structural genomics (by predicted or experimental methods) make ''ab initio'' structure prediction an active research field.<ref name="zhang2008">{{cite journal |vauthors=Zhang Y |title=Progress and challenges in protein structure prediction |journal=Current Opinion in Structural Biology |volume=18 |issue=3 |pages=342–8 |date=June 2008 |pmid=18436442 |pmc=2680823 |doi=10.1016/j.sbi.2008.02.004}}</ref>
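
A stochastic search of an energy function can be sketched as a Metropolis Monte Carlo run with simulated annealing. The example below is a toy illustration, not the procedure of any particular program: the "conformation" is a list of torsion-like angles, and <code>toy_energy</code> is a hypothetical rugged function invented for the example rather than a physical force field.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical rugged energy: a broad well per angle plus ripples (not a force field)."""
    return sum(1.0 - math.cos(a - 1.0) + 0.3 * math.cos(5.0 * a) for a in angles)


def simulated_annealing(n_angles=10, steps=20000, t_start=2.0, t_end=0.01, seed=0):
    """Metropolis Monte Carlo with a geometric cooling schedule; returns the best state seen."""
    rng = random.Random(seed)
    angles = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    energy = toy_energy(angles)
    best_angles, best_energy = list(angles), energy
    for step in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Propose a random perturbation of one angle.
        i = rng.randrange(n_angles)
        old_value = angles[i]
        angles[i] = old_value + rng.gauss(0.0, 0.5)
        delta = toy_energy(angles) - energy
        # Metropolis criterion: always accept downhill moves, sometimes accept uphill ones.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy += delta
            if energy < best_energy:
                best_angles, best_energy = list(angles), energy
        else:
            angles[i] = old_value  # reject the move
    return best_angles, best_energy


if __name__ == "__main__":
    _, e = simulated_annealing()
    print(f"lowest energy found: {e:.3f}")
```

Accepting occasional uphill moves at high temperature lets the search escape local minima, at the cost of many energy evaluations; this is one reason such methods become expensive for larger proteins.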

As of 2009, a 50-residue protein could be simulated atom-by-atom on a supercomputer for 1 millisecond.<ref name="ShawBowers2009">{{cite conference| vauthors=Shaw DE, Dror RO, Salmon JK, Grossman JP, Mackenzie KM, Bank JA, Young C, Deneroff MM, Batson B, Bowers KJ, Chow E |conference=Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis – SC '09 |year=2009|pages=1|doi=10.1145/1654059.1654126|title=Millisecond-scale molecular dynamics simulations on Anton|isbn=9781605587448|doi-access=}}</ref> As of 2012, comparable stable-state sampling could be done on a standard desktop with a new graphics card and more sophisticated algorithms.<ref name="PierceSalomon-Ferrer2012">{{cite journal |vauthors=Pierce LC, Salomon-Ferrer R, de Oliveira CA, McCammon JA, Walker RC |title=Routine Access to Millisecond Time Scale Events with Accelerated Molecular Dynamics |journal=Journal of Chemical Theory and Computation |volume=8 |issue=9 |pages=2997–3002 |date=September 2012 |pmid=22984356 |pmc=3438784 |doi=10.1021/ct300284c}}</ref> Much longer simulation timescales can be achieved using [[coarse-grained modeling]].<ref>{{cite journal |vauthors=Kmiecik S, Gront D, Kolinski M, Wieteska L, Dawid AE, Kolinski A |title=Coarse-Grained Protein Models and Their Applications |journal=Chemical Reviews |volume=116 |issue=14 |pages=7898–936 |date=July 2016 |pmid=27333362 |doi=10.1021/acs.chemrev.6b00163 |doi-access=free}}</ref><ref name="denovo2018">{{cite journal |vauthors=Cheung NJ, Yu W |title=De novo protein structure prediction using ultra-fast molecular dynamics simulation |journal=PLOS ONE |volume=13| issue=11 |pages=e0205819 |date=November 2018 |pmid=30458007 |pmc=6245515 |doi=10.1371/journal.pone.0205819 |bibcode=2018PLoSO..1305819C |doi-access=free}}</ref>

====Evolutionary covariation to predict 3D contacts====
As sequencing became more commonplace in the 1990s, several groups used protein sequence alignments to predict correlated [[mutation]]s, in the hope that these coevolved residues could be used to predict tertiary structure (by analogy to distance constraints from experimental procedures such as [[NMR]]). The assumption is that when single-residue mutations are slightly deleterious, compensatory mutations may occur to restabilize residue-residue interactions.
This early work used what are known as ''local'' methods to calculate correlated mutations from protein sequences, but suffered from indirect false correlations which result from treating each pair of residues as independent of all other pairs.<ref>{{cite journal |vauthors=Göbel U, Sander C, Schneider R, Valencia A |title=Correlated mutations and residue contacts in proteins |journal=Proteins |volume=18 |issue=4 |pages=309–17 |date=April 1994 |pmid=8208723 |doi=10.1002/prot.340180402 |s2cid=14978727}}</ref><ref>{{cite journal |vauthors=Taylor WR, Hatrick K |title=Compensating changes in protein multiple sequence alignments |journal=Protein Engineering |volume=7 |issue=3 |pages=341–8 |date=March 1994 |pmid=8177883 |doi=10.1093/protein/7.3.341}}</ref><ref>{{cite journal |vauthors=Neher E |title=How frequent are correlated changes in families of protein sequences? |journal=Proceedings of the National Academy of Sciences of the United States of America |volume=91 |issue=1 |pages=98–102 |date=January 1994 |pmid=8278414 |pmc=42893 |doi=10.1073/pnas.91.1.98 |bibcode=1994PNAS...91...98N |doi-access=free}}</ref>
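
A local covariation score of this kind can be sketched with mutual information between two alignment columns. Mutual information is used here only as one common example of a pairwise measure; it is not claimed to be the exact statistic of the cited studies, and the four-sequence alignment is invented for illustration.

```python
import math
from collections import Counter


def column(alignment, index):
    """Extract one column (one residue position) from a list of aligned sequences."""
    return [sequence[index] for sequence in alignment]


def mutual_information(alignment, i, j):
    """Mutual information (in nats) between alignment columns i and j.

    Each pair of columns is scored in isolation, which is what makes the
    method 'local': couplings mediated by a third position are not removed.
    """
    n = len(alignment)
    col_i, col_j = column(alignment, i), column(alignment, j)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        mi += p_ab * math.log(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


if __name__ == "__main__":
    # Hypothetical toy alignment: columns 0 and 2 change together, column 1 is conserved.
    toy_alignment = ["ALD", "ALD", "KLE", "KLE"]
    print(mutual_information(toy_alignment, 0, 2))  # covarying pair
    print(mutual_information(toy_alignment, 0, 1))  # conserved column carries no signal
```

Because each pair is scored independently, two positions that both covary with a third will also appear correlated with each other, which is the source of the indirect false correlations described above.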

In 2011, a different, ''global'' statistical approach demonstrated that predicted coevolved residues were sufficient to predict the 3D fold of a protein, provided that enough sequences are available (>1,000 homologous sequences are needed).<ref name="marks">{{cite journal |vauthors=Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C |title=Protein 3D structure computed from evolutionary sequence variation |journal=PLOS ONE |volume=6 |issue=12 |pages=e28766 |year=2011 |pmid=22163331 |pmc=3233603 |doi=10.1371/journal.pone.0028766 |bibcode=2011PLoSO...628766M |doi-access=free}}</ref> The method, [http://evfold.org EVfold], uses no homology modeling, threading or 3D structure fragments and can be run on a standard personal computer even for proteins with hundreds of residues. The accuracy of the contacts predicted using this and related approaches has now been demonstrated on many known structures and contact maps,<ref>{{cite journal |vauthors=Burger L, van Nimwegen E |title=Disentangling direct from indirect co-evolution of residues in protein alignments |journal=PLOS Computational Biology |volume=6 |issue=1 |pages=e1000633 |date=January 2010 |pmid=20052271 |pmc=2793430 |doi=10.1371/journal.pcbi.1000633 |bibcode=2010PLSCB...6E0633B |doi-access=free}}</ref><ref>{{cite journal |vauthors=Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, Zecchina R, Onuchic JN, Hwa T, Weigt M |title=Direct-coupling analysis of residue coevolution captures native contacts across many protein families |journal=Proceedings of the National Academy of Sciences of the United States of America |volume=108 |issue=49 |pages=E1293-301 |date=December 2011 |pmid=22106262 |pmc=3241805 |doi=10.1073/pnas.1111471108 |arxiv=1110.5223 |bibcode=2011PNAS..108E1293M |doi-access=free}}</ref><ref>{{cite journal |vauthors=Nugent T, Jones DT |title=Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis |journal=Proceedings of the National Academy of Sciences of the United States of America |volume=109 |issue=24 |pages=E1540-7 |date=June 2012 |pmid=22645369 |pmc=3386101 |doi=10.1073/pnas.1120036109 |bibcode=2012PNAS..109E1540N |doi-access=free}}</ref> including the prediction of experimentally unsolved transmembrane proteins.<ref>{{cite journal |vauthors=Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks DS |title=Three-dimensional structures of membrane proteins from genomic sequencing |journal=Cell |volume=149 |issue=7 |pages=1607–21 |date=June 2012 |pmid=22579045 |pmc=3641781 |doi=10.1016/j.cell.2012.04.012}}</ref>

===Comparative protein modeling===

Comparative protein modeling uses previously solved structures as starting points, or templates. This is effective because, although the number of actual proteins is vast, there appears to be a limited set of [[tertiary structure|tertiary]] [[structural motif]]s to which most proteins belong. It has been suggested that there are only around 2,000 distinct protein folds in nature, though there are many millions of different proteins. Comparative protein modeling can also be combined with evolutionary covariation in structure prediction.<ref>{{cite journal |last1=Jin |first1=Shikai |last2=Chen |first2=Mingchen |last3=Chen |first3=Xun |last4=Bueno |first4=Carlos |last5=Lu |first5=Wei |last6=Schafer |first6=Nicholas P. |last7=Lin |first7=Xingcheng |last8=Onuchic |first8=José N. |last9=Wolynes |first9=Peter G. |title=Protein Structure Prediction in CASP13 Using AWSEM-Suite |journal=Journal of Chemical Theory and Computation |date=9 June 2020 |volume=16 |issue=6 |pages=3977–3988 |doi=10.1021/acs.jctc.0c00188|pmid=32396727 |s2cid=218618842}}</ref>

These methods may be split into two groups:<ref name="zhang2008"/>
* [[Homology modeling]] is based on the reasonable assumption that two [[Homology (biology)#Homology of sequences in genetics|homologous]] proteins will share very similar structures. Because a protein's fold is more evolutionarily conserved than its amino acid sequence, a target sequence can be modeled with reasonable accuracy on a very distantly related template, provided that the relationship between target and template can be discerned through [[sequence alignment]]. It has been suggested that the primary bottleneck in comparative modelling arises from difficulties in alignment rather than from errors in structure prediction given a known-good alignment.<ref name="zhang2005">{{cite journal |vauthors=Zhang Y, Skolnick J |title=The protein structure prediction problem could be solved using the current PDB library |journal=Proceedings of the National Academy of Sciences of the United States of America |volume=102 |issue=4 |pages=1029–34 |date=January 2005 |pmid=15653774 |pmc=545829 |doi=10.1073/pnas.0407152101 |bibcode=2005PNAS..102.1029Z |doi-access=free}}</ref> Unsurprisingly, homology modelling is most accurate when the target and template have similar sequences.
* [[Threading (protein sequence)|Protein threading]]<ref name="bowie1991">{{cite journal |vauthors=Bowie JU, Lüthy R, Eisenberg D |title=A method to identify protein sequences that fold into a known three-dimensional structure |journal=Science |volume=253 |issue=5016 |pages=164–70 |date=July 1991 |pmid=1853201 |doi=10.1126/science.1853201 |bibcode=1991Sci...253..164B}}</ref> scans the amino acid sequence of an unknown structure against a database of solved structures. In each case, a [[Statistical potential|scoring function]] is used to assess the compatibility of the sequence to the structure, thus yielding possible three-dimensional models. This type of method is also known as '''3D-1D fold recognition''' due to its compatibility analysis between three-dimensional structures and linear protein sequences. This method has also given rise to methods performing an '''inverse folding search''' by evaluating the compatibility of a given structure with a large database of sequences, thus predicting which sequences have the potential to produce a given fold.
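
The idea of scoring a sequence against a library of known structures can be sketched as follows. This is a deliberately simplified illustration: each template is reduced to a string of hypothetical environment classes (<code>B</code> for buried, <code>E</code> for exposed), the scoring rule is invented for the example, and the alignment is ungapped. Practical threading programs use statistically derived potentials and gapped alignment.

```python
# Hypothetical set of residues treated as hydrophobic for this toy scoring rule.
HYDROPHOBIC = set("AVLIMFWC")


def residue_score(residue, environment):
    """Toy 3D-1D compatibility: +1 if hydrophobicity matches burial, otherwise -1."""
    is_buried = environment == "B"
    is_hydrophobic = residue in HYDROPHOBIC
    return 1.0 if is_buried == is_hydrophobic else -1.0


def thread_score(sequence, environments):
    """Score an ungapped placement of the sequence onto one template's environments."""
    if len(sequence) != len(environments):
        raise ValueError("this sketch only handles equal-length, ungapped threading")
    return sum(residue_score(r, e) for r, e in zip(sequence, environments))


def rank_templates(sequence, templates):
    """Return (template name, score) pairs sorted from most to least compatible."""
    scores = {name: thread_score(sequence, env) for name, env in templates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical template library: burial pattern per residue position.
    library = {"fold_1": "BBEEBBEE", "fold_2": "EBEBEBEB"}
    print(rank_templates("LVKDIADS", library))
```

Swapping the roles of the loops, so that one structure is scored against many sequences, gives the inverse folding search mentioned above.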

===Modeling of side-chain conformations===
Accurate packing of the amino acid [[side chain]]s represents a separate problem in protein structure prediction. Methods that specifically address the problem of predicting side-chain geometry include [[dead-end elimination]] and the [[self-consistent mean field (biology)|self-consistent mean field]] methods. Low-energy side-chain conformations are usually determined on a rigid polypeptide backbone, using a set of discrete side-chain conformations known as "[[rotamer]]s". The methods attempt to identify the set of rotamers that minimizes the model's overall energy.
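
The rotamer selection problem can be sketched as a discrete minimization over self energies and pairwise energies. The example below applies a simple dead-end elimination test (a rotamer is discarded when its best-case energy is still worse than the worst-case energy of an alternative at the same position) and then enumerates what remains. The energy tables are hypothetical numbers chosen for illustration, and the data layout is an assumption of this sketch.

```python
import itertools


def pair_energy(pair_e, i, r, j, s):
    """Look up the interaction of rotamer r at position i with rotamer s at position j."""
    return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]


def total_energy(assignment, self_e, pair_e):
    """Sum of self energies plus all pairwise energies for one rotamer per position."""
    n = len(assignment)
    energy = sum(self_e[i][assignment[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair_e[(i, j)][assignment[i]][assignment[j]]
    return energy


def dead_end_eliminate(self_e, pair_e):
    """Repeatedly remove rotamers that cannot be part of the global minimum."""
    n = len(self_e)
    alive = [set(range(len(energies))) for energies in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for r in list(alive[i]):
                best_r = self_e[i][r] + sum(
                    min(pair_energy(pair_e, i, r, j, s) for s in alive[j]) for j in others)
                for t in alive[i] - {r}:
                    worst_t = self_e[i][t] + sum(
                        max(pair_energy(pair_e, i, t, j, s) for s in alive[j]) for j in others)
                    if best_r > worst_t:  # r can never beat t, so r is a dead end
                        alive[i].discard(r)
                        changed = True
                        break
    return alive


def best_assignment(self_e, pair_e, candidates):
    """Exhaustively search the remaining rotamer combinations."""
    choices = [sorted(c) for c in candidates]
    return min(itertools.product(*choices), key=lambda a: total_energy(a, self_e, pair_e))


# Hypothetical energies for three positions with 2, 3 and 2 rotamers.
SELF_E = [[0.0, 5.0], [1.0, 0.0, 4.0], [0.5, 0.2]]
PAIR_E = {
    (0, 1): [[0.0, 0.3, 0.1], [0.2, 0.0, 0.1]],
    (0, 2): [[0.1, 0.0], [0.0, 0.4]],
    (1, 2): [[0.0, 0.2], [0.3, 0.0], [0.1, 0.1]],
}

if __name__ == "__main__":
    remaining = dead_end_eliminate(SELF_E, PAIR_E)
    print(remaining, best_assignment(SELF_E, PAIR_E, remaining))
```

Elimination never removes the global minimum, so the pruned search returns the same answer as full enumeration while visiting fewer combinations.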

These methods use rotamer libraries, which are collections of favorable conformations for each residue type in proteins. Rotamer libraries may contain information about the conformation, its frequency, and the standard deviations about mean dihedral angles, which can be used in sampling.<ref name="Rotamers21stCentury">{{cite journal |vauthors=Dunbrack RL |title=Rotamer libraries in the 21st century |journal=Current Opinion in Structural Biology |volume=12 |issue=4 |pages=431–40 |date=August 2002 |pmid=12163064 |doi=10.1016/S0959-440X(02)00344-5}}</ref> Rotamer libraries are derived from [[structural bioinformatics]] or other statistical analysis of side-chain conformations in known experimental structures of proteins, such as by clustering the observed conformations for tetrahedral carbons near the staggered (60°, 180°, −60°) values.
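
The clustering step can be sketched by assigning each observed dihedral angle to the nearest staggered value and tallying the results. This is a minimal illustration with invented angles; real libraries are built from curated structure sets and also record spreads around each mean, which this sketch omits.

```python
from collections import Counter

# Ideal staggered dihedral values, in degrees, for a tetrahedral carbon.
STAGGERED_CENTERS = (60.0, 180.0, -60.0)


def angular_distance(a, b):
    """Smallest absolute difference between two angles in degrees (periodic in 360)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def nearest_staggered(chi):
    """Assign a dihedral angle to the closest staggered center."""
    return min(STAGGERED_CENTERS, key=lambda center: angular_distance(chi, center))


def rotamer_frequencies(chi_angles):
    """Fraction of observations falling in each staggered bin."""
    counts = Counter(nearest_staggered(chi) for chi in chi_angles)
    total = len(chi_angles)
    return {center: counts[center] / total for center in STAGGERED_CENTERS}


if __name__ == "__main__":
    # Hypothetical chi1 observations for one residue type.
    observed = [62.0, 58.0, -65.0, 175.0, -170.0, -58.0]
    print(rotamer_frequencies(observed))
```

The periodic distance matters because angles such as −170° and 175° belong to the same bin centered on 180°.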

Rotamer libraries can be backbone-independent, secondary-structure-dependent, or backbone-dependent. Backbone-independent rotamer libraries make no reference to backbone conformation, and are calculated from all available side chains of a certain type (for instance, the first example of a rotamer library, done by Ponder and [[Frederic M. Richards|Richards]] at Yale in 1987).<ref>{{cite journal |vauthors=Ponder JW, Richards FM |title=Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes |journal=Journal of Molecular Biology |volume=193 |issue=4 |pages=775–91 |date=February 1987 |pmid=2441069 |doi=10.1016/0022-2836(87)90358-5}}</ref> Secondary-structure-dependent libraries present different dihedral angles and/or rotamer frequencies for <math>\alpha</math>-helix, <math>\beta</math>-sheet, or coil secondary structures.<ref>{{cite journal |vauthors=Lovell SC, Word JM, Richardson JS, Richardson DC |title=The penultimate rotamer library |journal=Proteins |volume=40 |issue=3 |pages=389–408 |date=August 2000 |pmid=10861930 |doi=10.1002/1097-0134(20000815)40:3<389::AID-PROT50>3.0.CO;2-2 |s2cid=3055173}}</ref> [[Backbone-dependent rotamer library|Backbone-dependent rotamer libraries]] present conformations and/or frequencies dependent on the local backbone conformation as defined by the backbone dihedral angles <math>\phi</math> and <math>\psi</math>, regardless of secondary structure.<ref name="bbdep2010">{{cite journal |vauthors=Shapovalov MV, Dunbrack RL |title=A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions |journal=Structure |volume=19 |issue=6 |pages=844–58 |date=June 2011 |pmid=21645855 |pmc=3118414 |doi=10.1016/j.str.2011.03.019}}</ref>

The modern versions of these libraries as used in most software are presented as multidimensional distributions of probability or frequency, where the peaks correspond to the dihedral-angle conformations considered as individual rotamers in the lists. Some versions are based on very carefully curated data and are used primarily for structure validation,<ref>{{cite journal |vauthors=Chen VB, Arendall WB, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW, Richardson JS, Richardson DC |title=MolProbity: all-atom structure validation for macromolecular crystallography |journal=Acta Crystallographica. Section D, Biological Crystallography |volume=66 |issue=Pt 1 |pages=12–21 |date=January 2010 |pmid=20057044 |pmc=2803126 |doi=10.1107/S0907444909042073|bibcode=2010AcCrD..66...12C }}</ref> while others emphasize relative frequencies in much larger data sets and are the form used primarily for structure prediction, such as the [[Backbone-dependent rotamer library|Dunbrack rotamer libraries]].<ref>{{cite journal |vauthors=Bower MJ, Cohen FE, Dunbrack RL |title=Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: a new homology modeling tool |journal=Journal of Molecular Biology |volume=267 |issue=5 |pages=1268–82 |date=April 1997 |pmid=9150411 |doi=10.1006/jmbi.1997.0926}}</ref>

Side-chain packing methods are most useful for analyzing the protein's [[hydrophobic]] core, where side chains are more closely packed; they have more difficulty addressing the looser constraints and higher flexibility of surface residues, which often occupy multiple rotamer conformations rather than just one.<ref name="voigt2000">{{cite journal |vauthors=Voigt CA, Gordon DB, Mayo SL |title=Trading accuracy for speed: A quantitative comparison of search algorithms in protein sequence design |journal=Journal of Molecular Biology |volume=299 |issue=3 |pages=789–803 |date=June 2000 |pmid=10835284 |doi=10.1006/jmbi.2000.3758 |citeseerx=10.1.1.138.2023}}</ref><ref name="scwrl4">{{cite journal |vauthors=Krivov GG, Shapovalov MV, Dunbrack RL |title=Improved prediction of protein side-chain conformations with SCWRL4 |journal=Proteins |volume=77 |issue=4 |pages=778–95 |date=December 2009 |pmid=19603484 |pmc=2885146 |doi=10.1002/prot.22488}}</ref>