==Underlying models of protein structure and function ==
Protein design programs use [[bioinformatics|computer models]] of the molecular forces that drive proteins in ''[[in vivo]]'' environments. To make the problem tractable, protein design models simplify these forces. Although protein design programs vary greatly, they have to address four main modeling questions: what is the target structure of the design, what flexibility is allowed on the target structure, which sequences are included in the search, and which force field will be used to score sequences and structures?

===Target structure===
[[File:Top7.png|thumb|left|The [[Top7]] protein was one of the first proteins designed for a fold that had never been seen before in nature<ref name="kuhlman03">{{cite journal|last=Kuhlman|first=B|author2=Dantas, G |author3=Ireton, GC |author4=Varani, G |author5=Stoddard, BL |author6= Baker, D |title=Design of a novel globular protein fold with atomic-level accuracy.|journal=Science|date=November 21, 2003|volume=302|issue=5649|pages=1364–8|pmid=14631033|bibcode= 2003Sci...302.1364K |doi= 10.1126/science.1089427|s2cid=1939390}}</ref>]]

Protein function is heavily dependent on protein structure, and rational protein design uses this relationship to design function by designing proteins that have a target structure or fold. Thus, by definition, in rational protein design the target structure or ensemble of structures must be known beforehand. This contrasts with other forms of protein engineering, such as [[directed evolution]], where a variety of methods are used to find proteins that achieve a specific function, and with [[protein structure prediction]] where the sequence is known, but the structure is unknown.

Most often, the target structure is based on a known structure of another protein. However, designing novel folds not seen in nature has become increasingly feasible. Peter S. Kim and coworkers designed trimers and tetramers of unnatural coiled coils, which had not been seen before in nature.<ref name="gordon99review" /><ref name="harbury99" /> The protein Top7, developed in [[David Baker (biochemist)|David Baker]]'s lab, was designed entirely with protein design algorithms to adopt a completely novel fold.<ref name="kuhlman03" /> More recently, Baker and coworkers developed a series of principles to design ideal [[globular protein|globular-protein]] structures based on [[folding funnel|protein folding funnels]] that bridge between secondary structure prediction and tertiary structures. These principles, which build on both protein structure prediction and protein design, were used to design five different novel protein topologies.<ref>{{cite journal|last=Höcker|first=B|title=Structural biology: A toolbox for protein design.|journal=Nature|date=November 8, 2012|volume=491|issue=7423|pages=204–5|pmid=23135466|bibcode= 2012Natur.491..204H |doi= 10.1038/491204a|s2cid=4426247|doi-access=free}}</ref>

===Sequence space===
[[File:1FSVblue-1ZAAred.png|thumb|FSD-1 (shown in blue, PDB id: 1FSV) was the first ''de novo'' computational design of a full protein.<ref name="dahiyat1997">{{cite journal|last=Dahiyat|first=BI|author2=Mayo, SL |title=De novo protein design: fully automated sequence selection.|journal=Science|date=October 3, 1997|volume=278|issue=5335|pages=82–7|pmid=9311930|doi=10.1126/science.278.5335.82|citeseerx=10.1.1.72.7304}}</ref> The target fold was that of the zinc finger in residues 33–60 of the structure of protein Zif268 (shown in red, PDB id: 1ZAA). The designed sequence had very little sequence identity with any known protein sequence.]]

In rational protein design, proteins can be redesigned from the sequence and structure of a known protein, or completely from scratch in ''de novo'' protein design. In protein redesign, most of the residues in the sequence are kept as their wild-type amino acid while a few are allowed to mutate. In ''de novo'' design, the entire sequence is designed anew, based on no prior sequence.

Both ''de novo'' designs and protein redesigns can establish rules on the [[Sequence space (evolution)|sequence space]]: the specific amino acids that are allowed at each mutable residue position. For example, the composition of the surface of the [[#Protein resurfacing|RSC3 probe]] to select HIV-broadly neutralizing antibodies was restricted based on evolutionary data and charge balancing. Many of the earliest attempts at protein design were heavily based on empirical ''rules'' on the sequence space.<ref name="richardson1989" /> Moreover, the [[#Design of fibrous proteins|design of fibrous proteins]] usually follows strict rules on the sequence space. [[Collagen]]-based designed proteins, for example, are often composed of Gly-Pro-X repeating patterns.<ref name="richardson1989" /> The advent of computational techniques allows designing proteins with no human intervention in sequence selection.<ref name="dahiyat1997" />
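
The effect of such rules on the size of the search can be illustrated with a minimal Python sketch. The five-residue sequence, the mutable positions, and the allowed residues below are hypothetical values chosen for illustration; the function names are not taken from any protein design program.

```python
from math import prod

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def allowed_sets(wild_type, mutable):
    """Return the allowed residues at each position.

    `mutable` maps a 0-based position to a string of allowed residues;
    every other position keeps its wild-type residue (protein redesign).
    """
    return [mutable.get(i, aa) for i, aa in enumerate(wild_type)]

def sequence_space_size(allowed):
    """Count the sequences obtainable from per-position allowed residues."""
    return prod(len(set(choices)) for choices in allowed)

# Hypothetical redesign: two mutable positions in a five-residue sequence.
wild_type = "MKVLA"
redesign = allowed_sets(wild_type, {1: "KRE", 3: "LIVF"})

# De novo design with no restrictions: any amino acid at every position.
de_novo = [AMINO_ACIDS] * len(wild_type)

print(sequence_space_size(redesign))  # 3 * 4 = 12
print(sequence_space_size(de_novo))   # 20**5 = 3200000
```

Even for this toy length, the unrestricted space is several orders of magnitude larger than the redesign space, which is one reason rules on the sequence space are used.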

===Structural flexibility===
[[File:ileRotamers.gif|thumb|left|200px|Common protein design programs use rotamer libraries to simplify the conformational space of protein side chains. This animation loops through all the rotamers of the isoleucine amino acid based on the Penultimate Rotamer Library (total of 7 rotamers).<ref name="lovell2000" />]]

In protein design, the target structure (or structures) of the protein is known. However, a rational protein design approach must model some ''flexibility'' on the target structure in order to increase the number of sequences that can be designed for that structure and to minimize the chance of a sequence folding to a different structure. For example, in a protein redesign of one small amino acid (such as alanine) in the tightly packed core of a protein, very few mutants would be predicted by a rational design approach to fold to the target structure if the surrounding side-chains are not allowed to be repacked.

Thus, an essential parameter of any design process is the amount of flexibility allowed for both the side-chains and the backbone. In the simplest models, the protein backbone is kept rigid while some of the protein side-chains are allowed to change conformations. However, side-chains can have many degrees of freedom in their bond lengths, bond angles, and [[Dihedral angle#Dihedral angles of biological molecules|<var>&chi;</var> dihedral angles]]. To simplify this space, protein design methods use rotamer libraries that assume ideal values for bond lengths and bond angles, while restricting <var>&chi;</var> dihedral angles to a few frequently observed low-energy conformations termed [[Conformational isomerism|rotamers]].

Rotamer libraries are derived from the statistical analysis of many protein structures. Backbone-independent rotamer libraries describe all rotamers.<ref name="lovell2000">{{cite journal|last=Lovell|first=SC|author2=Word, JM |author3=Richardson, JS |author4= Richardson, DC |title=The penultimate rotamer library.|journal=Proteins|date=August 15, 2000|volume=40|issue=3|pages=389–408|pmid=10861930|doi=10.1002/1097-0134(20000815)40:3<389::AID-PROT50>3.0.CO;2-2|citeseerx=10.1.1.555.4071|s2cid=3055173 }}</ref> [[Backbone-dependent rotamer library|Backbone-dependent rotamer libraries]], in contrast, describe how likely each rotamer is to appear given the protein backbone arrangement around the side chain.<ref>{{cite journal|last=Shapovalov|first=MV|author2=Dunbrack RL, Jr|title=A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions.|journal=Structure|date=June 8, 2011|volume=19|issue=6|pages=844–58|pmid=21645855|doi=10.1016/j.str.2011.03.019|pmc=3118414}}</ref> Most protein design programs use one conformation (e.g., the modal value for rotamer dihedrals in space) or several points in the region described by the rotamer; the OSPREY protein design program, in contrast, models the entire continuous region.<ref name="samish11"/>
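
The discretization described above can be sketched in Python as follows. The three dihedral wells (−60°, 60°, 180°) are a simplifying assumption made for illustration and are not values taken from any particular rotamer library; a real library lists only the combinations actually observed, together with their frequencies.

```python
from itertools import product

# Assumed idealized chi-angle wells in degrees (illustrative only).
CHI_WELLS = (-60.0, 60.0, 180.0)

def enumerate_rotamers(n_chi, wells=CHI_WELLS):
    """List every combination of wells for a side chain with n_chi dihedrals."""
    return list(product(wells, repeat=n_chi))

# A side chain with two chi angles, such as isoleucine, has 3 * 3 = 9
# combinations under this naive scheme.
two_chi = enumerate_rotamers(2)
print(len(two_chi))  # 9
print(two_chi[0])    # (-60.0, -60.0)
```

The Penultimate Rotamer Library lists 7 isoleucine rotamers rather than the 9 naive combinations, which reflects the fact that libraries are built from observed conformations and not from a full grid.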

Although rational protein design must preserve the general backbone fold of a protein, allowing some backbone flexibility can significantly increase the number of sequences that fold to the structure while maintaining the general fold of the protein.<ref name="kortemme09">{{cite journal|last=Mandell|first=DJ|author2=Kortemme, T |author-link2=Tanja Kortemme |title=Backbone flexibility in computational protein design.|journal=Current Opinion in Biotechnology|date=August 2009|volume=20|issue=4|pages=420–8|pmid=19709874|doi=10.1016/j.copbio.2009.07.006|url=https://escholarship.org/content/qt89b8n09b/qt89b8n09b.pdf?t=pqrxq4}}</ref> Backbone flexibility is especially important in protein redesign because sequence mutations often result in small changes to the backbone structure. Moreover, backbone flexibility can be essential for more advanced applications of protein design, such as binding prediction and enzyme design. Some models of protein design backbone flexibility include small and continuous global backbone movements, discrete backbone samples around the target fold, backrub motions, and protein loop flexibility.<ref name="kortemme09" /><ref name="donald10" />

===Energy function===
[[File:PEF comparison.png|thumb|400px|right|Comparison of various potential energy functions. The most accurate energy functions are those that use quantum mechanical calculations, but these are too slow for protein design. On the other extreme, heuristic energy functions are based on statistical terms and are very fast. In the middle are molecular mechanics energy functions that are physically based but are not as computationally expensive as quantum mechanical simulations.<ref name="Boas"/>]]

Rational protein design techniques must be able to discriminate sequences that will be stable under the target fold from those that would prefer other low-energy competing states. Thus, protein design requires accurate [[force field (chemistry)|energy functions]] that can rank and score sequences by how well they fold to the target structure. At the same time, however, these energy functions must consider the computational [[#As an optimization problem|challenges]] behind protein design.  One of the most challenging requirements for successful design is an energy function that is both accurate and simple for computational calculations.

The most accurate energy functions are those based on quantum mechanical simulations. However, such simulations are too slow and typically impractical for protein design. Instead, many protein design algorithms use either physics-based energy functions adapted from [[molecular mechanics]] simulation programs, [[statistical potential|knowledge-based energy functions]], or a hybrid mix of both. The trend has been toward using more physics-based potential energy functions.<ref name="Boas">{{cite journal |last1=Boas |first1=F. E. |last2=Harbury |first2=P. B. |name-list-style=amp |year=2007 |title=Potential energy functions for protein design |journal=Current Opinion in Structural Biology |volume=17 |issue=2 |pages=199–204 |doi=10.1016/j.sbi.2007.03.006 |pmid=17387014}}</ref>

Physics-based energy functions, such as [[AMBER]] and [[CHARMM]], are typically derived from quantum mechanical simulations and experimental data from thermodynamics, crystallography, and spectroscopy.<ref name="boas2007">{{cite journal|last=Boas|first=FE|author2=Harbury, PB |title=Potential energy functions for protein design.|journal=Current Opinion in Structural Biology|date=April 2007|volume=17|issue=2|pages=199–204|pmid=17387014|doi=10.1016/j.sbi.2007.03.006}}</ref> These energy functions typically simplify the physical energy function and make it pairwise decomposable, meaning that the total energy of a protein conformation can be calculated by adding the pairwise energy between each atom pair, which makes them attractive for optimization algorithms. Physics-based energy functions typically model an attractive-repulsive [[Lennard-Jones]] term between atoms and a pairwise [[electrostatics]] coulombic term<ref>{{cite journal|last=Vizcarra|first=CL|author2=Mayo, SL |title=Electrostatics in computational protein design.|journal=Current Opinion in Chemical Biology|date=December 2005|volume=9|issue=6|pages=622–6|pmid=16257567|doi=10.1016/j.cbpa.2005.10.014}}</ref> between non-bonded atoms.
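
Pairwise decomposability can be illustrated with a minimal Python sketch that sums a 12-6 Lennard-Jones term and a Coulomb term over all atom pairs. This is a sketch under stated assumptions and not the functional form of AMBER, CHARMM, or any design program: every pair is treated as non-bonded, a single well depth <code>epsilon</code> and contact distance <code>sigma</code> are used for all atoms, and the dielectric constant is fixed. The atom coordinates and charges are hypothetical.

```python
import math
from itertools import combinations

# Coulomb constant in kcal*angstrom/(mol*e^2), for charges in units of e.
COULOMB_K = 332.0637

def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy at distance r (repulsive minus attractive)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy between two point charges at distance r."""
    return COULOMB_K * q1 * q2 / (dielectric * r)

def pair_energy(a, b, epsilon=0.1, sigma=3.4):
    """Energy of one atom pair; each atom is a tuple (x, y, z, charge)."""
    r = math.dist(a[:3], b[:3])
    return lennard_jones(r, epsilon, sigma) + coulomb(r, a[3], b[3])

def total_energy(atoms):
    """Pairwise-decomposable total: the sum over all unordered atom pairs."""
    return sum(pair_energy(a, b) for a, b in combinations(atoms, 2))

# Hypothetical three-atom system: coordinates in angstroms, charges in e.
atoms = [(0.0, 0.0, 0.0, 0.3), (4.0, 0.0, 0.0, -0.3), (0.0, 4.0, 0.0, 0.0)]
print(total_energy(atoms))
```

Because the total is a plain sum of independent pair terms, the energy change caused by altering one residue only requires recomputing the pairs that involve its atoms.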

[[File:Water-hbond-vrc01-gp120.png|thumb|left|Water-mediated hydrogen bonds play a key role in protein–protein binding. One such interaction is shown between residues D457, S365 in the heavy chain of the HIV-broadly-neutralizing antibody VRC01 (green) and residues N58 and Y59 in the HIV envelope protein GP120 (purple).<ref name="wu2010">{{cite journal|last=Zhou|first=T|author2=Georgiev, I|author3=Wu, X|author4=Yang, ZY|author5=Dai, K|author6=Finzi, A|author7=Kwon, YD|author8=Scheid, JF|author9=Shi, W|author10=Xu, L|author11=Yang, Y|author12=Zhu, J|author13=Nussenzweig, MC|author14=Sodroski, J|author15=Shapiro, L|author16=Nabel, GJ|author17=Mascola, JR|author18=Kwong, PD|title=Structural basis for broad and potent neutralization of HIV-1 by antibody VRC01.|journal=Science|date=August 13, 2010|volume=329|issue=5993|pages=811–7|pmid=20616231|bibcode= 2010Sci...329..811Z |doi= 10.1126/science.1192819|pmc=2981354}}</ref>]]

Statistical potentials, in contrast to physics-based potentials, have the advantage of being fast to compute, of accounting implicitly for complex effects, and of being less sensitive to small changes in the protein structure.<ref>{{cite journal|last=Mendes|first=J|author2=Guerois, R |author3=Serrano, L |title=Energy estimation in protein design.|journal=Current Opinion in Structural Biology|date=August 2002|volume=12|issue=4|pages=441–6|pmid=12163065|doi=10.1016/s0959-440x(02)00345-7}}</ref> These energy functions are [[:File:knowledge based potential.png|based on deriving energy values]] from the frequency of appearance in a structural database.
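
One common way to turn database frequencies into energy values is the inverse Boltzmann relation, in which a state observed more often than expected receives a lower energy. The Python sketch below assumes this relation and a uniform reference distribution; the text above does not specify which form a given program uses, and the counts are hypothetical.

```python
import math

# Approximate thermal energy kT in kcal/mol near room temperature.
KT = 0.592

def energies_from_counts(counts, reference=None, kt=KT):
    """Convert observation counts into energies with E = -kT * ln(p_obs / p_ref).

    `counts` maps each state to a positive count; zero counts would need
    pseudocounts, which this sketch does not implement. If `reference` is
    None, a uniform reference distribution over the states is assumed.
    """
    total = sum(counts.values())
    uniform = 1.0 / len(counts)
    energies = {}
    for state, count in counts.items():
        p_obs = count / total
        p_ref = uniform if reference is None else reference[state]
        energies[state] = -kt * math.log(p_obs / p_ref)
    return energies

# Hypothetical counts of three side-chain states in a structural database.
energies = energies_from_counts({"a": 50, "b": 25, "c": 25})
print(energies)
```

In this example state <code>a</code> is seen more often than the uniform reference predicts and gets a negative (favorable) energy, while <code>b</code> and <code>c</code> get equal positive energies.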

Protein design, however, has requirements that molecular mechanics force-fields do not always meet. Molecular mechanics force-fields, which have been used mostly in molecular dynamics simulations, are optimized for the simulation of single sequences, but protein design searches through many conformations of many sequences. Thus, molecular mechanics force-fields must be tailored for protein design. In practice, protein design energy functions often incorporate both statistical terms and physics-based terms. For example, the Rosetta energy function, one of the most-used energy functions, incorporates physics-based energy terms originating in the CHARMM energy function, and statistical energy terms, such as rotamer probability and knowledge-based electrostatics. Typically, energy functions are highly customized between laboratories, and specifically tailored for every design.<ref name="boas2007" />

====Challenges for effective design energy functions====
Water makes up most of the molecules surrounding proteins and is the main driver of protein structure. Thus, modeling the interaction between water and protein is vital in protein design. The number of water molecules that interact with a protein at any given time is huge, and each one has a large number of degrees of freedom and interaction partners, which makes modeling every water molecule explicitly impractical. Instead, protein design programs model most of such water molecules as a continuum, modeling both the hydrophobic effect and solvation polarization.<ref name="boas2007" />

Individual water molecules can sometimes have a crucial structural role in the core of proteins, and in protein–protein or protein–ligand interactions. Failing to model such waters can result in mispredictions of the optimal sequence of a protein–protein interface. As an alternative, water molecules can be added to rotamers.
<!--
====Lennard-Jones potentials====

====Electrostatics====

====Entropy====

To be done.

====Non-pairwise terms====

Polarizability ... to be done.

====Knowledge-based energy functions====
--><ref name="boas2007" />