==Methods==
Structural alignment techniques have been used in comparing individual structures or sets of structures as well as in the production of "all-to-all" comparison databases that measure the divergence between every pair of structures present in the [[Protein Data Bank]] (PDB). Such databases are used to classify proteins by their [[tertiary structure|fold]].

===DALI===
[[Image:Ssap-vectors.png|frame|class=skin-invert-image|Illustration of the atom-to-atom vectors calculated in SSAP. From these vectors a series of vector differences, e.g., between (FA) in Protein 1 and (SI) in Protein 2 would be constructed. The two sequences are plotted on the two dimensions of a matrix to form a difference matrix between the two proteins. Dynamic programming is applied to all possible difference matrices to construct a series of optimal local alignment paths that are then summed to form the summary matrix, on which a second round of dynamic programming is performed.]]A widely used structural alignment method is DALI (Distance-matrix ALIgnment), which breaks the input structures into hexapeptide fragments and calculates a distance matrix by evaluating the contact patterns between successive fragments.<ref name="holm"/> [[Secondary structure]] features that involve residues that are contiguous in sequence appear on the matrix's [[main diagonal]]; other diagonals in the matrix reflect spatial contacts between residues that are not near each other in the sequence. When these diagonals are parallel to the main diagonal, the features they represent are parallel; when they are perpendicular, their features are antiparallel. This representation is memory-intensive because the features in the square matrix are symmetrical (and thus redundant) about the main diagonal.

When two proteins' distance matrices share the same or similar features in approximately the same positions, they can be said to have similar folds with similar-length loops connecting their secondary structure elements. DALI's actual alignment process requires a similarity search after the two proteins' distance matrices are built; this is normally conducted via a series of overlapping submatrices of size 6x6. Submatrix matches are then reassembled into a final alignment via a standard score-maximization algorithm&nbsp;— the original version of DALI used a [[Monte Carlo method|Monte Carlo]] simulation to maximize a structural similarity score that is a function of the distances between putative corresponding atoms. In particular, more distant atoms within corresponding features are exponentially downweighted to reduce the effects of noise introduced by loop mobility, helix torsions, and other minor structural variations.<ref name="Mount" /> Because DALI relies on an all-to-all distance matrix, it can account for the possibility that structurally aligned features might appear in different orders within the two sequences being compared.
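The distance-dependent weighting described above can be illustrated with a minimal Python sketch. It assumes the commonly cited "elastic" form of the score, in which the relative deviation between corresponding intramolecular distances is compared with a tolerance and multiplied by an envelope that decays with distance. The function names, the tolerance <code>theta = 0.2</code>, the 20 Å envelope scale and the treatment of the diagonal term are illustrative assumptions, not a transcription of the DALI source code.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between C-alpha coordinates (x, y, z)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def elastic_score(dist_a, dist_b, pairs, theta=0.2, scale=20.0):
    """Sum an elastic similarity term over all pairs of matched residues.

    pairs is a list of (i, j) tuples: residue i of protein A is matched
    to residue j of protein B. theta and scale are assumed constants.
    """
    total = 0.0
    for i_a, i_b in pairs:
        for j_a, j_b in pairs:
            d_a = dist_a[i_a][j_a]
            d_b = dist_b[i_b][j_b]
            mean = 0.5 * (d_a + d_b)
            if mean == 0.0:
                # Same residue on both sides: assumed to contribute theta.
                total += theta
                continue
            # Distant pairs are downweighted by the envelope term.
            envelope = math.exp(-((mean / scale) ** 2))
            total += (theta - abs(d_a - d_b) / mean) * envelope
    return total
```

In this sketch, a pair of residues whose separation is the same in both structures adds a positive term, a pair whose separation differs by more than the tolerance adds a negative term, and both kinds of term shrink as the residues move farther apart.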

The DALI method has also been used to construct a database known as [[Families of structurally similar proteins|FSSP]] (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins) in which all known protein structures are aligned with each other to determine their structural neighbors and fold classification. There is a [http://ekhidna.biocenter.helsinki.fi/dali searchable database] based on DALI as well as a [http://ekhidna.biocenter.helsinki.fi/dali/README.v5.html downloadable program] and [http://ekhidna.biocenter.helsinki.fi/dali web search] based on a standalone version known as DaliLite.

===Combinatorial extension===
The combinatorial extension (CE) method is similar to DALI in that it too breaks each structure in the query set into a series of fragments that it then attempts to reassemble into a complete alignment. A series of pairwise combinations of fragments called aligned fragment pairs, or AFPs, are used to define a similarity matrix through which an optimal path is generated to identify the final alignment. Only AFPs that meet given criteria for local similarity are included in the matrix as a means of reducing the necessary search space and thereby increasing efficiency.<ref name="shindyalov">{{cite journal
| pmid=9796821
| first = I.N.
| last = Shindyalov
|author2=Bourne P.E.
 | year = 1998
| journal = Protein Engineering
| volume=11
| title=Protein structure alignment by incremental combinatorial extension (CE) of the optimal path
| issue = 9
| pages = 739–747
| doi=10.1093/protein/11.9.739
| doi-access = free
}}</ref> A number of similarity metrics are possible; the original definition of the CE method included only structural superpositions and inter-residue distances but has since been expanded to include local environmental properties such as secondary structure, solvent exposure, hydrogen-bonding patterns, and [[dihedral angle]]s.<ref name="shindyalov" />

An alignment path is calculated as the optimal path through the similarity matrix by linearly progressing through the sequences and extending the alignment with the next possible high-scoring AFP. The initial AFP that nucleates the alignment can occur at any point in the sequence matrix. Extensions then proceed with the next AFP that meets given distance criteria restricting the alignment to low gap sizes. The size of each AFP and the maximum gap size are required input parameters but are usually set to empirically determined values of 8 and 30 respectively.<ref name="shindyalov" /> Like DALI and SSAP, CE has been used to construct an all-to-all fold classification [http://cl.sdsc.edu/ database] {{Webarchive|url=https://web.archive.org/web/19981203071023/http://cl.sdsc.edu/ |date=1998-12-03 }} from the known protein structures in the PDB.
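The fragment-pairing and extension steps can be sketched in Python as follows. This is a simplified illustration rather than the published CE procedure: the similarity measure (mean absolute difference of intra-fragment distances), the <code>cutoff</code> value, the greedy choice of the next AFP and all function names are assumptions made for the example. Only the fragment size of 8 and the maximum gap of 30 come from the text above.

```python
import math


def intra_distances(coords, start, size):
    """All pairwise C-alpha distances inside one fragment."""
    frag = coords[start:start + size]
    return [math.dist(frag[i], frag[j])
            for i in range(size) for j in range(i + 1, size)]


def find_afps(coords_a, coords_b, size=8, cutoff=3.0):
    """List (i, j, diff) for fragment pairs whose internal geometry is similar.

    diff is the mean absolute difference of intra-fragment distances;
    the cutoff is a hypothetical local-similarity criterion.
    """
    afps = []
    for i in range(len(coords_a) - size + 1):
        d_a = intra_distances(coords_a, i, size)
        for j in range(len(coords_b) - size + 1):
            d_b = intra_distances(coords_b, j, size)
            diff = sum(abs(x - y) for x, y in zip(d_a, d_b)) / len(d_a)
            if diff < cutoff:
                afps.append((i, j, diff))
    return afps


def extend_path(afps, seed, size=8, max_gap=30):
    """Greedily extend a path from a seed AFP, keeping gaps within max_gap."""
    path = [seed]
    while True:
        i, j, _ = path[-1]
        candidates = [
            afp for afp in afps
            if 0 <= afp[0] - (i + size) <= max_gap
            and 0 <= afp[1] - (j + size) <= max_gap
        ]
        if not candidates:
            return path
        # Choose the most similar following AFP; ties go to the earliest one.
        path.append(min(candidates, key=lambda afp: (afp[2], afp[0], afp[1])))
```

Restricting the matrix to AFPs below the cutoff is what keeps the search space small: <code>extend_path</code> only ever inspects fragment pairs that already passed the local-similarity filter.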

The [[Protein Data Bank|RCSB PDB]] has released updated versions of CE, Mammoth, and FATCAT as part of the [http://www.rcsb.org/pdb/workbench/workbench.do RCSB PDB Protein Comparison Tool]. The tool provides a new variation of CE that can detect [[Circular Permutation Proteins|circular permutations]] in protein structures.<ref name="prlic"/>

===Mammoth===
MAMMOTH<ref name="Mammoth">{{ cite journal
| pmid=12381844
| first= AR
| last = Ortiz
| author2 = Strauss CE 
| author3 = Olmea O. 
| title=MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. 
| journal=Protein Science
| year=2002
|volume=11 
| issue=11
|pages=2606–2621
|doi=10.1110/ps.0215902
| pmc= 2373724
|doi-access=free
}}</ref> approaches the alignment problem with a different objective from almost all other methods. Rather than trying to find an alignment that maximally superimposes the largest number of residues, it seeks the subset of the structural alignment least likely to occur by chance. To do this it marks a local motif alignment with flags to indicate which residues simultaneously satisfy more stringent criteria: (1) local structure overlap, (2) regular secondary structure, (3) 3D superposition, and (4) same ordering in primary sequence. It converts the number of residues with high-confidence matches and the size of the protein into an expectation value (E-value) for obtaining the outcome by chance. It excels at matching remote homologs, particularly structures generated by ab initio structure prediction to structure families such as SCOP, because it emphasizes extracting a statistically reliable sub-alignment rather than achieving the maximal sequence alignment or maximal 3D superposition.<ref name="Malmstrom" /><ref name="robetta">{{cite journal
|journal=Nucleic Acids Research
|year= 2004 
|volume= 32
|doi= 10.1093/nar/gkh468
|pmid= 15215442
|title=Protein structure prediction and analysis using the Robetta server
|author1=David E. Kim |author2=Dylan Chivian |author3=David Baker
|issue= Web Server issue 
|pages= W526–W531 
|pmc= 441606 
|doi-access= free
}}</ref>

For every overlapping window of 7 consecutive residues it computes the set of displacement direction unit vectors between adjacent C-alpha atoms. All-against-all local motifs are compared based on the URMS (unit-vector root mean square) score. These values become the pair alignment score entries for dynamic programming, which produces a seed pairwise residue alignment. The second phase uses a modified MaxSub algorithm: a single aligned pair of 7-residue windows, one from each protein, is used to orient the two full-length protein structures so as to maximally superimpose just these 7 C-alpha atoms; then, in this orientation, it scans for any additional aligned pairs that are close in 3D. It re-orients the structures to superimpose this expanded set and iterates until no more pairs coincide in 3D. This process is restarted for every 7-residue window in the seed alignment. The output is the maximal number of atoms found from any of these initial seeds. This statistic is converted to a calibrated E-value for the similarity of the proteins.
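The first phase can be illustrated with a short Python sketch that builds the unit vectors for each 7-residue window and fills an all-against-all matrix of window dissimilarities. The true URMS score requires finding the optimal rotation between the two vector sets; to stay self-contained, this sketch swaps in a rotation-invariant stand-in (the root-mean-square difference between the windows' pairwise dot products), which is not the URMS score itself. All function names are hypothetical, and consecutive coordinates are assumed to be distinct.

```python
import math


def unit_vectors(coords):
    """Unit direction vectors between consecutive C-alpha coordinates."""
    vectors = []
    for a, b in zip(coords, coords[1:]):
        length = math.dist(a, b)
        vectors.append(tuple((q - p) / length for p, q in zip(a, b)))
    return vectors


def window_signature(units, start, window=7):
    """Dot products between all pairs of the window's unit vectors.

    A window of 7 residues has 6 unit vectors and therefore 15 dot products.
    The signature does not change when the structure is rotated.
    """
    vs = units[start:start + window - 1]
    return [sum(x * y for x, y in zip(vs[i], vs[j]))
            for i in range(len(vs)) for j in range(i + 1, len(vs))]


def window_dissimilarity(units_a, i, units_b, j, window=7):
    """RMS difference of two window signatures (a stand-in for URMS)."""
    sig_a = window_signature(units_a, i, window)
    sig_b = window_signature(units_b, j, window)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sig_a, sig_b)) / len(sig_a))


def dissimilarity_matrix(coords_a, coords_b, window=7):
    """All-against-all window comparison, one row per window of protein A."""
    units_a, units_b = unit_vectors(coords_a), unit_vectors(coords_b)
    rows = len(coords_a) - window + 1
    cols = len(coords_b) - window + 1
    return [[window_dissimilarity(units_a, i, units_b, j, window)
             for j in range(cols)] for i in range(rows)]
```

A matrix of this kind, converted to similarity scores, is the sort of input that a dynamic programming step would use to produce the seed alignment. Unlike a rotation-based score, the dot-product signature cannot distinguish a motif from its mirror image.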

Mammoth makes no attempt to re-iterate the initial alignment or extend the high-quality subset. Therefore, the seed alignment it displays cannot be fairly compared to DALI or TM-align, as it was formed simply as a heuristic to prune the search space. (It can be used if one wants an alignment based solely on local structure-motif similarity, agnostic of long-range rigid-body atomic alignment.) Because of that same parsimony, it is well over ten times faster than DALI, CE and TM-align.<ref name="foldclass">{{cite journal
|title=Efficient SCOP-fold classification and retrieval using index-based protein substructure alignments
 |author1=Pin-Hao Chi |author2=Bin Pang |author3=Dmitry Korkin |author4=Chi-Ren Shyu
|journal=Bioinformatics
|volume=25
| issue=19
|year=2009
|pages=2559–2565
|doi=10.1093/bioinformatics/btp474
|pmid=19667079
|doi-access=free
}}</ref> It is often used in conjunction with these slower tools to pre-screen large databases, extracting just the structures with the best E-values for more exhaustive superposition or expensive calculations.
<ref name="grishin04">{{cite journal 
|journal=BMC Bioinformatics
|year= 2004
|volume= 5
|issue= 197
| doi=10.1186/1471-2105-5-197
|pmid= 15598351
|title=SCOPmap: Automated assignment of protein structures to evolutionary superfamilies
 |author1=Sara Cheek |author2=Yuan Qi |author3=Sri Krishna |author4=Lisa N Kinch |author5=Nick V Grishin
|page= 197
|pmc= 544345
|doi-access=free
}}</ref>
<ref name="fssa">{{cite journal
|title=FSSA: a novel method for identifying functional signatures from structural alignments
 |author1=Kai Wang |author2=Ram Samudrala
|journal=Bioinformatics
|year=2005
|volume=21
|issue=13
|pages=2969–2977
|doi=10.1093/bioinformatics/bti471
|pmid=15860561
|doi-access=free
}}</ref>

It has been particularly successful at analyzing "decoy" structures from ab initio structure prediction.<ref name="casp11">{{cite journal
|vauthors=Kryshtafovych A, Monastyrskyy B, Fidelis K 
|title=CASP11 statistics and the prediction center evaluation system
|journal=Proteins
|year= 2016
|volume=84 
|issue=Suppl 1
|pages=15–19
| doi=10.1002/prot.25005 
|pmid=26857434
|pmc=5479680
|doi-access=free 
}}</ref><ref name="Malmstrom" /><ref name="robetta" /> These decoys are notorious for getting local fragment motif structure correct and forming some kernels of correct 3D tertiary structure, while getting the full-length tertiary structure wrong. In this twilight remote homology regime, Mammoth's E-values for the CASP<ref name="casp11" /> protein structure prediction evaluation have been shown to be significantly more correlated with human ranking than SSAP or DALI.<ref name=Mammoth /> Mammoth's ability to extract the multi-criteria partial overlaps with proteins of known structure and rank these with proper E-values, combined with its speed, facilitates scanning vast numbers of decoy models against the PDB database to identify the most likely correct decoys based on their remote homology to known proteins.
<ref name="Malmstrom">{{cite journal
|title=Superfamily Assignments for the Yeast Proteome through Integration of Structure Prediction with the Gene Ontology 
 |author1=Lars Malmström |author2=Michael Riffle |author3=Charlie EM Strauss |author4=Dylan Chivian |author5=Trisha N Davis |author6=Richard Bonneau |author7=David Baker
|year=2007
|journal=PLOS Biol
| volume=5
|issue=4
|pages= e76
|doi=10.1371/journal.pbio.0050076
| pmid=17373854
| pmc=1828141
|doi-access=free
}}</ref>

===SSAP===
The SSAP (Sequential Structure Alignment Program) method uses double [[dynamic programming]] to produce a structural alignment based on atom-to-atom [[Vector (geometric)|vectors]] in structure space. Instead of the alpha carbons typically used in structural alignment, SSAP constructs its vectors from the [[beta carbon]]s for all residues except glycine, a method which thus takes into account the rotameric state of each residue as well as its location along the backbone. SSAP works by first constructing a series of inter-residue distance vectors between each residue and its nearest non-contiguous neighbors on each protein. A series of matrices are then constructed containing the vector differences between neighbors for each pair of residues for which vectors were constructed. Dynamic programming applied to each resulting matrix determines a series of optimal local alignments which are then summed into a "summary" matrix to which dynamic programming is applied again to determine the overall structural alignment.
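The two-level scheme can be sketched in Python. This is a deliberately simplified illustration, not the SSAP implementation: it compares scalar inter-residue distances instead of vectors expressed in a residue-centred frame, considers all other residues rather than only the nearest non-contiguous neighbors, uses no gap penalty, and scores each comparison as <code>a / (difference + b)</code> with assumed constants. All names and parameter values are hypothetical.

```python
import math


def dp_path(score):
    """Dynamic programming over a score matrix; returns matched (row, col) pairs."""
    if not score or not score[0]:
        return []
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + score[i - 1][j - 1],
                             best[i - 1][j], best[i][j - 1])
    pairs, i, j = [], n, m
    while i > 0 and j > 0:  # trace back from the bottom-right corner
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def double_dp_alignment(coords_a, coords_b, a=50.0, b=1.0):
    """Sum local alignment paths into a summary matrix, then align that matrix."""
    n, m = len(coords_a), len(coords_b)
    d_a = [[math.dist(p, q) for q in coords_a] for p in coords_a]
    d_b = [[math.dist(p, q) for q in coords_b] for p in coords_b]
    summary = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Lower level: how similar are the environments of residues i and j?
            local = [[a / (abs(d_a[i][k] - d_b[j][l]) + b) for l in range(m)]
                     for k in range(n)]
            # Force the local path through (i, j) by aligning the blocks
            # before and after it separately.
            upper = dp_path([row[:j] for row in local[:i]])
            lower = dp_path([row[j + 1:] for row in local[i + 1:]])
            path = upper + [(i, j)] + [(k + i + 1, l + j + 1) for k, l in lower]
            for k, l in path:
                summary[k][l] += local[k][l]
    # Upper level: a second round of dynamic programming on the summary matrix.
    return dp_path(summary)
```

Because every residue pair triggers its own lower-level alignment, the cost of this sketch grows with the square of the product of the two chain lengths, so it is practical only for very short toy inputs.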

SSAP originally produced only pairwise alignments but has since been extended to multiple alignments as well.<ref name="taylor"/> It has been applied in an all-to-all fashion to produce a hierarchical fold classification scheme known as [[CATH]] (Class, Architecture, Topology, Homology),<ref name="orengo"/> which has been used to construct the [https://web.archive.org/web/20070517161248/http://www.cathdb.info/latest/index.html CATH Protein Structure Classification] database.