Editing Sequence alignment (section)

==Multiple sequence alignment==
{{Main|Multiple sequence alignment}}
[[Image:Hemagglutinin-alignments.png|right|thumb|300px|Alignment of 27 [[avian influenza]] [[hemagglutinin]] protein sequences colored by residue conservation (top) and residue properties (bottom)]]
[[Multiple sequence alignment]] is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying [[conservation (genetics)|conserved]] sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and [[reaction mechanism|mechanistic]] information to locate the catalytic [[active site]]s of [[enzyme]]s. Alignments are also used to aid in establishing evolutionary relationships by constructing [[phylogenetic tree]]s. Multiple sequence alignments are computationally difficult to produce and most formulations of the problem lead to [[NP-complete]] combinatorial optimization problems.<ref name=wang>{{cite journal | journal=J Comput Biol | volume=1 | pages=337–48 | year=1994  |author1=Wang L |author2=Jiang T. | title=On the complexity of multiple sequence alignment | pmid=8790475 | doi = 10.1089/cmb.1994.1.337| issue=4  | citeseerx=10.1.1.408.894 }}</ref><ref name=elias>{{cite journal | journal=J Comput Biol | volume=13 | pages=1323–1339 | year=2006  | author=Elias, Isaac | title=Settling the intractability of multiple alignment | pmid=17037961 | doi =10.1089/cmb.2006.13.1323 | issue=7  | citeseerx=10.1.1.6.256 }}</ref> Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.

===Dynamic programming===
The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and [[computer memory|memory]], it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the ''n''-dimensional equivalent of the sequence matrix formed from two sequences, where ''n'' is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" [[objective function]], has been implemented in the [https://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html MSA] software package.<ref name=lipman>{{cite journal | journal=Proc Natl Acad Sci USA | volume=86 | pages=4412–5 | year=1989  |author1=Lipman DJ |author2=Altschul SF |author3=Kececioglu JD | title=A tool for multiple sequence alignment | pmid=2734293 | doi=10.1073/pnas.86.12.4412 | issue=12 | pmc=287279 | bibcode=1989PNAS...86.4412L | doi-access=free }}</ref>

===Progressive methods===
Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to [[FASTA]]. Progressive alignment results are dependent on the choice of "most related" sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy.

Many variations of the [[Clustal]] progressive implementation<ref name=higgins>{{cite journal | journal=Gene | volume=73 | issue=1 | pages=237–44 | year=1988  | author=[[Desmond G. Higgins|Higgins DG]], Sharp PM | title=CLUSTAL: a package for performing multiple sequence alignment on a microcomputer | pmid=3243435 | doi = 10.1016/0378-1119(88)90330-7 }}</ref><ref name=thompson>{{cite journal | journal=Nucleic Acids Res | volume=22 | pages=4673–80 | year=1994  | author1=Thompson JD| author2-link=Desmond G. Higgins |author2= Higgins DG|author3= Gibson TJ. | title=CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice | pmid=7984417 |pmc=308517 |url=|doi=10.1093/nar/22.22.4673 | issue=22 }}</ref><ref name=chenna>{{cite journal | journal=Nucleic Acids Res | volume=31 | pages=3497–500 | year=2003  |author1=Chenna R |author2=Sugawara H |author3=Koike T |author4=Lopez R |author5=Gibson TJ |author6=Higgins DG |author7=Thompson JD. | title=Multiple sequence alignment with the Clustal series of programs | url= | pmid=12824352 | doi = 10.1093/nar/gkg500 | issue=13 | pmc=168907 }}</ref> are used for multiple sequence alignment, phylogenetic tree construction, and as input for [[protein structure prediction]]. A slower but more accurate variant of the progressive method is known as [[T-Coffee]].<ref name=notredame>{{cite journal | journal=J Mol Biol | volume=302 | issue=1 | pages=205–17 | year=2000  | author1=Notredame C| author2-link=Desmond G. Higgins |author2= Higgins DG|author3= Heringa J. | title=T-Coffee: A novel method for fast and accurate multiple sequence alignment | pmid=10964570 | doi = 10.1006/jmbi.2000.4042 | s2cid=10189971 }}</ref>

===Iterative methods===
Iterative methods attempt to improve on the heavy dependence on the accuracy of the initial pairwise alignments, which is the weak point of the progressive methods. Iterative methods optimize an [[objective function]] based on a selected alignment scoring method by assigning an initial global alignment and then realigning sequence subsets. The realigned subsets are then themselves aligned to produce the next iteration's multiple sequence alignment. Various ways of selecting the sequence subgroups and objective function are reviewed in.<ref name=hirosawa>{{cite journal | journal=Comput Appl Biosci | volume=11 | pages=13–8 | year=1995  |author1=Hirosawa M |author2=Totoki Y |author3=Hoshida M |author4=Ishikawa M. | title=Comprehensive study on iterative algorithms of multiple sequence alignment | pmid=7796270 | doi = 10.1093/bioinformatics/11.1.13 | issue=1 }}</ref>

===Motif finding===
Motif finding, also known as profile analysis, constructs global multiple sequence alignments that attempt to align short conserved [[sequence motif]]s among the sequences in the query set. This is usually done by first constructing a general global multiple sequence alignment, after which the highly [[conservation (genetics)|conserved]] regions are isolated and used to construct a set of profile matrices. The profile matrix for each conserved region is arranged like a scoring matrix but its frequency counts for each amino acid or nucleotide at each position are derived from the conserved region's character distribution rather than from a more general empirical distribution. The profile matrices are then used to search other sequences for occurrences of the motif they characterize. In cases where the original [[data set]] contained a small number of sequences, or only highly related sequences, [[pseudocount]]s are added to normalize the character distributions represented in the motif.

===Techniques inspired by computer science===
[[File:A profile HMM modelling a multiple sequence alignment.png|thumb|A profile HMM modelling a multiple sequence alignment]]

A variety of general [[Optimization (mathematics)|optimization]] algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem. [[Hidden Markov model]]s have been used to produce probability scores for a family of possible multiple sequence alignments for a given query set; although early HMM-based methods produced underwhelming performance, later applications have found them especially effective in detecting remotely related sequences because they are less susceptible to noise created by conservative or semiconservative substitutions.<ref name=karplus>{{cite journal | journal=Bioinformatics | volume=14 | issue=10 | pages= 846–856| year=1998 |author1=Karplus K |author2=Barrett C |author3=Hughey R. | title=Hidden Markov models for detecting remote protein homologies | pmid=9927713 | doi = 10.1093/bioinformatics/14.10.846 | doi-access=free | citeseerx=10.1.1.57.2762 }}</ref> [[Genetic algorithm]]s and [[simulated annealing]] have also been used in optimizing multiple sequence alignment scores as judged by a scoring function like the sum-of-pairs method. More complete details and software packages can be found in the main article [[multiple sequence alignment]].

The [[Burrows–Wheeler transform]] has been successfully applied to fast short read alignment in popular tools such as [[Bowtie (sequence analysis)|Bowtie]] and BWA. See [[FM-index]].