Sequence alignment
{{Short description|Process in bioinformatics that identifies equivalent sites within molecular sequences}} {{Use dmy dates|date=April 2017}} {{More citations needed|date=March 2009}} In [[bioinformatics]], a '''sequence alignment''' is a way of arranging the sequences of [[DNA]], [[RNA]], or protein to identify regions of similarity that may be a consequence of functional, [[structural biology|structural]], or [[evolution]]ary relationships between the sequences.<ref name=mount>{{cite book| author=Mount DM.| year=2004 | title=Bioinformatics: Sequence and Genome Analysis |edition=2nd | publisher= Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY. |isbn=978-0-87969-608-5}}</ref> Aligned sequences of [[nucleotide]] or [[amino acid]] residues are typically represented as rows within a [[matrix (mathematics)|matrix]]. Gaps are inserted between the [[Residue (chemistry)|residues]] so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the [[Edit distance|distance cost]] between strings in a [[natural language]], or to display financial data. [[File:Histone Alignment.png|thumb|595px|A sequence alignment, produced by [[ClustalO]], of mammalian [[histone]] proteins. <br /> Sequences are the [[Amino acid#Table of standard amino acid abbreviations and properties|amino acids]] for residues 120-180 of the proteins. Residues that are conserved across all sequences are highlighted in grey. 
Below the protein sequences is a key denoting [[conserved sequence]] (*), [[conservative mutation]]s (:), semi-conservative mutations (.), and [[segregating site|non-conservative mutations]] ( ).<ref>{{cite web|url=http://www.ebi.ac.uk/Tools/msa/clustalw2/help/faq.html#23|website=Clustal|title=Clustal FAQ #Symbols|access-date=8 December 2014|archive-url=https://web.archive.org/web/20161024045656/http://www.ebi.ac.uk/Tools/msa/clustalw2/help/faq.html#23|archive-date=24 October 2016|url-status=dead}}</ref>]] ==Interpretation== If two sequences in an alignment share a common ancestor, mismatches can be interpreted as [[point mutation]]s and gaps as [[indel]]s (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between [[amino acid]]s occupying a particular position in the sequence can be interpreted as a rough measure of how [[conservation (genetics)|conserved]] a particular region or [[sequence motif]] is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose [[side chain]]s have similar biochemical properties) in a particular region of the sequence, suggest <ref name=predict>{{cite journal |author1=Ng PC |author2=Henikoff S |date=May 2001 | title = Predicting deleterious amino acid substitutions | journal = Genome Res | volume = 11 | issue = 5| pages = 863–74 | pmid = 11337480 | doi=10.1101/gr.176601 | pmc=311071}}</ref> that this region has structural or functional importance. Although DNA and RNA [[nucleotide]] bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role. ==Alignment methods== Very short or very similar sequences can be aligned by hand. 
However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Various algorithms have been devised to produce high-quality sequence alignments, and the final results are occasionally adjusted by hand to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: ''global alignments'' and ''local alignments''. Calculating a global alignment is a form of [[global optimization]] that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity.<ref name="Polyanovsky2011">{{Cite journal | pmid = 22032267 | year = 2011 | last1 = Polyanovsky | first1 = V. O. | title = Comparative analysis of the quality of a global algorithm and a local algorithm for alignment of two sequences | journal = Algorithms for Molecular Biology | volume = 6 | issue = 1 | page = 25 | last2 = Roytberg | first2 = M. A. | last3 = Tumanyan | first3 = V. G. | doi = 10.1186/1748-7188-6-25 | pmc = 3223492 | s2cid = 2658261 | doi-access = free }}</ref> A variety of computational algorithms have been applied to the sequence alignment problem. These include slow but formally correct methods like [[dynamic programming]]. They also include efficient [[heuristic algorithm]]s and [[probability|probabilistic]] methods designed for large-scale database search, which are not guaranteed to find the best matches.

==Representations==
Alignments are commonly represented both graphically and in text format.
In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol is used to show identity between two columns; other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the [[conservation (genetics)|conservation]] of a given amino acid substitution. For multiple sequences the last row in each column is often the [[consensus sequence]] determined by the alignment; the consensus sequence is also often represented in graphical format with a [[sequence logo]] in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation.<ref name=Schneider>{{cite journal| journal=Nucleic Acids Res | volume=18 | pages=6097–6100 | year=1990 |author1=Schneider TD |author2=Stephens RM | title=Sequence logos: a new way to display consensus sequences |pmid=2172928 |pmc=332411 |url=|doi=10.1093/nar/18.20.6097| issue=20}}</ref> Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a limited number of input and output formats, such as [[FASTA format]] and [[GenBank]] format and the output is not easily editable. 
Several conversion programs that provide graphical and/or command line interfaces are available{{Dead link|date=August 2009}}, such as [https://web.archive.org/web/20071024223546/http://bioweb.pasteur.fr/seqanal/interfaces/readseq.html READSEQ] and [[EMBOSS]]. There are also several programming packages which provide this conversion functionality, such as [[BioPython]], [[BioRuby]] and [[BioPerl]].

The [[SAM (file format)|SAM/BAM files]] use the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string format to represent an alignment of a sequence to a reference by encoding a sequence of events (e.g. match/mismatch, insertions, deletions).<ref>{{Cite web|url=https://samtools.github.io/hts-specs/SAMv1.pdf|title=Sequence Alignment/Map Format Specification}}</ref>

===CIGAR Format===
Ref. : GTCGTAGAATA <br />
[[Read (biology)|Read]]: CACGTAG--TA <br />
CIGAR: 2S5M2D2M

where: <br />
2S = 2 soft-clipped bases (could be mismatches, or a read longer than the matched sequence) <br />
5M = 5 matches or mismatches <br />
2D = 2 deletions <br />
2M = 2 matches or mismatches

The original CIGAR format from the [https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate exonerate alignment program] did not distinguish between matches and mismatches: both were written with the M character. The SAMv1 spec document defines newer CIGAR codes. In most cases it is preferred to use the '=' and 'X' characters to denote matches and mismatches respectively, rather than the older 'M' character, which is ambiguous.

{| class="wikitable"
! CIGAR Code !! BAM Integer !! Description !! Consumes query !! Consumes reference
|-
| M || 0 || alignment match (can be a sequence match or mismatch) || yes || yes
|-
| I || 1 || insertion to the reference || yes || no
|-
| D || 2 || deletion from the reference || no || yes
|-
| N || 3 || skipped region from the reference || no || yes
|-
| S || 4 || soft clipping (clipped sequences present in SEQ) || yes || no
|-
| H || 5 || hard clipping (clipped sequences NOT present in SEQ) || no || no
|-
| P || 6 || padding (silent deletion from padded reference) || no || no
|-
| = || 7 || sequence match || yes || yes
|-
| X || 8 || sequence mismatch || yes || yes
|}

* "Consumes query" and "consumes reference" indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
* H can only be present as the first and/or last operation.
* S may only have H operations between them and the ends of the CIGAR string.
* For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
* The sum of the lengths of the M/I/S/=/X operations shall equal the length of SEQ.

==Global and local alignments==
Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the [[Needleman–Wunsch algorithm]], which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.
The [[Smith–Waterman algorithm]] is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.<ref name="Polyanovsky2011"/>

Hybrid methods, known as semi-global or "glocal" (short for '''glo'''bal-lo'''cal''') methods, search for the best possible partial alignment of the two sequences (in other words, the alignment is required to include one or both starts and one or both ends of the sequences). This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap.<ref name=brudno>{{cite journal|author1=Brudno M |author2=Malde S |author3=Poliakov A |author4=Do CB |author5=Couronne O |author6=Dubchak I |author7=Batzoglou S | year=2003 | title=Glocal alignment: finding rearrangements during alignment | journal= Bioinformatics | volume=Suppl 1| issue=90001| pages=i54–62| series=19 | pmid = 12855437| doi = 10.1093/bioinformatics/btg1005 | doi-access= }}</ref> Another case where semi-global alignment is useful is when one sequence is short (for example a gene sequence) and the other is very long (for example a chromosome sequence). In that case, the short sequence should be globally (fully) aligned but only a local (partial) alignment is desired for the long sequence.

The rapid growth of genetic data challenges the speed of current DNA sequence alignment algorithms, and the need for efficient and accurate DNA variant discovery calls for new approaches to real-time parallel processing.
[[Optical computing]] approaches have been suggested as promising alternatives to the current electrical implementations, yet their applicability remains to be tested [https://onlinelibrary.wiley.com/doi/abs/10.1002/jbio.201900227].

==Pairwise alignment==
Pairwise sequence alignment methods are used to find the best-matching piecewise (local or global) alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods;<ref name="mount"/> however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low [[information content]], especially where the number of repetitions differs in the two sequences to be aligned.

===Maximal unique match===
One way of quantifying the utility of a given pairwise alignment is the '[[maximal unique match]]' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness in the [[multiple sequence alignment]] of [[genomes]] in [[computational biology]].<ref name="Alignment of whole genomes">{{cite journal |last1=Delcher |first1=A. L. |last2=Kasif |first2=S. |last3=Fleishmann |first3=R.D. |last4=Peterson |first4=J. |last5=White |first5=O. |last6=Salzberg |first6=S.L. |title=Alignment of whole genomes |journal=Nucleic Acids Research |date=1999 |volume=27 |issue=11 |pages=2369–2376 |doi=10.1093/nar/30.11.2478 |pmid=10325427|pmc=148804 |doi-access=free }}</ref> Identification of MUMs and other potential anchors is the first step in larger alignment systems such as [[MUMmer]].
Anchors are the areas between two genomes where they are highly similar. To understand what a MUM is we can break down each word in the acronym. Match implies that the substring occurs in both sequences to be aligned. Unique means that the substring occurs only once in each sequence. Finally, maximal states that the substring is not part of another larger string that fulfills both prior requirements. The idea behind this, is that long sequences that match exactly and occur only once in each genome are almost certainly part of the global alignment. More precisely: <blockquote>"Given two genomes A and B, Maximal Unique Match (MUM) substring is a common substring of A and B of length longer than a specified minimum length d (by default d= 20) such that * it is maximal, that is, it cannot be extended on either end without incurring a mismatch; and * it is unique in both sequences"<ref name="Algorithms in Bioinformatics">{{cite book |last1=Wing-Kin |first1=Sung |title=Algorithms in Bioinformatics: A Practical Introduction |date=2010 |publisher=Chapman & Hall/CRC Press |location=Boca Raton |isbn=978-1420070330 |edition=First}}</ref></blockquote> ===Dot-matrix methods=== {{Main|Dot plot (bioinformatics)}} {| style="float:right" | [[File:Mup locus showing DNA repeats.jpg|thumb|200px|Self comparison of a part of a mouse strain genome. The dot-plot shows a patchwork of lines, demonstrating duplicated segments of DNA.]] |} {| style="float:right" | [[Image:Zinc-finger-dot-plot.png|thumb|200px|A DNA [[Dot plot (bioinformatics)|dot plot]] of a [[human]] [[zinc finger]] [[transcription factor]] (GenBank ID NM_002383), showing regional [[self-similarity]]. The main diagonal represents the sequence's alignment with itself; lines off the main diagonal represent similar or repetitive patterns within the sequence. 
This is a typical example of a [[recurrence plot]].]] |} The dot-matrix approach, which implicitly produces a family of alignments for individual sequence regions, is qualitative and conceptually simple, though time-consuming to analyze on a large scale. In the absence of noise, it can be easy to visually identify certain sequence features—such as insertions, deletions, repeats, or [[inverted repeat]]s—from a dot-matrix plot. To construct a [[Dot plot (bioinformatics)|dot-matrix plot]], the two sequences are written along the top row and leftmost column of a two-dimensional [[matrix (mathematics)|matrix]] and a dot is placed at any point where the characters in the appropriate columns match—this is a typical [[recurrence plot]]. Some implementations vary the size or intensity of the dot depending on the degree of similarity of the two characters, to accommodate conservative substitutions. The dot plots of very closely related sequences will appear as a single line along the matrix's [[main diagonal]]. Problems with dot plots as an information display technique include: noise, lack of clarity, non-intuitiveness, difficulty extracting match summary statistics and match positions on the two sequences. There is also much wasted space where the match data is inherently duplicated across the diagonal and most of the actual area of the plot is taken up by either empty space or noise, and, finally, dot-plots are limited to two sequences. None of these limitations apply to Miropeats alignment diagrams but they have their own particular flaws. Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect occurs when a protein consists of multiple similar [[structural domain]]s. 
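
The construction described in this section can be sketched in a few lines of Python. This is a minimal illustration that assumes exact character identity as the only match criterion; the function names <code>dot_plot</code>, <code>render</code> and <code>diagonal_matches</code> are hypothetical and do not come from any particular tool.

```python
def dot_plot(seq_a, seq_b):
    """Return the dot matrix as a list of rows of booleans.

    Columns follow seq_a (written along the top row) and rows follow
    seq_b (written down the leftmost column). A cell is True where the
    two characters are identical.
    """
    return [[a == b for a in seq_a] for b in seq_b]


def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


def diagonal_matches(matrix, offset):
    """Count the dots on the diagonal where column - row == offset.

    offset 0 is the main diagonal; in a self-comparison, a well-filled
    diagonal at a non-zero offset points to a repeated segment.
    """
    count = 0
    for row_index, row in enumerate(matrix):
        col_index = row_index + offset
        if 0 <= col_index < len(row) and row[col_index]:
            count += 1
    return count


if __name__ == "__main__":
    # Self-comparison of a sequence made of two copies of "ABC".
    plot = dot_plot("ABCABC", "ABCABC")
    print(render(plot))
    print(diagonal_matches(plot, 3))  # dots on the repeat diagonal
```

Plotting a sequence against itself fills the main diagonal completely, and the tandem copy in the toy input appears as a second, shorter diagonal three columns to the right. Implementations that grade dots by similarity would replace the equality test with a lookup in a substitution matrix.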
===Dynamic programming===
The technique of [[dynamic programming]] can be applied to produce global alignments via the [[Needleman–Wunsch algorithm]], and local alignments via the [[Smith–Waterman algorithm]]. In typical usage, protein alignments use a [[substitution matrix]] to assign scores to amino-acid matches or mismatches, and a [[gap penalty]] for matching an amino acid in one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in practice often simply assign a positive match score, a negative mismatch score, and a negative gap penalty. (In standard dynamic programming, the score of each position is independent of the identity of its neighbors, and therefore effects such as [[base stacking]] are not taken into account. However, it is possible to account for such effects by modifying the algorithm.){{citation needed|date=April 2024}}

A common extension to the standard linear gap cost is the affine gap cost. Here two different gap penalties are applied for opening a gap and for extending a gap. Typically the former is much larger than the latter, e.g. -10 for gap open and -2 for gap extension. This results in fewer but longer gaps, so that residues and gaps are each kept together, a trait more representative of biological sequences.
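
As an illustration, the affine scheme can be written as a dynamic program over three score tables. The sketch below computes only the optimal global score, not the alignment itself. It reuses the -10 and -2 penalties from the example above, while the match and mismatch scores (5 and -4) are arbitrary assumptions. It also assumes that the first position of a gap costs <code>gap_open</code> and every further position costs <code>gap_extend</code>; other conventions exist. The function name <code>affine_global_score</code> is hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=5, mismatch=-4, gap_open=-10, gap_extend=-2):
    """Optimal global alignment score of a and b under affine gap costs.

    Three tables are filled, each indexed by prefix lengths (i, j):
      M: a[i-1] is aligned to b[j-1]
      X: a[i-1] is aligned to a gap (gap in b)
      Y: b[j-1] is aligned to a gap (gap in a)
    """
    n, m = len(a), len(b)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0
    # Borders: a prefix aligned entirely against one leading gap.
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Align two residues after any of the three states.
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + pair
            # Open a new gap in b, or extend the one already open.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            # Open a new gap in a, or extend the one already open.
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)

    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # One gap of length two (open + extend) is cheaper than two separate gaps.
    print(affine_global_score("ACGT", "AT"))
```

Keeping a separate table per state is what lets the recurrence tell "open a gap" from "extend a gap" without looking further back than one cell, so the running time stays proportional to the product of the two sequence lengths.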
The Gotoh algorithm implements affine gap costs by using three matrices.<ref>{{Cite journal |last=Gotoh |first=Osamu |date=1982-12-15 |title=An improved algorithm for matching biological sequences |url=https://linkinghub.elsevier.com/retrieve/pii/0022283682903989 |journal=Journal of Molecular Biology |volume=162 |issue=3 |pages=705–708 |doi=10.1016/0022-2836(82)90398-9 |pmid=7166760 |issn=0022-2836|url-access=subscription }}</ref><ref>{{Cite journal |last=Gotoh |first=Osamu |date=1999-01-01 |title=Multiple sequence alignment: Algorithms and applications |url=https://linkinghub.elsevier.com/retrieve/pii/S0065227X99800070 |journal=Advances in Biophysics |volume=36 |pages=159–206 |doi=10.1016/S0065-227X(99)80007-0 |pmid=10463075 |issn=0065-227X|url-access=subscription }}</ref> Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account [[frameshift]] mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The [[BLAST (biotechnology)|BLAST]] and [[EMBOSS]] suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). 
More general methods are available from [[open-source software]] such as [http://www.ebi.ac.uk/Tools/psa/genewise/ GeneWise].{{citation needed|date=April 2024}} The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of sequences or extremely long sequences.{{citation needed|date=April 2024}} ===Word methods=== Word methods, also known as ''k''-tuple methods, are [[heuristic]] methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools [[FASTA]] and the [[BLAST (biotechnology)|BLAST]] family.<ref name=mount/> Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated. In the FASTA method, the user defines a value ''k'' to use as the word length with which to search the database. The method is slower but more sensitive at lower values of ''k'', which are also preferred for searches involving a very short query sequence. 
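
The offset idea can be illustrated with a short sketch. It is not the FASTA or BLAST implementation: it assumes exact word matches, uses every overlapping word of length ''k'' rather than a nonoverlapping subset, and stops at the offset histogram instead of going on to a more sensitive alignment step. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_positions(seq, k):
    """Map each word of length k in seq to the list of its start positions."""
    index = defaultdict(list)
    for start in range(len(seq) - k + 1):
        index[seq[start:start + k]].append(start)
    return index


def offset_histogram(query, target, k):
    """Count word matches per offset (target position - query position).

    An offset shared by many distinct words marks a diagonal along
    which the two sequences probably align.
    """
    index = kmer_positions(target, k)
    offsets = Counter()
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for t_pos in index.get(word, ()):
            offsets[t_pos - q_pos] += 1
    return offsets


if __name__ == "__main__":
    hist = offset_histogram("GATTACA", "TTGATTACATT", k=3)
    # The most frequent offset is the candidate region of alignment.
    print(hist.most_common(1))
```

In the toy input the query occurs two positions into the target, so five words vote for offset 2, while a single spurious hit of the word ATT votes for offset 7. A lower ''k'' yields more hits, including more spurious ones, which is the trade-off between speed and sensitivity described above.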
The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length ''k'', but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as [http://www.ebi.ac.uk/fasta33/ EMBL FASTA] and [https://www.ncbi.nlm.nih.gov/BLAST/ NCBI BLAST]. ==Multiple sequence alignment== {{Main|Multiple sequence alignment}} [[Image:Hemagglutinin-alignments.png|right|thumb|300px|Alignment of 27 [[avian influenza]] [[hemagglutinin]] protein sequences colored by residue conservation (top) and residue properties (bottom)]] [[Multiple sequence alignment]] is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying [[conservation (genetics)|conserved]] sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and [[reaction mechanism|mechanistic]] information to locate the catalytic [[active site]]s of [[enzyme]]s. Alignments are also used to aid in establishing evolutionary relationships by constructing [[phylogenetic tree]]s. 
Multiple sequence alignments are computationally difficult to produce and most formulations of the problem lead to [[NP-complete]] combinatorial optimization problems.<ref name=wang>{{cite journal | journal=J Comput Biol | volume=1 | pages=337–48 | year=1994 |author1=Wang L |author2=Jiang T. | title=On the complexity of multiple sequence alignment | pmid=8790475 | doi = 10.1089/cmb.1994.1.337| issue=4 | citeseerx=10.1.1.408.894 }}</ref><ref name=elias>{{cite journal | journal=J Comput Biol | volume=13 | pages=1323–1339 | year=2006 | author=Elias, Isaac | title=Settling the intractability of multiple alignment | pmid=17037961 | doi =10.1089/cmb.2006.13.1323 | issue=7 | citeseerx=10.1.1.6.256 }}</ref> Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences. ===Dynamic programming=== The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and [[computer memory|memory]], it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the ''n''-dimensional equivalent of the sequence matrix formed from two sequences, where ''n'' is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. 
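
A rough sense of the cost can be had from simple arithmetic. The sketch below assumes the straightforward formulation in which the full ''n''-dimensional lattice is filled and every cell is reached from each combination of "advance or insert a gap" across the ''n'' sequences, excluding the all-gap column; the function names are illustrative only.

```python
from math import prod


def lattice_cells(lengths):
    """Number of cells in the full lattice: one axis of length L + 1 per sequence."""
    return prod(length + 1 for length in lengths)


def predecessors_per_cell(num_sequences):
    """Moves into a cell: each sequence advances or gaps, minus the all-gap move."""
    return 2 ** num_sequences - 1


if __name__ == "__main__":
    for n in (2, 3, 4):
        lengths = [100] * n
        print(n, lattice_cells(lengths), predecessors_per_cell(n))
```

Under these assumptions, two sequences of 100 residues need 10,201 cells with 3 predecessors each, whereas four such sequences already need 104,060,401 cells with 15 predecessors each.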
One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" [[objective function]], has been implemented in the [https://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html MSA] software package.<ref name=lipman>{{cite journal | journal=Proc Natl Acad Sci USA | volume=86 | pages=4412–5 | year=1989 |author1=Lipman DJ |author2=Altschul SF |author3=Kececioglu JD | title=A tool for multiple sequence alignment | pmid=2734293 | doi=10.1073/pnas.86.12.4412 | issue=12 | pmc=287279 | bibcode=1989PNAS...86.4412L | doi-access=free }}</ref> ===Progressive methods=== Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to [[FASTA]]. Progressive alignment results are dependent on the choice of "most related" sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy. Many variations of the [[Clustal]] progressive implementation<ref name=higgins>{{cite journal | journal=Gene | volume=73 | issue=1 | pages=237–44 | year=1988 | author=[[Desmond G. Higgins|Higgins DG]], Sharp PM | title=CLUSTAL: a package for performing multiple sequence alignment on a microcomputer | pmid=3243435 | doi = 10.1016/0378-1119(88)90330-7 }}</ref><ref name=thompson>{{cite journal | journal=Nucleic Acids Res | volume=22 | pages=4673–80 | year=1994 | author1=Thompson JD| author2-link=Desmond G. 
Higgins |author2= Higgins DG|author3= Gibson TJ. | title=CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice | pmid=7984417 |pmc=308517 |url=|doi=10.1093/nar/22.22.4673 | issue=22 }}</ref><ref name=chenna>{{cite journal | journal=Nucleic Acids Res | volume=31 | pages=3497–500 | year=2003 |author1=Chenna R |author2=Sugawara H |author3=Koike T |author4=Lopez R |author5=Gibson TJ |author6=Higgins DG |author7=Thompson JD. | title=Multiple sequence alignment with the Clustal series of programs | url= | pmid=12824352 | doi = 10.1093/nar/gkg500 | issue=13 | pmc=168907 }}</ref> are used for multiple sequence alignment, phylogenetic tree construction, and as input for [[protein structure prediction]]. A slower but more accurate variant of the progressive method is known as [[T-Coffee]].<ref name=notredame>{{cite journal | journal=J Mol Biol | volume=302 | issue=1 | pages=205–17 | year=2000 | author1=Notredame C| author2-link=Desmond G. Higgins |author2= Higgins DG|author3= Heringa J. | title=T-Coffee: A novel method for fast and accurate multiple sequence alignment | pmid=10964570 | doi = 10.1006/jmbi.2000.4042 | s2cid=10189971 }}</ref> ===Iterative methods=== Iterative methods attempt to improve on the heavy dependence on the accuracy of the initial pairwise alignments, which is the weak point of the progressive methods. Iterative methods optimize an [[objective function]] based on a selected alignment scoring method by assigning an initial global alignment and then realigning sequence subsets. The realigned subsets are then themselves aligned to produce the next iteration's multiple sequence alignment. 
Various ways of selecting the sequence subgroups and objective function have been reviewed by Hirosawa et al.<ref name=hirosawa>{{cite journal | journal=Comput Appl Biosci | volume=11 | pages=13–8 | year=1995 |author1=Hirosawa M |author2=Totoki Y |author3=Hoshida M |author4=Ishikawa M. | title=Comprehensive study on iterative algorithms of multiple sequence alignment | pmid=7796270 | doi = 10.1093/bioinformatics/11.1.13 | issue=1 }}</ref>

===Motif finding===
Motif finding, also known as profile analysis, constructs global multiple sequence alignments that attempt to align short conserved [[sequence motif]]s among the sequences in the query set. This is usually done by first constructing a general global multiple sequence alignment, after which the highly [[conservation (genetics)|conserved]] regions are isolated and used to construct a set of profile matrices. The profile matrix for each conserved region is arranged like a scoring matrix but its frequency counts for each amino acid or nucleotide at each position are derived from the conserved region's character distribution rather than from a more general empirical distribution. The profile matrices are then used to search other sequences for occurrences of the motif they characterize. In cases where the original [[data set]] contained a small number of sequences, or only highly related sequences, [[pseudocount]]s are added to normalize the character distributions represented in the motif.

===Techniques inspired by computer science===
[[File:A profile HMM modelling a multiple sequence alignment.png|thumb|A profile HMM modelling a multiple sequence alignment]]
A variety of general [[Optimization (mathematics)|optimization]] algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem.
[[Hidden Markov model]]s have been used to produce probability scores for a family of possible multiple sequence alignments for a given query set; although early HMM-based methods produced underwhelming performance, later applications have found them especially effective in detecting remotely related sequences because they are less susceptible to noise created by conservative or semiconservative substitutions.<ref name=karplus>{{cite journal | journal=Bioinformatics | volume=14 | issue=10 | pages= 846–856| year=1998 |author1=Karplus K |author2=Barrett C |author3=Hughey R. | title=Hidden Markov models for detecting remote protein homologies | pmid=9927713 | doi = 10.1093/bioinformatics/14.10.846 | doi-access=free | citeseerx=10.1.1.57.2762 }}</ref> [[Genetic algorithm]]s and [[simulated annealing]] have also been used in optimizing multiple sequence alignment scores as judged by a scoring function like the sum-of-pairs method. More complete details and software packages can be found in the main article [[multiple sequence alignment]]. The [[Burrows–Wheeler transform]] has been successfully applied to fast short read alignment in popular tools such as [[Bowtie (sequence analysis)|Bowtie]] and BWA. See [[FM-index]]. ==Structural alignment== {{Main|Structural alignment}} Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the [[secondary structure|secondary]] and [[tertiary structure]] of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through [[X-ray crystallography]] or [[NMR spectroscopy]]). 
Because protein and RNA structures are more evolutionarily conserved than their sequences,<ref name=chothia>{{cite journal | journal=EMBO J | volume=5 | issue=4 | pages=823–6 |date=April 1986 |author1=Chothia C |author2=Lesk AM. | title=The relation between the divergence of sequence and structure in proteins | pmid=3709526 |pmc=1166865 | doi=10.1002/j.1460-2075.1986.tb04288.x }}</ref> structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity. Structural alignments are used as the "gold standard" in evaluating alignments for homology-based [[protein structure prediction]]<ref name=skolnick>{{cite journal | journal=Proc Natl Acad Sci USA | volume=102 | pages=1029–34 | year=2005 |author1=Zhang Y |author2=Skolnick J. | title=The protein structure prediction problem could be solved using the current PDB library | pmid=15653774 | doi = 10.1073/pnas.0407152101 | issue=4 | pmc=545829 | bibcode=2005PNAS..102.1029Z | doi-access=free }}</ref> because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. Structural alignments cannot themselves be used in structure prediction, however, because at least one sequence in the query set is the target to be modeled, whose structure is not known. 
It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.<ref name=skolnick/> ===DALI=== The DALI method, or [[distance matrix]] alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences.<ref name=holm>{{cite journal | journal=Science | volume=273 | pages=595–603 | year=1996 |author1=Holm L |author2=Sander C | title=Mapping the protein universe | pmid=8662544 | doi = 10.1126/science.273.5275.595 | issue=5275 | bibcode=1996Sci...273..595H | s2cid=7509134 }}</ref> It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the [[Protein Data Bank]] (PDB). It has been used to construct the [[Families of structurally similar proteins|FSSP]] structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at [https://web.archive.org/web/20090301064750/http://ekhidna.biocenter.helsinki.fi/dali_server/start DALI] and the FSSP is located at [https://web.archive.org/web/20051125045348/http://ekhidna.biocenter.helsinki.fi/dali/start The Dali Database]. ===SSAP=== SSAP (sequential structure alignment program) is a dynamic programming-based method of structural alignment that uses atom-to-atom vectors in structure space as comparison points. It has been extended since its original description to include multiple as well as pairwise alignments,<ref name=taylor>{{cite journal|journal=Protein Sci |volume=3 |pages=1858–70 |year=1994 |author1=Taylor WR |author2=Flores TP |author3=Orengo CA. 
|title=Multiple protein structure alignment |pmid=7849601 |doi=10.1002/pro.5560031025 |issue=10 |pmc=2142613 }}</ref> and has been used in the construction of the [[CATH]] (Class, Architecture, Topology, Homology) hierarchical database classification of protein folds.<ref name=orengo>{{cite journal | journal=Structure | volume=5 | pages=1093–108 | year=1997 |author1=Orengo CA |author2=Michie AD |author3=Jones S |author4=Jones DT |author5=Swindells MB |author6=Thornton JM | title=CATH--a hierarchic classification of protein domain structures | pmid=9309224 | doi=10.1016/S0969-2126(97)00260-8 | issue=8 | doi-access=free }}</ref> The CATH database can be accessed at [http://www.cathdb.info/ CATH Protein Structure Classification]. ===Combinatorial extension=== The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment.<ref name=shindyalov>{{cite journal | journal=Protein Eng | volume=11 | pages=739–47 | year=1998 |author1=Shindyalov IN |author2=Bourne PE. | title=Protein structure alignment by incremental combinatorial extension (CE) of the optimal path | pmid=9796821 | doi = 10.1093/protein/11.9.739 | issue=9 | doi-access=free }}</ref> Based on measures such as rigid-body [[Root mean square deviation (bioinformatics)|root mean square distance]], residue distances, local secondary structure, and surrounding environmental features such as residue neighbor [[hydrophobic]]ity, local alignments called "aligned fragment pairs" are generated and used to build a similarity matrix representing all possible structural alignments within predefined cutoff criteria. A path from one protein structure state to the other is then traced through the matrix by extending the growing alignment one fragment at a time. The optimal such path defines the combinatorial-extension alignment. 
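One simple way to compare two short fragments by local geometry, without first superimposing them, is to compare their internal distance matrices. The sketch below is a deliberately simplified illustration and is not the published combinatorial-extension scoring; the function names and the toy coordinates are hypothetical, and each fragment is assumed to be a list of three-dimensional points such as alpha-carbon positions.

```python
import math


def internal_distances(fragment):
    """Return the matrix of pairwise distances between a fragment's points."""
    return [[math.dist(p, q) for q in fragment] for p in fragment]


def fragment_dissimilarity(frag_a, frag_b):
    """Mean absolute difference between the two internal distance matrices.

    A value of 0.0 means the fragments have identical internal geometry.
    """
    if len(frag_a) != len(frag_b) or len(frag_a) < 2:
        raise ValueError("fragments must have equal length of at least 2")
    dist_a = internal_distances(frag_a)
    dist_b = internal_distances(frag_b)
    n = len(frag_a)
    total = sum(
        abs(dist_a[i][j] - dist_b[i][j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) // 2)


# Toy coordinates (made up for illustration only).
fragment = [(0, 0, 0), (3, 0, 0), (3, 4, 0)]
shifted = [(x + 1, y + 2, z + 3) for x, y, z in fragment]  # rigid translation
stretched = [(2 * x, 2 * y, 2 * z) for x, y, z in fragment]  # different shape
```

Because internal distances do not change under rotation or translation, `fragment` and `shifted` compare as identical, whereas `stretched` does not. Such a measure cannot distinguish a fragment from its mirror image, which is one reason practical methods combine several criteria.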
A web-based server implementing the method and providing a database of pairwise alignments of structures in the Protein Data Bank is located at the [https://web.archive.org/web/19981203071023/http://cl.sdsc.edu/ Combinatorial Extension] website. ==Phylogenetic analysis== {{Main|Computational phylogenetics}} Phylogenetics and sequence alignment are closely related fields because both need to evaluate sequence relatedness.<ref name=ortet>{{cite journal|author1=Ortet P |author2=Bastien O | year=2010 | title=Where Does the Alignment Score Distribution Shape Come from? | journal= Evolutionary Bioinformatics | volume=6| pages=159–187| pmid = 21258650| doi = 10.4137/EBO.S5875 | url=http://www.la-press.com/where-does-the-alignment-score-distribution-shape-come-from-article-a2393| pmc=3023300}}</ref> The field of [[phylogenetics]] makes extensive use of sequence alignments in the construction and interpretation of [[phylogenetic tree]]s, which are used to classify the evolutionary relationships between homologous [[gene]]s represented in the [[genome]]s of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young [[most recent common ancestor]], while low identity suggests that the divergence is more ancient. This approximation, which reflects the "[[molecular clock]]" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the [[coalescence (genetics)|coalescence]] time), assumes that the effects of mutation and [[natural selection|selection]] are constant across sequence lineages. Therefore, it does not account for possible differences among organisms or species in the rates of [[DNA repair]] or the possible functional conservation of specific regions in a sequence. 
(In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between [[silent mutation]]s that do not alter the meaning of a given [[codon]] and other mutations that result in a different [[amino acid]] being incorporated into the protein). More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes. Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly [[heuristic]] because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is [[NP-hard]].<ref name=felsenstein>{{cite book| author=Felsenstein J. | year=2004| title=Inferring Phylogenies | publisher= Sinauer Associates: Sunderland, MA | isbn=978-0-87893-177-4}}</ref> ===Assessment of significance=== Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible that [[convergent evolution]] can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures. 
In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts. Methods of statistical significance estimation for gapped sequence alignments are available in the literature.<ref name="ortet"/><ref name=altschul>{{cite book|author1=Altschul SF |author2=Gish W |chapter=Local alignment statistics |title=Computer Methods for Macromolecular Sequence Analysis | year=1996 | volume=266 | pages = 460–480|doi=10.1016/S0076-6879(96)66029-7|pmid=8743700 |series=Methods in Enzymology|isbn=9780121821678}}</ref><ref name=hartmann>{{cite journal| author=Hartmann AK| year=2002| title=Sampling rare events: statistics of local sequence alignments| journal= Phys. Rev. 
E| volume=65| page=056102|doi=10.1103/PhysRevE.65.056102| pmid=12059642| issue=5|arxiv=cond-mat/0108201|bibcode=2002PhRvE..65e6102H| s2cid=193085}}</ref><ref name=newberg>{{cite journal| author=Newberg LA | year=2008 | title=Significance of gapped sequence alignments | journal= J Comput Biol| volume=15| pages=1187–1194 | pmid = 18973434 | doi=10.1089/cmb.2008.0125| issue=9| pmc=2737730}}</ref><ref name=eddy>{{cite journal| author=Eddy SR| year=2008 | title=A probabilistic model of local sequence alignment that simplifies statistical significance estimation | journal= PLOS Comput Biol | volume=4| editor1-first=Burkhard| pages=e1000069 | pmid = 18516236| editor1-last=Rost | doi=10.1371/journal.pcbi.1000069| issue=5| pmc=2396288| last2=Rost| first2=Burkhard| bibcode=2008PLSCB...4E0069E| s2cid=15640896 | doi-access=free }}</ref><ref name=bastien>{{cite journal|author1=Bastien O |author2=Aude JC |author3=Roy S |author4=Marechal E | year=2004 | title=Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics | journal= Bioinformatics | volume=20| issue=4| pages=534–537| pmid = 14990449| doi = 10.1093/bioinformatics/btg440 | doi-access=free | citeseerx=10.1.1.602.6979 }}</ref><ref name=agrawal11>{{cite journal|author1=Agrawal A |author2=Huang X | year=2011| title=Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices|journal= IEEE/ACM Transactions on Computational Biology and Bioinformatics| volume=8| pages=194–205|doi=10.1109/TCBB.2009.69|pmid=21071807 | issue=1|s2cid=6559731 }}</ref><ref name=agrawal08>{{cite journal| author1=Agrawal A| author2=Brendel VP| author3=Huang X| year=2008| title=Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment| journal=International Journal of Computational Biology and Drug Design| volume=1| pages=347–367| 
doi=10.1504/IJCBDD.2008.022207| pmid=20063463| url=http://inderscience.metapress.com/content/1558538106522500/| issue=4| url-status=dead| archive-url=https://archive.today/20130128163812/http://inderscience.metapress.com/content/1558538106522500/| archive-date=28 January 2013| df=dmy-all| url-access=subscription}}</ref> ===Assessment of credibility=== Statistical significance indicates the probability that an alignment of a given quality could arise by chance, but does not indicate how much superior a given alignment is to alternative alignments of the same sequences. Measures of alignment credibility indicate the extent to which the best scoring alignments for a given pair of sequences are substantially similar. Methods of alignment credibility estimation for gapped sequence alignments are available in the literature.<ref name=NewbergLawrence2009>{{cite journal|author1=Newberg LA |author2=Lawrence CE | year=2009 | title=Exact Calculation of Distributions on Integers, with Application to Sequence Alignment | journal= J Comput Biol| volume=16| pages=1–18 | pmid = 19119992 | doi=10.1089/cmb.2008.0137| issue=1| pmc=2858568}}</ref> ===Scoring functions=== The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using [[substitution matrix|substitution matrices]] that reflect the probabilities of given character-to-character substitutions. A series of matrices called [[Point accepted mutation|PAM matrices]] (Point Accepted Mutation matrices, originally defined by [[Margaret Dayhoff]] and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as [[BLOSUM]] (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. 
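Entries in such substitution matrices are commonly expressed as log-odds scores, which compare how often two residues are observed aligned with how often they would be paired by chance. The sketch below illustrates that calculation; the function name and the frequencies are made up for the example, and the scale factor of 2 (half-bit units) is an assumption chosen for illustration rather than a description of any specific published matrix.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded, scaled log2 ratio of observed to chance pairing frequency.

    pair_freq      -- assumed frequency of residue a aligned with residue b
    freq_a, freq_b -- assumed background frequencies of the two residues
    scale          -- multiplier applied before rounding (assumed value)
    """
    expected = freq_a * freq_b  # pairing frequency if residues were independent
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical frequencies: a pair seen four times more often than chance
# scores positive, a pair seen four times less often scores negative.
favoured = log_odds_score(0.04, 0.1, 0.1)
disfavoured = log_odds_score(0.0025, 0.1, 0.1)
```

A positive score therefore marks a substitution that is seen more often than chance would predict, a negative score one that is seen less often, and a score near zero one that is uninformative.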
Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. [[Gap penalty|Gap penalties]] account for the introduction of a gap (on the evolutionary model, an insertion or deletion mutation) in both nucleotide and protein sequences, and the penalty values should therefore reflect the expected rate of such mutations. The quality of the alignments produced thus depends on the quality of the scoring function. It can be instructive to run the same alignment several times with different scoring matrices or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment change when the alignment parameters are varied. ==Other biological uses== Sequenced RNA, such as [[expressed sequence tags]] and full-length mRNAs, can be aligned to a sequenced genome to locate genes and obtain information about [[alternative splicing]]<ref>{{cite book |author1=Kim N |author2=Lee C |title=Bioinformatics |chapter=Bioinformatics Detection of Alternative Splicing |volume=452 |pages=179–97 |year=2008 |pmid=18566765 |doi=10.1007/978-1-60327-159-2_9 |series=Methods in Molecular Biology |isbn=978-1-58829-707-5}}</ref> and [[RNA editing]].<ref>{{cite journal |vauthors=Li JB, Levanon EY, Yoon JK, etal |title=Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing |journal=Science |volume=324 |issue=5931 |pages=1210–3 |date=May 2009 |pmid=19478186 |doi=10.1126/science.1170995|bibcode=2009Sci...324.1210L |s2cid=31148824 }}</ref> Sequence alignment is also a part of [[genome assembly]], where sequences are aligned to find overlap so that ''[[contig]]s'' (long stretches of sequence) can be formed.<ref>{{cite journal 
|vauthors=Blazewicz J, Bryja M, Figlerowicz M, etal |title=Whole genome assembly from 454 sequencing output via modified DNA graph concept |journal=Comput Biol Chem |volume=33 |issue=3 |pages=224–30 |date=June 2009 |pmid=19477687 |doi=10.1016/j.compbiolchem.2009.04.005}}</ref> Another use is [[single nucleotide polymorphism|SNP]] analysis, where sequences from different individuals are aligned to find single basepairs that are often different in a population.<ref>{{cite journal |author1=Duran C |author2=Appleby N |author3=Vardy M |author4=Imelfort M |author5=Edwards D |author6=Batley J |title=Single nucleotide polymorphism discovery in barley using autoSNPdb |journal=Plant Biotechnol. J. |volume=7 |issue=4 |pages=326–33 |date=May 2009 |pmid=19386041 |doi=10.1111/j.1467-7652.2009.00407.x |doi-access=free |bibcode=2009PBioJ...7..326D }}</ref> ==Non-biological uses== The methods used for biological sequence alignment have also found applications in other fields, most notably in [[natural language processing]] and in [[Sequence analysis in social sciences|social sciences]], where the [[Needleman-Wunsch algorithm]] is usually referred to as [[Optimal matching]].<ref>{{cite journal|author1=Abbott A. |author2=Tsay A. | year=2000 | title=Sequence Analysis and Optimal Matching Methods in Sociology, Review and Prospect | journal=Sociological Methods and Research | volume=29|issue=1 | pages=3–33 | doi=10.1177/0049124100029001001|s2cid=121097811 }}</ref> Techniques that generate the set of elements from which words will be selected in [[natural language generation|natural-language generation]] algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of [[automated theorem proving|computer-generated mathematical proofs]].<ref name=Barzilay>{{cite book|author1=Barzilay R |author2=Lee L. 
|title=Proceedings of the ACL-02 conference on Empirical methods in natural language processing - EMNLP '02 |chapter=Bootstrapping lexical choice via multiple-sequence alignment |year=2002 | pages=164–171 | chapter-url=http://www.cs.cornell.edu/home/llee/papers/gen-msa.pdf| volume=10| doi=10.3115/1118693.1118715|arxiv=cs/0205065|bibcode=2002cs........5065B |s2cid=7521453 }}</ref> In the field of historical and comparative [[linguistics]], sequence alignment has been used to partially automate the [[comparative method (linguistics)|comparative method]] by which linguists traditionally reconstruct languages.<ref>{{cite thesis |author=Kondrak, Grzegorz |title=Algorithms for Language Reconstruction |publisher=University of Toronto |year=2002 |url=http://www.cs.ualberta.ca/~kondrak/papers/thesis.pdf |access-date=2007-01-21 |archive-url=https://web.archive.org/web/20081217043010/http://www.cs.ualberta.ca/~kondrak/papers/thesis.pdf |archive-date=17 December 2008 |url-status=dead }}</ref> Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time.<ref name=prinzie>{{cite journal|author1=Prinzie A. |author2=D. 
Van den Poel |year=2006 | url=http://econpapers.repec.org/paper/rugrugwps/05_2F292.htm | title=Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM | journal=Decision Support Systems | volume=42 | issue=2| pages= 508–526 | doi=10.1016/j.dss.2005.02.004| url-access=subscription }} See also Prinzie and Van den Poel's paper {{cite journal | url=http://econpapers.repec.org/paper/rugrugwps/07_2F442.htm | title=Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB models | year=2007 | journal=Decision Support Systems | volume=44 | issue=1 | pages= 28–45 | doi=10.1016/j.dss.2007.02.008 | author=Prinzie, A | last2=Vandenpoel | first2=D| url-access=subscription }}</ref> ==Software== {{Main|Sequence alignment software}} A more complete list of available software categorized by algorithm and alignment type is available at [[sequence alignment software]], but common software tools used for general sequence alignment tasks include ClustalW2<ref>{{cite web|url=http://www.ebi.ac.uk/Tools/msa/clustalw2/|title=ClustalW2 < Multiple Sequence Alignment < EMBL-EBI|last=EMBL-EBI|website=www.EBI.ac.uk|access-date=12 June 2017}}</ref> and T-coffee<ref>[https://web.archive.org/web/20080918022531/http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-coffee]</ref> for alignment, and BLAST<ref>{{cite web|url=http://blast.ncbi.nlm.nih.gov/Blast.cgi|title=BLAST: Basic Local Alignment Search Tool|website=blast.ncbi.nlm.NIH.gov|access-date=12 June 2017}}</ref> and FASTA3x<ref>{{cite web|url=http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml|title=UVA FASTA Server|website=fasta.bioch.Virginia.edu|access-date=12 June 2017}}</ref> for database searching. Commercial tools such as [[DNASTAR|DNASTAR Lasergene]], [[Geneious]], and [[PatternHunter]] are also available. 
Tools annotated as performing [http://edamontology.org/operation_0292 sequence alignment] are listed in the [https://bio.tools/?page=1&function=%22Sequence%20alignment%22&sort=score bio.tools] registry. Alignment algorithms and software can be directly compared to one another using a standardized set of [[Benchmark (computing)|benchmark]] reference multiple sequence alignments known as BAliBASE.<ref name=thompson2>{{cite journal | journal=Bioinformatics | volume=15 | pages=87–8 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O | title=BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs | pmid=10068696 | doi = 10.1093/bioinformatics/15.1.87 | issue=1 | doi-access=free }}</ref> The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.<ref>[https://web.archive.org/web/20121130084356/http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/prog_scores.html BAliBASE]</ref><ref name=thompson3>{{cite journal | journal=Nucleic Acids Res | volume=27 | pages=2682–90 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O. 
| title=A comprehensive comparison of multiple sequence alignment programs | url= | pmid=10373585 | doi = 10.1093/nar/27.13.2682 | issue=13 | pmc=148477 }}</ref> A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.<ref>{{cite web|url=http://3d-alignment.eu/|title=Multiple sequence alignment: Strap|website=3d-alignment.eu|access-date=12 June 2017}}</ref> ==See also== * [[Sequence homology]] * [[Sequence mining]] * [[BLAST (biotechnology)|BLAST]] * [[String searching algorithm]] * [[Alignment-free sequence analysis]] * [[UGENE]] * [[Needleman–Wunsch algorithm]] * [[Smith–Waterman algorithm|Smith-Waterman algorithm]] * [[Sequence analysis in social sciences]] ==References== {{Reflist|30em}} ==External links== {{Wikiversity|Dot-matrix methods}} {{Spoken Wikipedia|En-Sequence_alignment.ogg|date=2012-06-05}} * {{Commons category-inline}} {{Bioinformatics}} {{Strings}} {{Authority control}} [[Category:Bioinformatics algorithms]] [[Category:Computational phylogenetics]] [[Category:Sequence alignment algorithms| ]] [[Category:Evolutionary developmental biology]] [[Category:Algorithms on strings]]