Editing Sequence alignment (section)

==Pairwise alignment==
Pairwise sequence alignment methods are used to find the best-matching piecewise (local or global) alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods;<ref name="mount"/> however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low [[information content]] - especially where the number of repetitions differ in the two sequences to be aligned.

===Maximal unique match===
One way of quantifying the utility of a given pairwise alignment is the '[[maximal unique match]]' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness.<ref name="Alignment of whole genomes">{{cite journal |last1=Delcher |first1=A. L. |last2=Kasif |first2=S. |last3=Fleishmann |first3=R.D. |last4=Peterson |first4=J. |last5=White |first5=O. |last6=Salzberg |first6=S.L. |title=Alignment of whole genomes |journal=Nucleic Acids Research |date=1999 |volume=27 |issue=11 |pages=2369–2376 |doi=10.1093/nar/30.11.2478 |pmid=10325427|pmc=148804 |doi-access=free }}</ref> in the [[multiple sequence alignment]] of [[genomes]] in [[computational biology]].  Identification of MUMs and other potential anchors, is the first step in larger alignment systems such as [[MUMmer]]. Anchors are the areas between two genomes where they are highly similar.  To understand what a MUM is we can break down each word in the acronym.  Match implies that the substring occurs in both sequences to be aligned.  Unique means that the substring occurs only once in each sequence.  Finally, maximal states that the substring is not part of another larger string that fulfills both prior requirements.  The idea behind this, is that long sequences that match exactly and occur only once in each genome are almost certainly part of the global alignment.

More precisely:

<blockquote>"Given two genomes A and B, Maximal Unique Match (MUM) substring is a common substring of A and B of length longer than a specified minimum length d (by default d= 20) such that
* it is maximal, that is, it cannot be extended on either end without incurring a mismatch; and
* it is unique in both sequences"<ref name="Algorithms in Bioinformatics">{{cite book |last1=Wing-Kin |first1=Sung |title=Algorithms in Bioinformatics: A Practical Introduction |date=2010 |publisher=Chapman & Hall/CRC Press |location=Boca Raton |isbn=978-1420070330 |edition=First}}</ref></blockquote>

===Dot-matrix methods===
{{Main|Dot plot (bioinformatics)}}
{| style="float:right"
| [[File:Mup locus showing DNA repeats.jpg|thumb|200px|Self comparison of a part of a mouse strain genome. The dot-plot shows a patchwork of lines, demonstrating duplicated segments of DNA.]]
|}
{| style="float:right"
| [[Image:Zinc-finger-dot-plot.png|thumb|200px|A DNA [[Dot plot (bioinformatics)|dot plot]] of a [[human]] [[zinc finger]] [[transcription factor]] (GenBank ID NM_002383), showing regional [[self-similarity]]. The main diagonal represents the sequence's alignment with itself; lines off the main diagonal represent similar or repetitive patterns within the sequence. This is a typical example of a [[recurrence plot]].]]
|}

The dot-matrix approach, which implicitly produces a family of alignments for individual sequence regions, is qualitative and conceptually simple, though time-consuming to analyze on a large scale. In the absence of noise, it can be easy to visually identify certain sequence features—such as insertions, deletions, repeats, or [[inverted repeat]]s—from a dot-matrix plot. To construct a [[Dot plot (bioinformatics)|dot-matrix plot]], the two sequences are written along the top row and leftmost column of a two-dimensional [[matrix (mathematics)|matrix]] and a dot is placed at any point where the characters in the appropriate columns match—this is a typical [[recurrence plot]]. Some implementations vary the size or intensity of the dot depending on the degree of similarity of the two characters, to accommodate conservative substitutions. The dot plots of very closely related sequences will appear as a single line along the matrix's [[main diagonal]].

Problems with dot plots as an information display technique include: noise, lack of clarity, non-intuitiveness, difficulty extracting match summary statistics and match positions on the two sequences. There is also much wasted space where the match data is inherently duplicated across the diagonal and most of the actual area of the plot is taken up by either empty space or noise, and, finally, dot-plots are limited to two sequences. None of these limitations apply to Miropeats alignment diagrams but they have their own particular flaws.

Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect occurs when a protein consists of multiple similar [[structural domain]]s.

===Dynamic programming===
The technique of [[dynamic programming]] can be applied to produce global alignments via the [[Needleman-Wunsch algorithm]], and local alignments via the [[Smith-Waterman algorithm]]. In typical usage, protein alignments use a [[substitution matrix]] to assign scores to amino-acid matches or mismatches, and a [[gap penalty]] for matching an amino acid in one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in practice often simply assign a positive match score, a negative mismatch score, and a negative gap penalty. (In standard dynamic programming, the score of each amino acid position is independent of the identity of its neighbors, and therefore [[base stacking]] effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.){{citation needed|date=April 2024}}
A common extension to standard linear gap costs are affine gap costs. Here two different gap penalties are applied for opening a gap and for extending a gap. Typically the former is much larger than the latter, e.g. -10 for gap open and -2 for gap extension. This results in fewer gaps in an alignment and residues and gaps are kept together, traits more representative of biological sequences. The Gotoh algorithm implements affine gap costs by using three matrices.<ref>{{Cite journal |last=Gotoh |first=Osamu |date=1982-12-15 |title=An improved algorithm for matching biological sequences |url=https://linkinghub.elsevier.com/retrieve/pii/0022283682903989 |journal=Journal of Molecular Biology |volume=162 |issue=3 |pages=705–708 |doi=10.1016/0022-2836(82)90398-9 |pmid=7166760 |issn=0022-2836|url-access=subscription }}</ref><ref>{{Cite journal |last=Gotoh |first=Osamu |date=1999-01-01 |title=Multiple sequence alignment: Algorithms and applications |url=https://linkinghub.elsevier.com/retrieve/pii/S0065227X99800070 |journal=Advances in Biophysics |volume=36 |pages=159–206 |doi=10.1016/S0065-227X(99)80007-0 |pmid=10463075 |issn=0065-227X|url-access=subscription }}</ref>

Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account [[frameshift]] mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The [[BLAST (biotechnology)|BLAST]] and [[EMBOSS]] suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). More general methods are available from [[open-source software]] such as [http://www.ebi.ac.uk/Tools/psa/genewise/ GeneWise].{{citation needed|date=April 2024}}

The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of sequences or extremely long sequences.{{citation needed|date=April 2024}}

===Word methods===
Word methods, also known as ''k''-tuple methods, are [[heuristic]] methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools [[FASTA]] and the [[BLAST (biotechnology)|BLAST]] family.<ref name=mount/> Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.

In the FASTA method, the user defines a value ''k'' to use as the word length with which to search the database. The method is slower but more sensitive at lower values of ''k'', which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length ''k'', but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as [http://www.ebi.ac.uk/fasta33/ EMBL FASTA] and [https://www.ncbi.nlm.nih.gov/BLAST/ NCBI BLAST].