Editing Distance matrix (section)

== Bioinformatics ==
{{missing information|section|alignment-free distance measures (Mash, K(r), FastANI, Skmer etc.); need less weight on how to do alignment (especially with "dumb" DP) and more weight on how to get distance from alignment|date=December 2023}}
The distance matrix is widely used in the bioinformatics field,  and it is present in several methods, algorithms and programs. Distance matrices are used to represent [[protein]] structures in a coordinate-independent manner, as well as the pairwise distances between two sequences in [[sequence space]]. They are used in [[structural alignment|structural]] and [[sequence alignment|sequential]] alignment, and for the determination of protein structures from [[Nuclear magnetic resonance|NMR]] or [[X-ray crystallography]].

Sometimes it is more convenient to express data as a [[similarity matrix]].

It is also used to define the [[distance correlation]].

=== [[Sequence alignment]] ===
An alignment of two sequences is formed by inserting spaces in arbitrary locations along the sequences so that they end up with the same length and there are no two spaces at the same position of the two augmented sequences.<ref name=":0">{{Cite book |last=Sung |first=Wing-Kin |title=Algorithms in bioinformatics: A practical introduction |publisher=Chapman & Hall |year=2010 |isbn=978-1-4200-7033-0 |pages=29}}</ref> One of the primary methods for sequence alignment is [[dynamic programming]]. The method is used to fill the distance matrix and then obtain the alignment. In typical usage, for sequence alignment a matrix is used to assign scores to amino-acid matches or mismatches, and a gap penalty for matching an amino-acid in one sequence with a gap in the other.

==== Global alignment ====
The [[Needleman–Wunsch algorithm]] used to calculate global alignment uses dynamic programming to obtain the distance matrix.

==== Local alignment ====
The [[Smith–Waterman algorithm]] is also dynamic programming based which consists also in obtaining the distance matrix and then obtain the local alignment.

==== Multiple sequence alignment ====
[[Multiple sequence alignment]] is an extension of pairwise alignment to align several sequences at a time. Different MSA methods are based on the same idea of the distance matrix as global and local alignments.

* Center star method. This method defines a center sequence {{Math|''S''<sub>c</sub>}} which minimizes the distance between the sequence {{Math|''S''<sub>c</sub>}} and any other sequence {{Math|''S''<sub>i</sub>}}. Then it generates a multiple alignment {{Math|M}} for the set of sequences {{Math|''S''}} so that for every {{Math|''S''<sub>i</sub>}} the alignment distance {{Math|''d''<sub>''M''</sub>(''S''<sub>c</sub>,''S''<sub>i</sub>)}} is the optimal pairwise alignment. This method has the characteristic that the computed alignment for {{Math|''S''}} whose sum-of-pair distance is at most twice the optimal multiple alignment.
* Progressive alignment method. This heuristic method to create MSA first aligns the two most related sequences, and then it progressively aligns the next two most related sequences until all sequences are aligned.

There are other methods that have their own program due to their popularity:

* [[Clustal|ClustalW]]
* [[MUSCLE (alignment software)|MUSCLE]]
* [[MAFFT]]
* MANGO
* And many more

===== MAFFT =====
Multiple alignment using fast Fourier transform (MAFFT) is a program with an algorithm based on progressive alignment, and it offers various multiple alignment strategies. First, MAFFT constructs a distance matrix based on the number of shared 6-tuples. Second, it builds the guide tree based on the previous matrix. Third, it clusters the sequences with the help of the [[fast Fourier transform]] and starts the alignment. Based on the new alignment, it reconstructs the guide tree and align again.

=== Phylogenetic analysis ===
{{cleanup section|reason=Should be trimmed down and mostly packed up to the main article. That's how [[WP:SPINOFF]] works, right?|date=December 2023}}
{{Main|Distance matrices in phylogeny}}
To perform [[phylogenetic]] analysis, the first step is to reconstruct the phylogenetic tree: given a collection of species, the problem is to reconstruct or infer the ancestral relationships among the species, i.e., the phylogenetic tree among the species. Distance matrix methods perform this activity.

==== Distance matrix methods ====
Distance matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore require multiple sequences as an input. Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same [[interior node]] and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them.<ref name=":1">{{Cite book |last=Felsenstein |first=Joseph |title=Inferring phylogenies |publisher=Sinauer Associates |year=2003 |isbn=9780878931774}}</ref> Given  {{Math|''n''}} species, the input is an {{Math|''n'' × ''n''}} distance matrix {{Math|M}} where {{Math|''M''<sub>ij</sub>}} is the mutation distance between species {{Math|''i''}} and {{Math|''j''}} . The aim is to output a tree of degree {{Math|3}} which is consistent with the distance matrix.

They are frequently used as the basis for progressive and iterative types of [[multiple sequence alignment]]. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.<ref name=":1" /> Despite potential problems, distance methods are extremely fast, and they often produce a reasonable estimate of phylogeny. They also have certain benefits over the methods that use characters directly. Notably, distance methods allow use of data that may not be easily converted to character data, such as [[DNA–DNA hybridization]] assays.

The following are distance based methods for phylogeny reconstruction:

* Additive tree reconstruction
* [[UPGMA]]
* [[Neighbor joining]]
* [[Distance matrices in phylogeny|Fitch–Margoliash]]

===== Additive tree reconstruction =====
Additive tree reconstruction is based on additive and ultrametric distance matrices. These matrices have a special characteristic:

Consider an additive matrix {{Math|M}}. For any three species {{Math|i, j, k,}} the corresponding tree is unique.<ref name=":0" /> Every ultrametric distance matrix is an additive matrix. We can observe this property for the tree below, which consists on the species {{Math|i, j, k}}.
[[File:Unique_tree_additive_matrix.png|center|frameless|Phylogenetic tree from 3 species]]
The additive tree reconstruction technique starts with this tree. And then adds one more species each time, based on the distance matrix combined with the property mentioned above. For example, consider an additive matrix {{Math|M}} and 5 species {{Math|''a'', ''b'', ''c'', ''d''}} and {{Math|''e''}}. First we form an additive tree for two species {{Math|''a''}} and {{Math|''b''}}. Then we chose a third one, let's say {{Math|''c''}} and attach it to a point {{Math|''x''}} on the edge between {{Math|''a''}} and {{Math|''b''}}. The edge weights are computed with the property above. Next we add the fourth species {{Math|''d''}} to any of the edges. If we apply the property then we identify that {{Math|''d''}} should be attached to only one specific edge. Finally, we add {{Math|''e''}} following the same procedure as before.

===== UPGMA =====
The basic principle of UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is that similar species should be closer in the phylogenetic tree. Hence, it builds the tree by clustering similar sequences iteratively. The method works by building the phylogenetic tree bottom up from its leaves. Initially, we have {{Math|''n''}} leaves (or {{Math|''n''}} singleton trees), each representing a species in {{Math|''S''}}. Those {{Math|''n''}} leaves are referred as {{Math|''n''}} clusters. Then, we perform {{Math|''n''-1}} iterations. In each iteration, we identify two clusters {{Math|''C''<sub>1</sub>}} and {{Math|''C''<sub>2</sub>}} with the smallest average distance and merge them to form a bigger cluster {{Math|''C''}}. If we suppose {{Math|M}} is ultrametric, for any cluster {{Math|''C''}} created by the UPGMA algorithm, {{Math|''C''}} is a valid ultrametric tree.

===== Neighbor joining =====
Neighbor is a bottom-up clustering method. It takes a distance matrix specifying the distance between each pair of sequences. The algorithm starts with a completely unresolved tree, whose topology corresponds to that of a [[star network]], and iterates over the following steps until the tree is completely resolved and all branch lengths are known:

# Based on the current distance matrix calculate the matrix (defined below).
# Find the pair of distinct taxa i and j (i.e. with) for which has its lowest value. These taxa are joined to a newly created node, which is connected to the central node. 
# Calculate the distance from each of the [[Taxon|taxa]] in the pair to this new node.
# Calculate the distance from each of the taxa outside of this pair to the new node.
# Start the algorithm again, replacing the pair of joined neighbors with the new node and using the distances calculated in the previous step.<ref>{{Cite journal |last=Saitou |first=Naruya |date=1987 |title=The neighbor-joining method: A new method for reconstructing phylogenetic trees |url=https://academic.oup.com/mbe/article/4/4/406/1029664?login=false |journal=Molecular Biology and Evolution |volume=4|issue=4 |pages=406–425 |doi=10.1093/oxfordjournals.molbev.a040454 |pmid=3447015 |doi-access=free }}</ref>

===== Fitch–Margoliash =====
The Fitch–Margoliash method uses a weighted [[least squares]] method for clustering based on genetic distance. Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost.<ref>{{Cite journal |last=Fitch |first=Walter M. |date=1967 |title=Construction of Phylogenetic Trees: A method based on mutation distances as estimated from cytochrome c sequences is of general applicability. |url=https://www.science.org/doi/10.1126/science.155.3760.279 |journal=Science |volume=155|issue=3760 |pages=279–284 |doi=10.1126/science.155.3760.279 |pmid=5334057 }}</ref>