Editing Distance matrix (section)

==== Distance matrix methods ====
Distance matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore require multiple sequences as an input. Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same [[interior node]] and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them.<ref name=":1">{{Cite book |last=Felsenstein |first=Joseph |title=Inferring phylogenies |publisher=Sinauer Associates |year=2003 |isbn=9780878931774}}</ref> Given  {{Math|''n''}} species, the input is an {{Math|''n'' × ''n''}} distance matrix {{Math|M}} where {{Math|''M''<sub>ij</sub>}} is the mutation distance between species {{Math|''i''}} and {{Math|''j''}} . The aim is to output a tree of degree {{Math|3}} which is consistent with the distance matrix.

They are frequently used as the basis for progressive and iterative types of [[multiple sequence alignment]]. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.<ref name=":1" /> Despite potential problems, distance methods are extremely fast, and they often produce a reasonable estimate of phylogeny. They also have certain benefits over the methods that use characters directly. Notably, distance methods allow use of data that may not be easily converted to character data, such as [[DNA–DNA hybridization]] assays.

The following are distance based methods for phylogeny reconstruction:

* Additive tree reconstruction
* [[UPGMA]]
* [[Neighbor joining]]
* [[Distance matrices in phylogeny|Fitch–Margoliash]]

===== Additive tree reconstruction =====
Additive tree reconstruction is based on additive and ultrametric distance matrices. These matrices have a special characteristic:

Consider an additive matrix {{Math|M}}. For any three species {{Math|i, j, k,}} the corresponding tree is unique.<ref name=":0" /> Every ultrametric distance matrix is an additive matrix. We can observe this property for the tree below, which consists on the species {{Math|i, j, k}}.
[[File:Unique_tree_additive_matrix.png|center|frameless|Phylogenetic tree from 3 species]]
The additive tree reconstruction technique starts with this tree. And then adds one more species each time, based on the distance matrix combined with the property mentioned above. For example, consider an additive matrix {{Math|M}} and 5 species {{Math|''a'', ''b'', ''c'', ''d''}} and {{Math|''e''}}. First we form an additive tree for two species {{Math|''a''}} and {{Math|''b''}}. Then we chose a third one, let's say {{Math|''c''}} and attach it to a point {{Math|''x''}} on the edge between {{Math|''a''}} and {{Math|''b''}}. The edge weights are computed with the property above. Next we add the fourth species {{Math|''d''}} to any of the edges. If we apply the property then we identify that {{Math|''d''}} should be attached to only one specific edge. Finally, we add {{Math|''e''}} following the same procedure as before.

===== UPGMA =====
The basic principle of UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is that similar species should be closer in the phylogenetic tree. Hence, it builds the tree by clustering similar sequences iteratively. The method works by building the phylogenetic tree bottom up from its leaves. Initially, we have {{Math|''n''}} leaves (or {{Math|''n''}} singleton trees), each representing a species in {{Math|''S''}}. Those {{Math|''n''}} leaves are referred as {{Math|''n''}} clusters. Then, we perform {{Math|''n''-1}} iterations. In each iteration, we identify two clusters {{Math|''C''<sub>1</sub>}} and {{Math|''C''<sub>2</sub>}} with the smallest average distance and merge them to form a bigger cluster {{Math|''C''}}. If we suppose {{Math|M}} is ultrametric, for any cluster {{Math|''C''}} created by the UPGMA algorithm, {{Math|''C''}} is a valid ultrametric tree.

===== Neighbor joining =====
Neighbor is a bottom-up clustering method. It takes a distance matrix specifying the distance between each pair of sequences. The algorithm starts with a completely unresolved tree, whose topology corresponds to that of a [[star network]], and iterates over the following steps until the tree is completely resolved and all branch lengths are known:

# Based on the current distance matrix calculate the matrix (defined below).
# Find the pair of distinct taxa i and j (i.e. with) for which has its lowest value. These taxa are joined to a newly created node, which is connected to the central node. 
# Calculate the distance from each of the [[Taxon|taxa]] in the pair to this new node.
# Calculate the distance from each of the taxa outside of this pair to the new node.
# Start the algorithm again, replacing the pair of joined neighbors with the new node and using the distances calculated in the previous step.<ref>{{Cite journal |last=Saitou |first=Naruya |date=1987 |title=The neighbor-joining method: A new method for reconstructing phylogenetic trees |url=https://academic.oup.com/mbe/article/4/4/406/1029664?login=false |journal=Molecular Biology and Evolution |volume=4|issue=4 |pages=406–425 |doi=10.1093/oxfordjournals.molbev.a040454 |pmid=3447015 |doi-access=free }}</ref>

===== Fitch–Margoliash =====
The Fitch–Margoliash method uses a weighted [[least squares]] method for clustering based on genetic distance. Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost.<ref>{{Cite journal |last=Fitch |first=Walter M. |date=1967 |title=Construction of Phylogenetic Trees: A method based on mutation distances as estimated from cytochrome c sequences is of general applicability. |url=https://www.science.org/doi/10.1126/science.155.3760.279 |journal=Science |volume=155|issue=3760 |pages=279–284 |doi=10.1126/science.155.3760.279 |pmid=5334057 }}</ref>