== Impacts of taxon sampling ==
In phylogenetic analysis, taxon sampling selects a small group of exemplar taxa to infer the evolutionary history of a clade.<ref name="incomplete taxon sampling">{{cite journal |last1=Rosenberg |first1=Michael |title=Incomplete taxon sampling is not a problem for phylogenetic inference |journal=Proceedings of the National Academy of Sciences |date=28 August 2001 |volume=98 |issue=19 |pages=10751–10756 |doi=10.1073/pnas.191248498 |pmid=11526218 |pmc=58547 |bibcode=2001PNAS...9810751R |doi-access=free }}</ref> This process is also known as [[stratified sampling]] or clade-based sampling.<ref name="taxonSampling">{{cite journal |last1=Rosenberg |first1=Michael |last2=Kumar |first2=Sudhir |title=Taxon Sampling, Bioinformatics, and Phylogenomics |journal=Systematic Biology|date=1 February 2003 |volume=52 |issue=1 |pages=119–124 |doi=10.1080/10635150390132894 |pmid=12554445 |pmc=2796430 |url=https://doi.org/10.1080/10635150390132894 |access-date=19 April 2023}}</ref> Judicious taxon sampling is important because resources are rarely sufficient to compare and analyze every species within a diverse clade, and because phylogenetic software has computational limits.<ref name="incomplete taxon sampling" /> Poor taxon sampling may result in incorrect phylogenetic inferences.<ref name="taxonSampling" /> [[Long branch attraction]], in which unrelated branches are incorrectly grouped by shared, homoplastic nucleotide sites, is a theoretical cause of inaccuracy.<ref name="incomplete taxon sampling" />

[[File:Accuracy increase sites per taxon.png|thumb|Percentage of inter-ordinal branches reconstructed with a constant number of bases and four phylogenetic tree construction methods: neighbor-joining (NJ), minimum evolution (ME), unweighted maximum parsimony (MP), and maximum likelihood (ML). The plot shows that phylogenetic analysis with fewer taxa and more genes per taxon more often matches the replicable consensus tree. The dotted line marks equal accuracy between the two taxon sampling methods. Figure is property of Michael S. Rosenberg and Sudhir Kumar as presented in the journal article ''Taxon Sampling, Bioinformatics, and Phylogenomics''.<ref name="taxonSampling" />]]
It is debated whether increasing the number of taxa sampled improves phylogenetic accuracy more than increasing the number of genes sampled per taxon. The two sampling strategies differ in the number of nucleotide sites included in a sequence alignment, which may contribute to the disagreement. For example, phylogenetic trees constructed from a larger total number of nucleotides are generally more accurate, as supported by the bootstrap replicability of trees built from random resampling.
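
As an illustration of the random sampling behind bootstrap support, the following minimal Python sketch resamples the columns of an alignment with replacement to build one pseudo-replicate. The function name, taxon labels, and sequences are hypothetical and are not taken from the cited studies. In practice, a tree would be inferred from each of many such replicates, and the fraction of replicates recovering a given branch would be reported as its support.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate by resampling alignment columns with replacement.

    `alignment` maps taxon names to equal-length aligned sequences (assumed input format).
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw as many column indices as there are sites, allowing repeats.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Apply the same column draw to every taxon so columns stay intact.
    return {taxon: "".join(seq[i] for i in cols) for taxon, seq in alignment.items()}


# Toy alignment; taxon names and sequences are invented for illustration.
toy = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTTCGTAC",
    "taxon_C": "ACGAACGTTC",
}
replicate = bootstrap_alignment(toy, random.Random(0))
```

Because every replicate keeps the original number of sites, alignments with more total nucleotides give each resample more informative columns to draw from, which is consistent with the higher replicability described above.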

The graphic presented in ''Taxon Sampling, Bioinformatics, and Phylogenomics'' compares the correctness of phylogenetic trees generated using fewer taxa and more sites per taxon (x-axis) with trees generated using more taxa and fewer sites per taxon (y-axis). With fewer taxa, more genes are sampled per taxon; as more taxa are added to the sampling group, fewer genes are sampled per taxon. Both methods sample the same total number of nucleotide sites. The dotted line represents equal (1:1) accuracy between the two sampling methods. Most of the plotted points lie below the dotted line, which indicates higher accuracy when sampling fewer taxa with more sites per taxon. The study tests this with four phylogenetic tree construction methods: neighbor-joining (NJ), minimum evolution (ME), unweighted maximum parsimony (MP), and maximum likelihood (ML). With the majority of methods, sampling fewer taxa with more sites per taxon demonstrated higher accuracy.

Generally, when a roughly equal total number of nucleotide sites is aligned, sampling more genes per taxon yields higher bootstrap replicability than sampling more taxa. However, unbalanced datasets within genomic databases make it increasingly difficult to add genes for rarely sampled organisms.<ref name="taxonSampling" />