
==Methods==
Computational approaches are necessary for genome comparisons, given the large amount of data encoded in genomes. Many tools are now publicly available, with applications ranging from whole-genome comparison to [[gene expression]] analysis.<ref>{{cite book | vauthors=Cristianini N, Hahn M |url=http://www.computational-genomics.net |title=Introduction to Computational Genomics |publisher=Cambridge University Press |year=2006 |isbn=978-0-521-67191-0}}</ref> These tools draw on approaches from systems and control theory, information theory, string analysis and data mining.<ref name="smash">{{cite journal |vauthors=Pratas D, Silva RM, Pinho AJ, Ferreira PJ |title=An alignment-free method to find and visualise rearrangements between pairs of DNA sequences |journal=Scientific Reports |volume=5 |pages=10203 |date=May 2015 |pmid=25984837 |pmc=4434998 |doi=10.1038/srep10203 |bibcode=2015NatSR...510203P }}</ref> Computational approaches will remain critical for research and teaching, especially when information science and genome biology are taught in conjunction.<ref>{{cite journal |vauthors=Via A, De Las Rivas J, Attwood TK, Landsman D, Brazas MD, Leunissen JA, Tramontano A, Schneider MV |title=Ten simple rules for developing a short bioinformatics training course |journal=PLOS Computational Biology |volume=7 |issue=10 |pages=e1002245 |date=October 2011 |pmid=22046119 |pmc=3203054 |doi=10.1371/journal.pcbi.1002245 |doi-access=free |bibcode=2011PLSCB...7E2245V}}</ref>
[[File:Phylogenetic tree of descendant species and reconstructed ancestors.png|thumb|upright=1.35|Phylogenetic tree of descendant species and reconstructed ancestors. The branch color represents breakpoint rates in RACFs (breakpoints per million years). Black branches represent nondetermined breakpoint rates. Tip colors depict assembly contiguity: black, scaffold-level genome assembly; green, chromosome-level genome assembly; yellow, chromosome-scale scaffold-level genome assembly. Numbers next to species names indicate diploid chromosome number (if known).<ref name="Evolution of the ancestral mammalia">{{cite journal |vauthors=Damas J, Corbo M, Kim J, Turner-Maier J, Farré M, Larkin DM, Ryder OA, Steiner C, Houck ML, Hall S, Shiue L, Thomas S, Swale T, Daly M, Korlach J, Uliano-Silva M, Mazzoni CJ, Birren BW, Genereux DP, Johnson J, Lindblad-Toh K, Karlsson EK, Nweeia MT, Johnson RN, Lewin HA | title = Evolution of the ancestral mammalian karyotype and syntenic regions |journal=Proceedings of the National Academy of Sciences of the United States of America |volume=119 |issue=40 |pages=e2209139119 |date=October 2022 |pmid=36161960 |pmc=9550189 |doi=10.1073/pnas.2209139119 |doi-access=free |bibcode=2022PNAS..11909139D}}</ref>]]
Comparative genomics starts with basic comparisons of genome size and gene density. For instance, genome size is important for coding capacity and possibly for regulatory reasons. High gene density facilitates [[genome annotation]] and the analysis of environmental selection. By contrast, low gene density hampers the mapping of genetic disease, as in the human genome.

=== Sequence alignment ===
[[Sequence alignment|Alignments]] are used to capture information about the relationship between similar sequences, such as shared ancestry or common structure and function. Alignments can be done for both nucleotide and protein sequences.<ref>{{cite book | vauthors = Altschul SF, Pop M | chapter = Sequence Alignment |date=2017 | chapter-url=http://www.ncbi.nlm.nih.gov/books/NBK464187/ | title = Handbook of Discrete and Combinatorial Mathematics | veditors = Rosen KH, Shier DR, Goddard W |edition=2nd |place=Boca Raton (FL) |publisher=CRC Press/Taylor & Francis |isbn=978-1-58488-780-5 |pmid=29206392 |access-date=2022-12-18 }}</ref><ref>{{cite book | vauthors = Prjibelski AD, Korobeynikov AI, Lapidus AL | chapter = Sequence Analysis |date=2019-01-01 | title = Encyclopedia of Bioinformatics and Computational Biology |pages=292–322 | veditors = Ranganathan S, Gribskov M, Nakai K, Schönbach C |place=Oxford |publisher=Academic Press |language=en |doi=10.1016/b978-0-12-809633-8.20106-4 |isbn=978-0-12-811432-2 | s2cid = 226247797 }}</ref> Alignments may be pairwise (local or global) or multiple. Global alignments can be found with a dynamic programming algorithm known as the [[Needleman–Wunsch algorithm]], whereas the [[Smith–Waterman algorithm]] is used to find local alignments. With the exponential growth of sequence databases and the emergence of longer sequences, there is heightened interest in faster, approximate, or [https://science.umd.edu/labs/delwiche/bsci348s/lec/AlignHeuristic.html#:~:text=Heuristic%20Alignment&text=BLAST%20is%20a%20pairwise%20local,sequence%20databases%20such%20as%20GenBank. heuristic alignment] procedures. Among these, the '''FASTA''' and '''BLAST''' algorithms are prominent for local pairwise alignment. Programs tailored to aligning long sequences have also been developed, such as '''MUMmer''' (1999), '''BLASTZ''' (2003), and '''AVID''' (2003).
While BLASTZ adopts a local approach, MUMmer and AVID are geared towards global alignment. One effective strategy combines the two approaches to obtain the benefits of both. First, '''BLAT''', a fast BLAST-like alignment tool, is used to identify homologous "anchor" regions. These anchors are then examined to identify sets with conserved order and orientation. Each such set of anchors is finally aligned using a global strategy.
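
The dynamic programming recurrences behind global and local alignment can be illustrated with a minimal sketch. The scoring values (match = 1, mismatch = −1, linear gap = −2) and the function names are assumptions chosen for illustration; production tools use substitution matrices, affine gap penalties and a traceback step to recover the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (illustrative scoring)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Global alignment: leading gaps are penalised.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best local alignment between any substrings of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero, so an
            # alignment can restart anywhere in either sequence.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


# Toy sequences sharing only the core "ACGT".
print(needleman_wunsch_score("CCCACGTCCC", "GGGACGTGGG"))
print(smith_waterman_score("CCCACGTCCC", "GGGACGTGGG"))
```

The only difference between the two functions is the initialisation of the first row and column and the floor of zero in the local version, which is why the local score is never lower than the global one under the same scoring scheme.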

Ongoing efforts also focus on speeding up existing algorithms so that they can handle the vast amount of genome sequence data. '''MAVID''' is another alignment program, designed specifically for the multiple alignment of genomes.
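
The anchor-based strategy described earlier in this section depends on selecting anchors with conserved order and orientation. The sketch below shows one simple way to do this, assuming anchors are given as hypothetical tuples of (position in genome 1, position in genome 2, strand) and considering only forward-strand anchors. It uses a quadratic-time longest-increasing-chain search, which is an illustrative simplification rather than the chaining method of any particular program.

```python
def chain_anchors(anchors):
    """Return the longest chain of forward-strand anchors whose positions
    increase in both genomes (conserved order and orientation)."""
    forward = sorted(a for a in anchors if a[2] == "+")
    n = len(forward)
    if n == 0:
        return []
    best_len = [1] * n   # length of the best chain ending at anchor i
    prev = [-1] * n      # predecessor of anchor i in that chain
    for i in range(n):
        for j in range(i):
            increasing = forward[j][0] < forward[i][0] and forward[j][1] < forward[i][1]
            if increasing and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    # Walk back from the end of the longest chain.
    end = max(range(n), key=best_len.__getitem__)
    chain = []
    while end != -1:
        chain.append(forward[end])
        end = prev[end]
    return chain[::-1]


# Hypothetical anchors: (start in genome 1, start in genome 2, strand).
anchors = [
    (100, 150, "+"),
    (300, 900, "+"),   # out of order in genome 2
    (500, 400, "+"),
    (700, 650, "+"),
    (800, 200, "-"),   # opposite orientation
    (900, 1000, "+"),
]
for anchor in chain_anchors(anchors):
    print(anchor)
```

The regions between consecutive anchors in the returned chain are the ones that would then be passed to a global aligner.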

'''Pairwise comparison:''' Pairwise comparison of genomic sequence data is widely used in comparative gene prediction. Many studies in comparative functional genomics rely on pairwise comparisons, in which traits of each gene are compared with traits of other genes across species. This method yields many more comparisons than unique observations, making each comparison dependent on others.<ref>{{cite journal | vauthors = Haubold B, Wiehe T | title = Comparative genomics: methods and applications | journal = Die Naturwissenschaften | volume = 91 | issue = 9 | pages = 405–421 | date = September 2004 | pmid = 15278216 | doi = 10.1007/s00114-004-0542-8 | bibcode = 2004NW.....91..405H }}</ref><ref>{{cite journal | vauthors = Dunn CW, Zapata F, Munro C, Siebert S, Hejnol A | title = Pairwise comparisons across species are problematic when analyzing functional genomic data | journal = Proceedings of the National Academy of Sciences of the United States of America | volume = 115 | issue = 3 | pages = E409–E417 | date = January 2018 | pmid = 29301966 | pmc = 5776959 | doi = 10.1073/pnas.1707515115 | doi-access = free | bibcode = 2018PNAS..115E.409D }}</ref>
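
The imbalance between comparisons and observations follows from simple counting: ''n'' species give ''n''(''n'' − 1)/2 pairs, and each species takes part in ''n'' − 1 of them. The short sketch below illustrates this with ten hypothetical species labels.

```python
from itertools import combinations
from math import comb

# Hypothetical labels; each stands for one independent observation.
species = [f"species_{i}" for i in range(1, 11)]

pairs = list(combinations(species, 2))
assert len(pairs) == comb(len(species), 2)

# Every species is reused in n - 1 comparisons, so the comparisons
# are not independent of one another.
reuse = sum(1 for pair in pairs if "species_1" in pair)

print(len(species), "observations,", len(pairs), "pairwise comparisons")
print("species_1 appears in", reuse, "comparisons")
```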

'''Multiple comparisons:''' The comparison of multiple genomes is a natural extension of pairwise inter-specific comparisons. Such comparisons typically aim to identify conserved regions at one of two phylogenetic scales. Deep comparisons, often referred to as '''phylogenetic footprinting''',<ref>{{cite journal | vauthors = Hardison RC, Oeltjen J, Miller W | title = Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome | journal = Genome Research | volume = 7 | issue = 10 | pages = 959–966 | date = October 1997 | pmid = 9331366 | doi = 10.1101/gr.7.10.959 | doi-access = free }}</ref> reveal conservation across higher taxonomic units such as vertebrates.<ref>{{cite journal | vauthors = Elgar G, Sandford R, Aparicio S, Macrae A, Venkatesh B, Brenner S | title = Small is beautiful: comparative genomics with the pufferfish (Fugu rubripes) | journal = Trends in Genetics | volume = 12 | issue = 4 | pages = 145–150 | date = April 1996 | pmid = 8901419 | doi = 10.1016/0168-9525(96)10018-4 }}</ref> Shallow comparisons, termed '''phylogenetic shadowing''',<ref>{{cite journal |vauthors=Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM |title=Phylogenetic shadowing of primate sequences to find functional regions of the human genome |journal=Science |volume=299 |issue=5611 |pages=1391–1394 |date=February 2003 |pmid=12610304 |doi=10.1126/science.1081331 |url=https://digital.library.unt.edu/ark:/67531/metadc779156/}}</ref> probe conservation across a group of closely related species.
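
Both kinds of comparison look for alignment columns that vary less than expected. As a rough illustration only, the sketch below scores each column of a toy multiple alignment by the frequency of its most common residue and reports runs of fully conserved columns. The function names, threshold and sequences are assumptions made for this example; published footprinting and shadowing methods use explicit evolutionary models rather than raw identity.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences carrying the most common residue in each column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]  # gaps never count as conserved
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_runs(scores, threshold=1.0, min_len=3):
    """Half-open (start, end) intervals of consecutive columns scoring >= threshold."""
    runs, start = [], None
    for i, s in enumerate(scores + [-1.0]):  # sentinel closes a trailing run
        if s >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Toy alignment of three hypothetical sequences.
alignment = ["ACGTTGCA", "ACGTAGCA", "ACGTCGCA"]
scores = column_conservation(alignment)
print(conserved_runs(scores))
```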
[[File:Genomic structural variation.png|thumb|upright=1.15 |Chromosome by chromosome variation of indicine and taurine cattle. The genomic structural differences on chromosome X between indicine (''Bos indicus'' – [[Nelore | Nelore cattle]]) and taurine cattle (''Bos taurus'' – [[Hereford cattle]]) were identified using the SyRI tool.]]

=== Whole-genome alignment ===
Whole-genome alignment (WGA) involves predicting evolutionary relationships at the nucleotide level between two or more genomes. It integrates elements of colinear sequence alignment and [[gene orthology]] prediction, and is more challenging than either because of the vast size and intricate nature of whole genomes. Despite this complexity, numerous methods have emerged to tackle the problem, because WGAs play a crucial role in various genome-wide analyses, such as phylogenetic inference, genome annotation, and function prediction.<ref>{{cite book | vauthors = Dewey CN | chapter = Whole-Genome Alignment | series = Methods in Molecular Biology | title = Evolutionary Genomics | volume = 855 | pages = 237–257 | date = 2012 | pmid = 22407711 | doi = 10.1007/978-1-61779-582-4_8 | publisher = Humana Press | isbn = 978-1-61779-581-7 | place = Totowa, NJ | veditors = Anisimova M }}</ref> SyRI (Synteny and Rearrangement Identifier) is a tool that builds on whole-genome alignment and is designed to identify both structural and sequence differences between two [[Sequence assembly|whole-genome assemblies]]. Taking WGAs as input, SyRI first scans for differences in genome structure. It then identifies local sequence variation within both rearranged and non-rearranged (syntenic) regions.<ref>{{cite journal | doi=10.1186/s13059-019-1911-0 | doi-access=free | title=SyRI: Finding genomic rearrangements and local sequence differences from whole-genome assemblies | date=2019 | journal=Genome Biology | volume=20 | pmid=31842948 | vauthors = Goel M, Sun H, Jiao W, Schneeberger K | issue=1 | page=277 | pmc=6913012 }}</ref>

[[File:Betacoronavirus Phylogenetic Tree.png|thumb|upright=1.15 |Example of a phylogenetic tree created from an alignment of 250 unique spike protein sequences from the Betacoronavirus family.]]

=== Phylogenetic reconstruction ===
Another computational method for comparative genomics is phylogenetic reconstruction. It is used to describe evolutionary relationships in terms of common ancestors. The relationships are usually represented in a tree called a [[phylogenetic tree]]. Similarly, [[coalescent theory]] is a retrospective model that traces alleles of a gene in a population back to a single ancestral copy shared by members of the population. This ancestral copy is also known as the [[most recent common ancestor]]. Analysis based on coalescent theory attempts to estimate the amount of time between the introduction of a mutation and a particular allele or gene distribution in a population. This time period is equal to how long ago the most recent common ancestor existed. The inheritance relationships are visualized in a form similar to a phylogenetic tree. Coalescence (or the gene genealogy) can be visualized using [[dendrogram]]s.<ref>{{cite journal | vauthors = Haubold B, Wiehe T | title = Comparative genomics: methods and applications | journal = Die Naturwissenschaften | volume = 91 | issue = 9 | pages = 405–421 | date = September 2004 | pmid = 15278216 | doi = 10.1007/s00114-004-0542-8 | s2cid = 2041895 | bibcode = 2004NW.....91..405H }}</ref>
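
As an illustration of how a tree can be built from aligned sequences, the sketch below computes pairwise p-distances (the fraction of differing sites) and clusters them with UPGMA, a simple distance-based method chosen here only for brevity. The sequences and species labels are hypothetical, the output is a Newick string without branch lengths, and UPGMA assumes a constant rate of evolution, which real datasets often violate.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing sites between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(names, dist):
    """Cluster names by UPGMA.

    dist maps frozenset({name1, name2}) to a distance.
    Returns the tree topology as a Newick string.
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"


# Hypothetical aligned sequences.
seqs = {
    "sp1": "ACGTACGTAC",
    "sp2": "ACGTACGTAA",
    "sp3": "ACGAACTTAC",
    "sp4": "TCGAACTTAC",
}
dist = {frozenset((a, b)): p_distance(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)}
tree = upgma(list(seqs), dist)
print(tree)
```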
[[File:Synteny.png|thumb|upright=1.5|Example of synteny block and break. Genes located on chromosomes of two species are denoted in letters. Each gene is associated with a number representing the species they belong to (species 1 or 2). Orthologous genes are connected by dashed lines and genes without an orthologous relationship are treated as gaps in synteny programs.<ref>{{cite journal |vauthors=Liu D, Hunt M, Tsai IJ |title=Inferring synteny between genome assemblies: a systematic evaluation |journal=BMC Bioinformatics |volume=19 |issue=1 |pages=26 |date=January 2018 |pmid=29382321 |pmc=5791376 |doi=10.1186/s12859-018-2026-4 |doi-access=free}}</ref>]]

=== Genome maps ===
An additional method in comparative genomics is [[genetic mapping]]. In genetic mapping, visualizing [[synteny]] is one way to see the preserved order of genes on chromosomes. It is usually applied to chromosomes of related species that descend from a common ancestor.<ref>{{cite book | vauthors = Duran C, Edwards D, Batley J | title = Plant Genomics | chapter = Genetic Maps and the Use of Synteny | series = Methods in Molecular Biology | volume = 513 | pages = 41–55 | date = 2009 | pmid = 19347649 | doi = 10.1007/978-1-59745-427-8_3 | isbn = 978-1-58829-997-0 }}</ref> This and other methods can shed light on evolutionary history. A recent study used comparative genomics to reconstruct 16 ancestral [[karyotype]]s across the mammalian phylogeny. The computational reconstruction showed how chromosomes were rearranged during mammalian evolution. It gave insight into the conservation of select regions often associated with the control of developmental processes. In addition, it helped to provide an understanding of chromosome evolution and of [[genetic diseases]] associated with DNA rearrangements.{{Citation needed|date=December 2022}}
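
The idea of a synteny block, a run of genes whose order is preserved between two species, can be sketched as follows. The gene names, the ortholog table and the function name are hypothetical. Genes without an ortholog are skipped as gaps, and a block is extended only while orthologs remain adjacent in the second species, in either forward or reverse direction; dedicated synteny programs use more tolerant criteria.

```python
def synteny_blocks(order1, order2, orthologs, min_genes=2):
    """Group genes of species 1 into runs whose orthologs are consecutive
    in species 2, in either forward or reverse direction."""
    # Rank species-2 genes, ignoring those without an ortholog (gaps).
    targets = set(orthologs.values())
    pos2 = {gene: i for i, gene in enumerate(g for g in order2 if g in targets)}
    pairs = [
        (gene, pos2[orthologs[gene]])
        for gene in order1
        if gene in orthologs and orthologs[gene] in pos2
    ]
    blocks, current, direction = [], [], 0
    for gene, pos in pairs:
        if current:
            step = pos - current[-1][1]
            if step in (1, -1) and direction in (0, step):
                direction = step
                current.append((gene, pos))
                continue
            # Order is broken: close the current block.
            if len(current) >= min_genes:
                blocks.append([g for g, _ in current])
        current, direction = [(gene, pos)], 0
    if len(current) >= min_genes:
        blocks.append([g for g, _ in current])
    return blocks


# Hypothetical gene orders on one chromosome of each species.
order1 = ["a1", "b1", "c1", "d1", "e1", "f1", "g1"]
order2 = ["a2", "b2", "x2", "c2", "g2", "f2", "e2"]
orthologs = {"a1": "a2", "b1": "b2", "c1": "c2", "e1": "e2", "f1": "f2", "g1": "g2"}

print(synteny_blocks(order1, order2, orthologs))
```

In this toy example the first three genes form a block in conserved order, while the last three form an inverted block, with the boundary between them corresponding to a synteny break.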
[[File:Reconstruction of mammillian chromosomes.png|alt=Solid green squares indicate mammalian chromosomes maintained as a single synteny block (either as a single chromosome or fused with another MAM), with shades of the color indicating the fraction of the chromosome affected by intra-chromosomal rearrangements (the lightest shade is most affected). Split blocks demarcate mammalian chromosomes affected by inter-chromosomal rearrangements. Upper (green) triangles show the fraction of the chromosome affected by intra-chromosomal rearrangements, and lower (red) triangles show the fraction affected by inter-chromosomal rearrangements. Syntenic relationships of each MAM to the human genome are given at the right of the diagram. MAMX appears split in goat because its X chromosome is assembled as two separate fragments. BOR, boreoeutherian ancestor chromosome; EUA, Euarchontoglires ancestor chromosome; EUC, Euarchonta ancestor chromosome; EUT, eutherian ancestor chromosome; PMT, Primatomorpha ancestor chromosome; PRT, primates (Hominidae) ancestor chromosome; THE, therian ancestor chromosome.|thumb|upright=1.5|Image from the study ''Evolution of the ancestral mammalian karyotype and syntenic regions'', visualizing the evolutionary history of reconstructed mammalian chromosomes based on the human lineage.<ref name="Evolution of the ancestral mammalia"/>]]