== Sequence analysis ==
{{Main|Sequence analysis}}

Digital genetic sequences may be analyzed using the tools of [[bioinformatics]] to attempt to determine their function.

=== Genetic testing ===
{{Main|Genetic testing}}

The DNA in an organism's [[genome]] can be analyzed to [[medical diagnosis|diagnose]] vulnerabilities to inherited [[disease]]s, and can also be used to determine a child's paternity (genetic father) or a person's [[ancestry]]. Normally, every person carries two copies of every [[gene]], one inherited from their mother and the other from their father. The [[human genome]] is believed to contain around 20,000–25,000 genes. In addition to studying [[chromosome]]s to the level of individual genes, genetic testing in a broader sense includes [[biochemical]] tests for the possible presence of [[genetic disease]]s, or of mutant forms of genes associated with increased risk of developing genetic disorders.

Genetic testing identifies changes in chromosomes, genes, or proteins.<ref>{{Cite web |date=16 March 2015 |title=What is genetic testing? |url=http://www.ghr.nlm.nih.gov/handbook/testing/genetictesting |url-status=dead |archive-url=https://web.archive.org/web/20060529002711/http://ghr.nlm.nih.gov/handbook/testing/genetictesting |archive-date=29 May 2006 |access-date=19 May 2010 |website=Genetics Home Reference}}</ref> Usually, testing is used to find changes that are associated with inherited disorders. The results of a genetic test can confirm or rule out a suspected genetic condition or help determine a person's chance of developing or passing on a genetic disorder. Several hundred genetic tests are currently in use, and more are being developed.<ref>{{Cite web |title=Genetic Testing |url=https://www.nlm.nih.gov/medlineplus/genetictesting.html |website=nih.gov}}</ref><ref>{{Cite web |date=2008-09-11 |title=Definitions of Genetic Testing |url=http://www.eurogentest.org/patient/public_health/info/public/unit3/DefinitionsGeneticTesting-3rdDraf18Jan07.xhtml |url-status=dead |archive-url=https://web.archive.org/web/20090204181251/http://eurogentest.org/patient/public_health/info/public/unit3/DefinitionsGeneticTesting-3rdDraf18Jan07.xhtml |archive-date=February 4, 2009 |access-date=2008-08-10 |website=Definitions of Genetic Testing (Jorge Sequeiros and Bárbara Guimarães) |publisher=EuroGentest Network of Excellence Project}}</ref>

=== Sequence alignment ===
{{Main|Sequence alignment}}

In bioinformatics, a sequence alignment is a way of arranging the sequences of [[DNA]], [[RNA]], or [[protein]] to identify regions of similarity that may be due to functional, [[structural biology|structural]], or [[evolution]]ary relationships between the sequences.<ref name="mount">{{Cite book |last=Mount DM. |title=Bioinformatics: Sequence and Genome Analysis |publisher=Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY |year=2004 |isbn=0-87969-608-7 |edition=2nd}}</ref> If two sequences in an alignment share a common ancestor, mismatches can be interpreted as [[point mutation]]s and gaps as [[Insertion (genetics)|insertion]] or [[Deletion (genetics)|deletion mutations]] ([[indel]]s) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between [[amino acid]]s occupying a particular position in the sequence can be interpreted as a rough measure of how [[conservation (genetics)|conserved]] a particular region or [[sequence motif]] is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose [[side chain]]s have similar biochemical properties) in a particular region of the sequence, suggests<ref name="predict">{{Cite journal |last=Ng |first=P. C. |last2=Henikoff |first2=S. |year=2001 |title=Predicting Deleterious Amino Acid Substitutions |journal=Genome Research |volume=11 |issue=5 |pages=863–74 |doi=10.1101/gr.176601 |pmc=311071 |pmid=11337480}}</ref> that this region has structural or functional importance.
Although DNA and RNA [[nucleotide]] bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.<ref>{{Cite journal |last=Witzany |first=G |year=2016 |title=Crucial steps to life: From chemical reactions to code using agents |url=https://philpapers.org/rec/GUECST-2 |journal=Biosystems |volume=140 |pages=49–57 |bibcode=2016BiSys.140...49W |doi=10.1016/j.biosystems.2015.12.007 |pmid=26723230 |s2cid=30962295}}</ref>
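
As an illustration of how mismatches and gaps arise in an alignment, the following is a minimal sketch of a global pairwise alignment using the [[Needleman–Wunsch algorithm]]. The scoring values (match +1, mismatch −1, gap −1) and the function names are assumptions chosen for the example; practical tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (aligned_a, aligned_b, score).

    The scoring parameters are illustrative defaults, not standard values.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # match or mismatch column
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both sequences agree."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)
```

Under these assumed scores, aligning <code>ACGT</code> with <code>AGT</code> places a single gap opposite the <code>C</code>, which would be read as an indel if the two sequences share an ancestor.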

[[Computational phylogenetics]] makes extensive use of sequence alignments in the construction and interpretation of [[phylogenetic tree]]s, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young [[most recent common ancestor]], while low identity suggests that the divergence is more ancient. This approximation, which reflects the "[[molecular clock]]" hypothesis that a roughly constant [[rate of evolution|rate of evolutionary change]] can be used to extrapolate the elapsed time since two genes first diverged (that is, the [[coalescence (genetics)|coalescence]] time), assumes that the effects of mutation and [[natural selection|selection]] are constant across sequence lineages. Therefore, it does not account for possible differences among organisms or species in the rates of [[DNA repair]] or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between [[silent mutation]]s that do not alter the meaning of a given [[codon]] and other mutations that result in a different [[amino acid]] being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.
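
The relationship between sequence identity and evolutionary distance can be sketched numerically. The example below computes the proportion of differing sites between two aligned sequences and applies the Jukes–Cantor correction, <math>d = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right)</math>, one of the simplest substitution models; it assumes equal base frequencies and equal substitution rates. The function names and the <code>rate</code> parameter are hypothetical, and the conversion to time assumes a strict molecular clock, which is exactly the simplification discussed above.

```python
import math


def p_distance(a, b):
    """Fraction of aligned, ungapped columns at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance in substitutions per site."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75 under this model")
    return -0.75 * math.log(1 - 4 * p / 3)


def divergence_time(d, rate):
    """Time since divergence under a strict clock.

    `rate` is an assumed substitution rate per site per unit time on each
    lineage; the factor 2 accounts for change along both lineages.
    """
    return d / (2 * rate)
```

The corrected distance exceeds the raw proportion of differences because repeated substitutions at the same site hide earlier changes.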

=== Sequence motifs ===
{{Main|Sequence motif}}

Frequently the primary structure encodes motifs that are of functional importance. Some examples of sequence motifs are: the C/D<!--
--><ref>{{Cite journal |last=Samarsky |first=DA |last2=Fournier MJ |last3=Singer RH |last4=Bertrand E |year=1998 |title=The snoRNA box C/D motif directs nucleolar targeting and also couples snoRNA synthesis and localization |journal=The EMBO Journal |volume=17 |issue=13 |pages=3747–57 |doi=10.1093/emboj/17.13.3747 |pmc=1170710 |pmid=9649444}}</ref>
and H/ACA boxes<!--
--><ref>{{Cite journal |last=Ganot |first=Philippe |last2=Caizergues-Ferrer |first2=Michèle |last3=Kiss |first3=Tamás |date=1 April 1997 |title=The family of box ACA small nucleolar RNAs is defined by an evolutionarily conserved secondary structure and ubiquitous sequence elements essential for RNA accumulation |journal=[[Genes & Development]] |volume=11 |issue=7 |pages=941–56 |doi=10.1101/gad.11.7.941 |pmid=9106664 |doi-access=free}}</ref>
of [[snoRNA]]s, the [[LSm|Sm binding site]] found in spliceosomal RNAs such as [[U1 spliceosomal RNA|U1]], [[U2 spliceosomal RNA|U2]], [[U4 spliceosomal RNA|U4]], [[U5 spliceosomal RNA|U5]], [[U6 spliceosomal RNA|U6]], [[U12 minor spliceosomal RNA|U12]] and [[Small nucleolar RNA U3|U3]], the [[Shine-Dalgarno sequence]],<!--
--><ref>{{Cite journal |last=Shine J, Dalgarno L |author-link=John Shine |author-link2=Lynn Dalgarno |year=1975 |title=Determinant of cistron specificity in bacterial ribosomes |journal=Nature |volume=254 |issue=5495 |pages=34–38 |bibcode=1975Natur.254...34S |doi=10.1038/254034a0 |pmid=803646 |s2cid=4162567}}</ref>
the [[Kozak consensus sequence]]<!--
--><ref name="Kozak1987">{{Cite journal |last=Kozak M |date=October 1987 |title=An analysis of 5'-noncoding sequences from 699 vertebrate messenger RNAs |journal=Nucleic Acids Res. |volume=15 |issue=20 |pages=8125–48 |doi=10.1093/nar/15.20.8125 |pmc=306349 |pmid=3313277}}</ref>
and the [[RNA polymerase III|RNA polymerase III terminator]]<!--
-->.<ref name="pmid6263489">{{Cite journal |last=Bogenhagen DF, Brown DD |year=1981 |title=Nucleotide sequences in Xenopus 5S DNA required for transcription termination. |journal=Cell |volume=24 |issue=1 |pages=261–70 |doi=10.1016/0092-8674(81)90522-5 |pmid=6263489 |s2cid=9982829}}</ref>
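
Short motifs such as these can be located by scanning a sequence for a consensus pattern. The sketch below is a simplified illustration: the consensus strings used in the usage note (<code>AGGAGG</code> for the Shine-Dalgarno sequence and <code>RCCAUGG</code> for a core Kozak context, where R is a purine) are textbook approximations assumed for the example, and the function name is hypothetical. Real motifs are degenerate and are usually modelled with [[position weight matrix|position weight matrices]] rather than exact patterns.

```python
import re

# Subset of IUPAC nucleotide codes, translated to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U", "T": "T",
    "R": "[AG]",      # purine
    "Y": "[CUT]",     # pyrimidine
    "N": "[ACGUT]",   # any base
}


def find_motif(seq, motif):
    """Return 0-based start positions of every match, including overlaps."""
    pattern = "".join(IUPAC[ch] for ch in motif.upper())
    # A zero-width lookahead lets overlapping occurrences be reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", seq.upper())]
```

For example, <code>find_motif("UUAGGAGGUAACAUAUGGCU", "AGGAGG")</code> reports the single occurrence starting at position 2 (counting from zero).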

=== Sequence entropy ===

In [[bioinformatics]], a sequence entropy, also known as sequence complexity or information profile,<ref name="glance">{{Cite journal |last=Pinho |first=A |last2=Garcia, S |last3=Pratas, D |last4=Ferreira, P |date=Nov 21, 2013 |title=DNA Sequences at a Glance. |journal=PLOS ONE |volume=8 |issue=11 |pages=e79922 |bibcode=2013PLoSO...879922P |doi=10.1371/journal.pone.0079922 |pmc=3836782 |pmid=24278218 |doi-access=free}}</ref> is a numerical sequence that provides a quantitative measure of the local complexity of a DNA sequence, independently of the direction of processing. Manipulating information profiles enables sequences to be analyzed using alignment-free techniques, for example in motif and rearrangement detection.<ref name="glance" /><ref name="rearrang">{{Cite journal |last=Pratas |first=D |last2=Silva, R |last3=Pinho, A |last4=Ferreira, P |date=May 18, 2015 |title=An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. |journal=Scientific Reports |volume=5 |pages=10203 |bibcode=2015NatSR...510203P |doi=10.1038/srep10203 |pmc=4434998 |pmid=25984837}}</ref><ref name="troy">{{Cite journal |last=Troyanskaya |first=O |last2=Arbell, O |last3=Koren, Y |last4=Landau, G |last5=Bolshoy, A |date=2002 |title=Sequence complexity profiles of prokaryotic genomic sequences: A fast algorithm for calculating linguistic complexity. |journal=Bioinformatics |volume=18 |pages=679–88 |doi=10.1093/bioinformatics/18.5.679 |pmid=12050064 |doi-access=free |number=5}}</ref>
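
A rough idea of such a profile can be obtained with a sliding-window [[Entropy (information theory)|Shannon entropy]] of base composition. This is a simplified stand-in and not the method of the cited works, which rely on compression models and linguistic complexity; the function names and the default window length are assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy, in bits per symbol, of the symbols in `window`."""
    counts = Counter(window)
    total = len(window)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def entropy_profile(seq, window=8):
    """Entropy of each length-`window` stretch of seq, one value per start."""
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]
```

A window containing a single repeated base scores 0 bits, while a window in which all four bases are equally frequent scores 2 bits, the maximum for a four-letter alphabet. Because only composition within the window is counted, the profile of a reversed sequence is the reverse of the original profile.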