==Phylogenetic analysis==
{{Main|Computational phylogenetics}}

Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness.<ref name=ortet>{{cite journal|author1=Ortet P |author2=Bastien O | year=2010 | title=Where Does the Alignment Score Distribution Shape Come from? | journal= Evolutionary Bioinformatics | volume=6| pages=159–187| pmid = 21258650| doi = 10.4137/EBO.S5875 | url=http://www.la-press.com/where-does-the-alignment-score-distribution-shape-come-from-article-a2393| pmc=3023300}}</ref> The field of [[phylogenetics]] makes extensive use of sequence alignments in the construction and interpretation of [[phylogenetic tree]]s, which are used to classify the evolutionary relationships between homologous [[gene]]s represented in the [[genome]]s of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young [[most recent common ancestor]], while low identity suggests that the divergence is more ancient. This approximation, which reflects the "[[molecular clock]]" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the [[coalescence (genetics)|coalescence]] time), assumes that the effects of mutation and [[natural selection|selection]] are constant across sequence lineages. Therefore, it does not account for possible differences among organisms or species in the rates of [[DNA repair]] or the possible functional conservation of specific regions in a sequence.
(In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between [[silent mutation]]s that do not alter the meaning of a given [[codon]] and other mutations that result in a different [[amino acid]] being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.

Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly [[heuristic]] because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is [[NP-hard]].<ref name=felsenstein>{{cite book| author=Felsenstein J. | year=2004| title=Inferring Phylogenies | publisher= Sinauer Associates: Sunderland, MA | isbn=978-0-87893-177-4}}</ref>

===Assessment of significance===
Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible for [[convergent evolution]] to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.
In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts. Methods of statistical significance estimation for gapped sequence alignments are available in the literature.<ref name="ortet"/><ref name=altschul>{{cite book|author1=Altschul SF |author2=Gish W |chapter=Local alignment statistics |title=Computer Methods for Macromolecular Sequence Analysis | year=1996 | volume=266 | pages = 460–480|doi=10.1016/S0076-6879(96)66029-7|pmid=8743700 |series=Methods in Enzymology|isbn=9780121821678}}</ref><ref name=hartmann>{{cite journal| author=Hartmann AK| year=2002| title=Sampling rare events: statistics of local sequence alignments| journal= Phys. Rev. 
E| volume=65| page=056102|doi=10.1103/PhysRevE.65.056102| pmid=12059642| issue=5|arxiv=cond-mat/0108201|bibcode=2002PhRvE..65e6102H| s2cid=193085}}</ref><ref name=newberg>{{cite journal| author=Newberg LA | year=2008 | title=Significance of gapped sequence alignments | journal= J Comput Biol| volume=15| pages=1187–1194 | pmid = 18973434 | doi=10.1089/cmb.2008.0125| issue=9| pmc=2737730}}</ref><ref name=eddy>{{cite journal| author=Eddy SR| year=2008 | title=A probabilistic model of local sequence alignment that simplifies statistical significance estimation | journal= PLOS Comput Biol | volume=4| editor1-first=Burkhard| pages=e1000069 | pmid = 18516236| editor1-last=Rost | doi=10.1371/journal.pcbi.1000069| issue=5| pmc=2396288| last2=Rost| first2=Burkhard| bibcode=2008PLSCB...4E0069E| s2cid=15640896 | doi-access=free }}</ref><ref name=bastien>{{cite journal|author1=Bastien O |author2=Aude JC |author3=Roy S |author4=Marechal E | year=2004 | title=Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics | journal= Bioinformatics | volume=20| issue=4| pages=534–537| pmid = 14990449| doi = 10.1093/bioinformatics/btg440 | doi-access=free | citeseerx=10.1.1.602.6979 }}</ref><ref name=agrawal11>{{cite journal|author1=Agrawal A |author2=Huang X | year=2011| title=Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices|journal= IEEE/ACM Transactions on Computational Biology and Bioinformatics| volume=8| pages=194–205|doi=10.1109/TCBB.2009.69|pmid=21071807 | issue=1|s2cid=6559731 }}</ref><ref name=agrawal08>{{cite journal| author1=Agrawal A| author2=Brendel VP| author3=Huang X| year=2008| title=Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment| journal=International Journal of Computational Biology and Drug Design| volume=1| pages=347–367| 
doi=10.1504/IJCBDD.2008.022207| pmid=20063463| url=http://inderscience.metapress.com/content/1558538106522500/| issue=4| url-status=dead| archive-url=https://archive.today/20130128163812/http://inderscience.metapress.com/content/1558538106522500/| archive-date=28 January 2013| df=dmy-all| url-access=subscription}}</ref>

===Assessment of credibility===
Statistical significance indicates the probability that an alignment of a given quality could arise by chance, but does not indicate how much better a given alignment is than alternative alignments of the same sequences. Measures of alignment credibility indicate the extent to which the best-scoring alignments for a given pair of sequences are substantially similar. Methods of alignment credibility estimation for gapped sequence alignments are available in the literature.<ref name=NewbergLawrence2009>{{cite journal|author1=Newberg LA |author2=Lawrence CE | year=2009 | title=Exact Calculation of Distributions on Integers, with Application to Sequence Alignment | journal= J Comput Biol| volume=16| pages=1–18 | pmid = 19119992 | doi=10.1089/cmb.2008.0137| issue=1| pmc=2858568}}</ref>

===Scoring functions===
The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using [[substitution matrix|substitution matrices]] that reflect the probabilities of given character-to-character substitutions. A series of matrices called [[Point accepted mutation|PAM matrices]] (Point Accepted Mutation matrices, originally defined by [[Margaret Dayhoff]] and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as [[BLOSUM]] (Blocks Substitution Matrix), encodes empirically derived substitution probabilities.
Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand them to detect more divergent sequences. [[Gap penalty|Gap penalties]] account for the introduction of a gap (in the evolutionary model, an insertion or deletion mutation) in both nucleotide and protein sequences, and the penalty values should therefore be proportional to the expected rate of such mutations. The quality of the alignments produced thus depends on the quality of the scoring function.

It can be very useful and instructive to try the same alignment several times with different choices of scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.
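
The parameter-variation check described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes a simple match/mismatch score in place of a PAM or BLOSUM matrix and a linear gap penalty in place of the affine penalties used by most tools, and the function names <code>needleman_wunsch</code> and <code>robust_pairs</code> are hypothetical. Residue pairs that stay aligned under every parameter set are reported as robust.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (a simplifying assumption).

    Returns (score, pairs), where pairs is the set of (i, j) index pairs
    such that a[i] is aligned to b[j] in one optimal alignment.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner, preferring the diagonal
    # move when several moves are equally good.
    pairs = set()
    i, j = n, m
    while i > 0 and j > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            pairs.add((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return score[n][m], pairs


def robust_pairs(a, b, parameter_sets):
    """Residue pairs that are aligned under every parameter set."""
    results = [needleman_wunsch(a, b, **p)[1] for p in parameter_sets]
    return set.intersection(*results)


# Example: vary only the gap penalty and list the pairs that never change.
params = [{"gap": -1}, {"gap": -2}, {"gap": -4}]
print(sorted(robust_pairs("GATTACA", "GCATGCA", params)))
```

Because the traceback returns only one of possibly several optimal alignments, a pair missing from the result may reflect tie-breaking rather than real sensitivity to the parameters; a fuller treatment would enumerate or sample co-optimal alignments.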