Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Hadamard transform
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== Molecular phylogenetics (evolutionary biology) applications == The Hadamard transform can be used to estimate [[phylogenetic tree]]s from molecular data.<ref>{{Cite journal| last1=Hendy| first1=Michael D.| last2=Penny| first2=David|date=December 1989|title=A Framework for the Quantitative Study of Evolutionary Trees| url=https://academic.oup.com/sysbio/article-lookup/doi/10.2307/2992396| journal=Systematic Zoology|volume=38| issue=4| pages=297| doi=10.2307/2992396|jstor=2992396}}</ref><ref>{{Cite journal|last1=Hendy|first1=Michael D.|last2=Penny|first2=David| date=January 1993|title=Spectral analysis of phylogenetic data|url=http://link.springer.com/10.1007/BF02638451|journal=Journal of Classification| language=en|volume=10|issue=1|pages=5–24|doi=10.1007/BF02638451|s2cid=122466038|issn=0176-4268}}</ref><ref>Székely, L. A., Erdős, P. L., Steel, M. A., & Penny, D. (1993). A Fourier inversion formula for evolutionary trees. ''Applied mathematics letters'', ''6''(2), 13–16.</ref> [[Phylogenetics]] is the subfield of [[evolutionary biology]] focused on understanding the relationships among organisms. A Hadamard transform applied to a vector (or matrix) of site pattern frequencies obtained from a DNA [[multiple sequence alignment]] can be used to generate another vector that carries information about the tree topology. The invertible nature of the phylogenetic Hadamard transform also allows the calculation of site likelihoods from a tree topology vector, allowing one to use the Hadamard transform for [[maximum likelihood estimation]] of phylogenetic trees. However, the latter application is less useful than the transformation from the site pattern vector to the tree vector because there are other ways to calculate site likelihoods<ref>{{Cite journal|last=Felsenstein|first=Joseph|date=November 1981| title=Evolutionary trees from DNA sequences: A maximum likelihood approach|url=http://link.springer.com/10.1007/BF01734359| journal=Journal of Molecular Evolution|language=en|volume=17|issue=6|pages=368–376|doi=10.1007/BF01734359|pmid=7288891|bibcode=1981JMolE..17..368F | s2cid=8024924| issn=0022-2844}}</ref><ref>{{Citation|last=Stamatakis|first=Alexandros|title=A Review of Approaches for Optimizing Phylogenetic Likelihood Calculations|date=2019|url=http://link.springer.com/10.1007/978-3-030-10837-3_1|work=Bioinformatics and Phylogenetics|series=Computational Biology|volume=29|pages=1–19|editor-last=Warnow|editor-first=Tandy| place=Cham| publisher=Springer International Publishing|language=en|doi=10.1007/978-3-030-10837-3_1| isbn=978-3-030-10836-6|s2cid=145834168 | access-date=2020-10-10}}</ref> that are much more efficient. However, the invertible nature of the phylogenetic Hadamard transform does provide an elegant tool for mathematic phylogenetics.<ref>{{Cite journal|last1=Chor|first1=Benny|author1-link= Benny Chor |last2=Hendy|first2=Michael D.| last3=Holland|first3=Barbara R.|last4=Penny|first4=David|date=2000-10-01|title=Multiple Maxima of Likelihood in Phylogenetic Trees: An Analytic Approach|url=http://academic.oup.com/mbe/article/17/10/1529/956181|journal=Molecular Biology and Evolution| language=en|volume=17|issue=10|pages=1529–1541|doi=10.1093/oxfordjournals.molbev.a026252|pmid=11018159|issn=1537-1719|doi-access=free}}</ref><ref>{{Cite journal|last1=Matsen|first1=Frederick A.|last2=Steel|first2=Mike|date=2007-10-01|editor-last=Ané| editor-first=Cécile|editor-link= Cécile Ané |editor2-last=Sullivan|editor2-first=Jack|title=Phylogenetic Mixtures on a Single Tree Can Mimic a Tree of Another Topology|journal=Systematic Biology|language=en|volume=56|issue=5| pages=767–775| doi=10.1080/10635150701627304| pmid=17886146| issn=1076-836X|doi-access=free|arxiv=0704.2260}}</ref> The mechanics of the phylogenetic Hadamard transform involve the calculation of a vector <math>\gamma(T)</math> that provides information about the topology and branch lengths for tree <math>T</math> using the [[site pattern]] vector or matrix <math>s(T)</math>. <math display="block">\gamma(T) = H^{-1}(\ln(Hs(T)))</math> where <math>H</math> is the Hadamard matrix of the appropriate size. This equation can be rewritten as a series of three equations to simplify its interpretation: <math display="block">\begin{align} r &= H s(T) \\ \rho &= \ln r \\ \gamma(T) &= H^{-1}\rho \end{align}</math> The invertible nature of this equation allows one to calculate an expected site pattern vector (or matrix) as follows: <math display="block">s(T)=H^{-1}(\exp(H\gamma(T)))</math> We can use the Cavender–Farris–[[Jerzy Neyman|Neyman]] (CFN) two-state [[substitution model]] for DNA by encoding the [[nucleotide]]s as binary characters (the [[purine]]s A and G are encoded as R and the [[pyrimidine]]s C and T are encoded as Y). This makes it possible to encode the multiple sequence alignment as the site pattern vector <math>s(T)</math> that can be converted to a tree vector <math>\gamma(T)</math>, as shown in the following example: {| class="wikitable" |+ Example showing the Hadamard transform for a specific tree (values for worked example adapted from Waddell et al. 1997<ref>{{Cite journal|last1=Waddell|first1=Peter J| last2=Steel| first2=M.A| date=December 1997|title=General Time-Reversible Distances with Unequal Rates across Sites: Mixing Γ and Inverse Gaussian Distributions with Invariant Sites|journal=Molecular Phylogenetics and Evolution|language=en|volume=8| issue=3| pages=398–414 |doi=10.1006/mpev.1997.0452|pmid=9417897|bibcode=1997MolPE...8..398W }}</ref>) |- ! Index ! Binary pattern ! Alignment patterns ! <math>\gamma(T)</math> ! <math>\rho = H^{-1}\gamma(T)</math> ! <math>r = \exp(\rho)</math> ! <math>s(T) = H^{-1}\rho</math> |- | 0 | 0000 | RRRR and YYYY | −0.475 | 0 | 1 | 0.6479 |- | 1 | 0001 | RRRY and YYYR | 0.2 | −0.5 | 0.6065 | 0.1283 |- | 2 | 0010 | RRYR and YYRY | 0.025 | −0.15 | 0.8607 | 0.02 |- | 3* | 0011 | RRYY and YYRR | 0.025 | −0.45 | 0.6376 | 0.0226 |- | 4 | 0100 | RYRR and YRYY | 0.2 | −0.45 | 0.6376 | 0.1283 |- | 5* | 0101 | RYRY and YRYR | 0 | −0.85 | 0.4274 | 0.0258 |- | 6* | 0110 | RYYR and YRRY | 0 | −0.5 | 0.6065 | 0.0070 |- | 7 | 0111 | RYYY and YRRR | 0.025 | −0.9 | 0.4066 | 0.02 |} The example shown in this table uses the simplified three equation scheme and it is for a four taxon tree that can be written as ((A,B),(C,D)); in [[newick format]]. The site patterns are written in the order ABCD. This particular tree has two long terminal branches (0.2 [[transversion]] substitutions per site), two short terminal branches (0.025 transversion substitutions per site), and a short internal branch (0.025 transversion substitutions per site); thus, it would be written as ((A:0.025,B:0.2):0.025,(C:0.025,D:0.2)); in newick format. This tree will exhibit [[long branch attraction]] if the data are analyzed using the [[Maximum parsimony (phylogenetics)|maximum parsimony]] criterion (assuming the sequence analyzed is long enough for the observed site pattern frequencies to be close to the expected frequencies shown in the <math>s(T)=H^{-1}\rho</math> column). The long branch attraction reflects the fact that the expected number of site patterns with index 6 -- which support the tree ((A,C),(B,D)); -- exceed the expected number of site patterns that support the true tree (index 4). Obviously, the invertible nature of the phylogenetic Hadamard transform means that the tree vector means that the tree vector <math>\gamma(T)</math> corresponds to the correct tree. Parsimony analysis after the transformation is therefore [[Consistent estimator|statistically consistent]],<ref>{{Cite journal|last1=Steel|first1=M. A.|last2=Hendy|first2=M. D.|last3=Penny|first3=D.|date=1993-12-01|title=Parsimony Can Be Consistent!| url=https://academic.oup.com/sysbio/article-lookup/doi/10.1093/sysbio/42.4.581|journal=Systematic Biology| language=en| volume=42|issue=4|pages=581–587|doi=10.1093/sysbio/42.4.581|issn=1063-5157}}</ref> as would be a standard maximum likelihood analysis using the correct model (in this case the CFN model). Note that the site pattern with 0 corresponds to the sites that have not changed (after encoding the nucleotides as purines or pyrimidines). The indices with asterisks (3, 5, and 6) are "parsimony-informative", and. the remaining indices represent site patterns where a single taxon differs from the other three taxa (so they are the equivalent of terminal branch lengths in a standard maximum likelihood phylogenetic tree). If one wishes to use nucleotide data without recoding as R and Y (and ultimately as 0 and 1) it is possible to encode the site patterns as a matrix. If we consider a four-taxon tree there are a total of 256 site patterns (four nucleotides to the 4th power). However, symmetries of the [[Models of DNA evolution|Kimura three-parameter (or K81) model]] allow us to reduce the 256 possible site patterns for DNA to 64 patterns, making it possible to encode the nucleotide data for a four-taxon tree as an 8 × 8 matrix<ref name=":0">{{Cite journal|last1=Hendy|first1=M. D.|last2=Penny|first2=D.|last3=Steel|first3=M. A.| date=1994-04-12| title=A discrete Fourier analysis for evolutionary trees.|journal=Proceedings of the National Academy of Sciences| language=en| volume=91|issue=8|pages=3339–3343|doi=10.1073/pnas.91.8.3339|issn=0027-8424|pmc=43572|pmid=8159749|bibcode=1994PNAS...91.3339H |doi-access=free}}</ref> in a manner similar to the vector of 8 elements used above for transversion (RY) site patterns. This is accomplished by recoding the data using the [[Klein four-group]]: {| class="wikitable" |+ Klein four-group coding for phylogenetic Hadamard transform |- ! Nucleotide 1 ! Nucleotide 2 ! Nucleotide 3 ! Nucleotide 4 |- |A (0,0) |G (1,0) |C (0,1) |T (1,1) |- |C (0,0) |T (1,0) |A (0,1) |G (1,1) |- |G (0,0) |A (1,0) |T (0,1) |C (1,1) |- |T (0,0) |C (1,0) |G (0,1) |A (1,1) |} As with RY data, site patterns are indexed relative to the base in the arbitrarily chosen first taxon with the bases in the subsequent taxa encoded relative to that first base. Thus, the first taxon receives the bit pair (0,0). Using those bit pairs one can produce two vectors similar to the RY vectors and then populate the matrix using those vectors. This can be illustrated using the example from Hendy et al. (1994),<ref name=":0" /> which is based on a multiple sequence alignment of four primate hemoglobin pseudogenes: {| class="wikitable" |+ Example of encoded sequence alignment (from Hendy et al. 1994<ref name=":0" />) (values are counts out of 9879 sites) |- ! ! 0 ! 8 ! 16 ! 24 ! 32 ! 40 ! 48 ! 56 |- |0 |8988 |9 |10 |12 |24 | | |90 |- |1 |41 |9 | |** | | | | |- |2 |45 | |13 | | | | | |- |3 |54* | | |14 | | | |3 |- |4 |94 | | | |20 | | | |- |5 | | |1 | | | | | |- |6 |2 | | | |2 | | | |- |7 |356 | |1 | |1 | | |75 |} The much larger number of site patterns in column 0 reflects the fact that column 0 corresponds to [[Transition (genetics)|transition]] differences, which accumulate more rapidly than transversion differences in virtually all comparisons of genomic regions (and definitely accumulate more rapidly in the hemoglobin pseudogenes used for this worked example<ref>{{Cite journal| last1=Miyamoto|first1=M. M.|last2=Koop|first2=B. F.|last3=Slightom|first3=J. L.|last4=Goodman|first4=M.| last5=Tennant| first5=M. R.|date=1988-10-01|title=Molecular systematics of higher primates: genealogical relations and classification.| journal= Proceedings of the National Academy of Sciences| language=en|volume=85|issue=20| pages=7627–7631| doi=10.1073/pnas.85.20.7627| issn=0027-8424| pmc=282245|pmid=3174657|bibcode=1988PNAS...85.7627M |doi-access=free}}</ref>). If we consider the site pattern AAGG it would to binary pattern 0000 for the second element of the Klein group bit pair and 0011 for the first element. in this case binary pattern based on the first element first element corresponds to index 3 (so row 3 in column 0; indicated with a single asterisk in the table). The site patterns GGAA, CCTT, and TTCC would be encoded in the exact same way. The site pattern AACT would be encoded with binary pattern 0011 based on the second element and 0001 based on the first element; this yields index 1 for the first element and index 3 for the second. The index based on the second Klein group bit pair is multiplied by 8 to yield the column index (in this case it would be column 24) The cell that would include the count of AACT site patterns is indicated with two asterisks; however, the absence of a number in the example indicates that the sequence alignment include no AACT site patterns (likewise, CCAG, GGTC, and TTGA site patterns, which would be encoded in the same way, are absent).
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)