==Applications==

===Speech recognition===
Hidden Markov models (HMMs) were first applied to speech recognition by [[James K. Baker]] in 1975.<ref>{{Cite journal |last1=Baker |first1=James K. |author-link1=James K. Baker |doi=10.1109/TASSP.1975.1162650 |title=The DRAGON system—An overview |journal=IEEE Transactions on Acoustics, Speech, and Signal Processing |volume=23 |pages=24–29 |year=1975 }}</ref> Continuous speech recognition modeled by an HMM proceeds in the following steps. First, feature analysis is performed on temporal and/or spectral features of the speech signal, producing an observation vector. The observation vector is then compared with all sequences of the speech recognition units, which could be [[phonemes]], syllables, or whole-word units. A lexicon decoding system constrains the paths investigated, so that only words in the system's lexicon (word dictionary) are considered. Similarly, the search is further constrained by the rules of grammar and syntax. Finally, semantic analysis is applied and the system outputs the recognized utterance.
A limitation of many HMM applications to speech recognition is that the current state depends only on the state at the previous time-step, which is unrealistic for speech because dependencies often span several time-steps.<ref>{{cite journal |last=Rabiner |first=Lawrence |title=A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition |journal=Proceedings of the IEEE |date=February 1989 |volume=77 |issue=2 |pages=257–286 |doi=10.1109/5.18626 |citeseerx=10.1.1.381.3454 |s2cid=13618539 }}</ref> The Baum–Welch algorithm is also widely used to train the HMMs used in speech synthesis.<ref>{{cite journal |last1=Tokuda |first1=Keiichi |first2=Takayoshi |last2=Yoshimura |first3=Takashi |last3=Masuko |first4=Takao |last4=Kobayashi |first5=Tadashi |last5=Kitamura |title=Speech Parameter Generation Algorithms for HMM-Based Speech Synthesis |journal=IEEE International Conference on Acoustics, Speech, and Signal Processing |year=2000 |volume=3 }}</ref>
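
The first-order dependence noted above as a limitation can be stated formally. Writing <math>X_t</math> for the hidden state at time-step <math>t</math> (a notation assumed here for illustration), a first-order HMM requires that

<math display="block">P(X_t \mid X_{t-1}, X_{t-2}, \ldots, X_1) = P(X_t \mid X_{t-1}),</math>

so any influence of states more than one time-step in the past must pass entirely through <math>X_{t-1}</math>.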

===Cryptanalysis===
Because the Baum–Welch algorithm estimates the parameters of HMMs from hidden or noisy data, it is often used in [[cryptanalysis]]. In data security, an observer would like to extract information from a data stream without knowing all the parameters of the transmission. This can involve reverse engineering a [[Coding theory|channel encoder]].<ref>{{cite journal |last1=Dingel |first1=Janis |first2=Joachim |last2=Hagenauer |title=Parameter Estimation of a Convolutional Encoder from Noisy Observations |journal=IEEE International Symposium on Information Theory |date=24 June 2007 }}</ref> HMMs, and consequently the Baum–Welch algorithm, have also been used to identify spoken phrases in encrypted VoIP calls.<ref>{{cite journal |last1=Wright |first1=Charles |first2=Lucas |last2=Ballard |first3=Scott |last3=Coull |first4=Fabian |last4=Monrose |first5=Gerald |last5=Masson |title=Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations |journal=IEEE International Symposium on Security and Privacy |year=2008 }}</ref> In addition, HMM cryptanalysis is an important tool for automated investigations of cache-timing data: it allows the automatic discovery of critical algorithm state, such as key values.<ref>{{cite book |last1=Brumley |first1=Bob |first2=Risto |last2=Hakala |title=Advances in Cryptology – ASIACRYPT 2009 |chapter=Cache-Timing Template Attacks |year=2009 |volume=5912 |pages=667–684 |doi=10.1007/978-3-642-10366-7_39 |series=Lecture Notes in Computer Science |isbn=978-3-642-10365-0 }}</ref>
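
As a rough illustration of the kind of parameter estimation these applications rely on, the following is a minimal sketch of Baum–Welch for a discrete-output HMM, written in Python with NumPy. It is not taken from any of the cited works: the function name <code>baum_welch</code>, the random initialisation, the fixed iteration count, and training on a single observation sequence are all assumptions made for illustration.

```python
import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=20, seed=0):
    """Estimate discrete-HMM parameters from one observation sequence.

    Hypothetical helper for illustration. Returns (pi, A, B, history),
    where history[k] is the log-likelihood of the parameters held at the
    start of iteration k.
    """
    rng = np.random.default_rng(seed)
    obs = np.asarray(obs)
    T = len(obs)

    # Random row-stochastic starting guesses (an arbitrary choice).
    pi = rng.random(n_states)
    pi /= pi.sum()
    A = rng.random((n_states, n_states))
    A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols))
    B /= B.sum(axis=1, keepdims=True)

    history = []
    for _ in range(n_iter):
        # Forward pass, rescaled at every step to avoid underflow.
        alpha = np.zeros((T, n_states))
        c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum()
            alpha[t] /= c[t]

        # Backward pass, using the same scale factors.
        beta = np.ones((T, n_states))
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]

        # The log-likelihood is the sum of the log scale factors.
        history.append(np.log(c).sum())

        # Posterior state and transition probabilities.
        gamma = alpha * beta  # each row sums to 1 under this scaling
        emit_next = (B[:, obs[1:]].T * beta[1:])[:, None, :]
        xi = alpha[:-1, :, None] * A[None] * emit_next / c[1:, None, None]

        # Re-estimate the parameters from the expected counts.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        gamma_sum = gamma.sum(axis=0)
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0) / gamma_sum

    return pi, A, B, history
```

Calling <code>baum_welch(obs, n_states=2, n_symbols=3)</code> on a list of integer symbols returns the estimated initial distribution, transition matrix and emission matrix. Because each iteration is an expectation–maximization step, the recorded log-likelihoods should not decrease, which is a useful sanity check.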

===Bioinformatics===

====Finding genes====

=====Prokaryotic=====
The [[GLIMMER]] (Gene Locator and Interpolated Markov ModelER) software was an early [[locus (genetics)|gene-finding]] program used for the identification of coding regions in [[prokaryotic]] DNA.<ref name="GLIMMER paper">{{cite journal |last1=Salzberg |first1=Steven |first2=Arthur L. |last2=Delcher |first3=Simon |last3=Kasif |first4=Owen |last4=White |title=Microbial gene identification using interpolated Markov Models |journal=Nucleic Acids Research |year=1998 |volume=26 |issue=2 |pages=544–548 |doi=10.1093/nar/26.2.544 |pmid=9421513 |pmc=147303 }}</ref><ref name="GLIMMER web">{{cite web |title=Glimmer: Microbial Gene-Finding System |url=http://ccb.jhu.edu/software/glimmer/index.shtml |publisher=Johns Hopkins University - Center for Computational Biology }}</ref> GLIMMER uses Interpolated Markov Models (IMMs) to identify the [[exon|coding regions]] and distinguish them from the [[introns|noncoding DNA]]. The latest release (GLIMMER3) has been shown to have increased [[specificity (statistics)|specificity]] and accuracy compared with its predecessors with regard to predicting translation initiation sites, demonstrating an average 99% accuracy in locating 3' locations compared to confirmed genes in prokaryotes.<ref>{{cite journal |last1=Delcher |first1=Arthur |first2=Kirsten A. |last2=Bratke |first3=Edwin C. |last3=Powers |first4=Steven L. |last4=Salzberg |title=Identifying bacterial genes and endosymbiont DNA with Glimmer |journal=Bioinformatics |year=2007 |volume=23 |issue=6 |pages=673–679 |doi=10.1093/bioinformatics/btm009 |pmid=17237039 |pmc=2387122 }}</ref>

=====Eukaryotic=====
The [[GENSCAN]] webserver is a gene locator capable of analyzing [[eukaryotic]] sequences up to one million [[base-pairs]] (1 Mbp) long.<ref>{{cite web |last=Burge |first=Christopher |title=The GENSCAN Web Server at MIT |url=http://genes.mit.edu/GENSCAN.html |access-date=2 October 2013 |archive-url=https://web.archive.org/web/20130906115338/http://genes.mit.edu/GENSCAN.html |archive-date=6 September 2013 |url-status=dead }}</ref> GENSCAN uses a general inhomogeneous, three-periodic, fifth-order Markov model of DNA coding regions. Additionally, this model accounts for differences in gene density and structure (such as intron lengths) that occur in different [[Isochore (genetics)|isochores]]. While most integrated gene-finding software at the time of GENSCAN's release assumed that input sequences contained exactly one gene, GENSCAN solves the general case in which partial, complete, or multiple genes (or even no gene at all) are present.<ref>{{cite journal |last1=Burge |first1=Chris |first2=Samuel |last2=Karlin |title=Prediction of Complete Gene Structures in Human Genomic DNA |journal=Journal of Molecular Biology |year=1997 |volume=268 |pages=78–94 |doi=10.1006/jmbi.1997.0951 |pmid=9149143 |issue=1 |citeseerx=10.1.1.115.3107 }}</ref> GENSCAN was shown to predict exon locations exactly with 90% accuracy and 80% specificity compared to an annotated database.<ref>{{cite journal |last1=Burge |first1=Christopher |first2=Samuel |last2=Karlin |title=Finding the Genes in Genomic DNA |journal=Current Opinion in Structural Biology |year=1998 |volume=8 |issue=3 |pages=346–354 |doi=10.1016/s0959-440x(98)80069-9 |pmid=9666331 |doi-access=free }}</ref>
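
To make the idea of a three-periodic, fifth-order Markov model concrete, the sketch below keeps a separate fifth-order transition table for each of the three codon positions and scores a sequence under them. It is a simplified illustration in Python, not GENSCAN's implementation: the function names, the add-one pseudocounts, and the assumption that every sequence contains only A, C, G and T and starts in frame at position 0 are all hypothetical choices.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_periodic_markov(sequences, order=5, pseudocount=1.0):
    """Count next-base frequencies per context, one table per codon position."""
    counts = [
        defaultdict(lambda: dict.fromkeys(ALPHABET, pseudocount))
        for _ in range(3)
    ]
    for seq in sequences:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # i % 3 selects the table for this codon position
            # (sequences are assumed to start in frame).
            counts[i % 3][context][base] += 1
    return counts

def log_prob(seq, counts, order=5, pseudocount=1.0):
    """Log-probability of seq[order:] given its first `order` bases."""
    total = 0.0
    for i in range(order, len(seq)):
        table = counts[i % 3].get(seq[i - order:i])
        if table is None:
            # Context never seen at this codon position: fall back to uniform.
            table = dict.fromkeys(ALPHABET, pseudocount)
        total += math.log(table[seq[i]] / sum(table.values()))
    return total
```

A model trained on in-frame coding sequences with <code>train_periodic_markov</code> should then assign a higher <code>log_prob</code> to a sequence read in the correct frame than to the same sequence shifted by one base, which is the signal a gene finder can exploit.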

====Copy-number variation detection====
[[Copy-number variation]]s (CNVs) are an abundant form of genome structure variation in humans. A discrete-valued bivariate HMM (dbHMM) was used to assign chromosomal regions to seven distinct states: unaffected regions, deletions, duplications and four transition states. Solving this model using Baum–Welch demonstrated the ability to predict the location of CNV breakpoints to within approximately 300 bp from [[DNA microarray|micro-array experiments]].<ref>{{cite journal |last1=Korbel |first1=Jan |author-link=Jan O. Korbel |first2=Alexander |last2=Urban |first3=Fabien |last3=Grubert |first4=Jiang |last4=Du |first5=Thomas |last5=Royce |first6=Peter |last6=Starr |first7=Guoneng |last7=Zhong |first8=Beverly |last8=Emanuel |first9=Sherman |last9=Weissman |first10=Michael |last10=Snyder |first11=Marg |last11=Gerstein |title=Systematic prediction and validation of breakpoints associated with copy-number variations in the human genome |journal=Proceedings of the National Academy of Sciences of the United States of America |date=12 June 2007 |volume=104 |issue=24 |pages=10110–5 |doi=10.1073/pnas.0703834104 |pmid=17551006 |pmc=1891248 |bibcode=2007PNAS..10410110K |doi-access=free }}</ref> This resolution enables more precise correlations between different CNVs and [[structural variations|across populations]] than was previously possible, allowing the study of CNV population frequencies. It also demonstrated a [[Mendelian inheritance|direct inheritance pattern for a particular CNV]].