=== Bayesian modeling of the transition probabilities ===
Hidden Markov models are [[generative model]]s, in which the [[joint distribution]] of observations and hidden states is modeled. Equivalently, both the [[prior distribution]] of hidden states (the ''transition probabilities'') and the [[conditional distribution]] of observations given states (the ''emission probabilities'') are modeled. The above algorithms implicitly assume a [[Uniform distribution (continuous)|uniform]] prior distribution over the transition probabilities. However, it is also possible to create hidden Markov models with other types of prior distributions.

An obvious candidate, given the categorical distribution of the transition probabilities, is the [[Dirichlet distribution]], which is the [[conjugate prior]] distribution of the categorical distribution. Typically, a symmetric Dirichlet distribution is chosen, reflecting ignorance about which states are inherently more likely than others. The single parameter of this distribution (termed the ''concentration parameter'') controls the relative density or sparseness of the resulting transition matrix. A choice of 1 yields a uniform distribution. Values greater than 1 produce a dense matrix, in which the transition probabilities between pairs of states are likely to be nearly equal. Values less than 1 result in a sparse matrix in which, for each given source state, only a small number of destination states have non-negligible transition probabilities.

It is also possible to use a two-level Dirichlet prior, in which one Dirichlet distribution (the upper distribution) governs the parameters of another Dirichlet distribution (the lower distribution), which in turn governs the transition probabilities. The upper distribution governs the overall distribution of states, determining how likely each state is to occur; its concentration parameter determines the density or sparseness of states. Such a two-level prior distribution, where both concentration parameters are set to produce sparse distributions, might be useful, for example, in [[unsupervised learning|unsupervised]] [[part-of-speech tagging]], where some parts of speech occur much more commonly than others; learning algorithms that assume a uniform prior distribution generally perform poorly on this task.

The parameters of models of this sort, with non-uniform prior distributions, can be learned using [[Gibbs sampling]] or extended versions of the [[expectation-maximization algorithm]].
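
The effect of the concentration parameter can be illustrated by drawing transition matrices from the prior itself. The following is a minimal sketch rather than a reference implementation: the function names, the matrix size, and the seeds are illustrative assumptions, and the two-level sampler assumes one common parameterization in which the lower Dirichlet parameters equal a scale factor (here called <code>lower_scale</code>) times the state weights drawn from the upper distribution. Dirichlet draws are obtained by normalizing independent Gamma draws.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_transition_matrix(num_states, concentration, rng):
    """Single-level prior: every row comes from the same symmetric Dirichlet."""
    alphas = [concentration] * num_states
    return [sample_dirichlet(alphas, rng) for _ in range(num_states)]


def sample_two_level_matrix(num_states, upper_concentration, lower_scale, rng):
    """Two-level prior: an upper draw sets overall state weights, rows are centred on them."""
    state_weights = sample_dirichlet([upper_concentration] * num_states, rng)
    # Assumed parameterization: lower parameters = lower_scale * state_weights.
    # The tiny floor only keeps the Gamma shape parameter strictly positive.
    alphas = [max(lower_scale * w, 1e-12) for w in state_weights]
    rows = [sample_dirichlet(alphas, rng) for _ in range(num_states)]
    return state_weights, rows


def mean_largest_entry(matrix):
    """Average of the largest probability in each row; higher means sparser rows."""
    return sum(max(row) for row in matrix) / len(matrix)


rng = random.Random(0)
for concentration in (0.1, 1.0, 10.0):
    matrix = sample_transition_matrix(50, concentration, rng)
    print(concentration, round(mean_largest_entry(matrix), 3))

state_weights, rows = sample_two_level_matrix(50, 0.1, 5.0, rng)
print(round(max(state_weights), 3), round(mean_largest_entry(rows), 3))
```

In this sketch, the average largest entry per row serves as a rough sparseness measure: it should be large for concentration values below 1 and shrink toward the reciprocal of the number of states as the concentration grows. In the two-level case, all rows tend to favour the same few destination states, because they share the state weights drawn from the upper distribution.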

An extension of the previously described hidden Markov models with [[Dirichlet distribution|Dirichlet]] priors uses a [[Dirichlet process]] in place of a Dirichlet distribution. This type of model allows for an unknown and potentially infinite number of states. It is common to use a two-level Dirichlet process, similar to the previously described model with two levels of Dirichlet distributions. Such a model is called a ''hierarchical Dirichlet process hidden Markov model'', or ''HDP-HMM'' for short. It was originally described under the name "Infinite Hidden Markov Model"<ref>Beal, Matthew J., Zoubin Ghahramani, and Carl Edward Rasmussen. "The infinite hidden Markov model." Advances in Neural Information Processing Systems 14 (2002): 577-584.</ref> and was further formalized in "Hierarchical Dirichlet Processes".<ref>Teh, Yee Whye, et al. "Hierarchical Dirichlet processes." Journal of the American Statistical Association 101.476 (2006).</ref>