===Algorithm===
Initialize <math>\theta = (A, B, \pi)</math> with random values. If prior information about the parameters is available, it can be used instead; this can speed up the algorithm and also steer it toward the desired local maximum.

====Forward procedure====
Let <math>\alpha_i(t)=P(Y_1=y_1,\ldots,Y_t=y_t,X_t=i\mid\theta)</math>, the probability of seeing the observations <math>y_1,y_2,\ldots,y_t</math> and being in state <math>i</math> at time <math>t</math>. This is found recursively:
#<math>\alpha_i(1)=\pi_i b_i(y_1),</math>
#<math>\alpha_i(t+1)=b_i(y_{t+1}) \sum_{j=1}^N \alpha_j(t) a_{ji}.</math>
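
The recursion above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name <code>forward</code>, the list-of-lists layout and the two-state model are assumptions made for the example, and the code uses 0-based indices where the text uses 1-based ones.

```python
def forward(pi, A, B, obs):
    """Unscaled forward pass.

    pi[i]   : initial probability of state i
    A[j][i] : transition probability from state j to state i
    B[i][k] : probability that state i emits symbol k
    obs     : observed symbol indices y_1, ..., y_T
    Returns alpha, where alpha[t][i] corresponds to alpha_i(t+1) in the text.
    """
    n = len(pi)
    # Base case: alpha_i(1) = pi_i * b_i(y_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    # Recursion: alpha_i(t+1) = b_i(y_{t+1}) * sum_j alpha_j(t) * a_ji
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[i][y] * sum(prev[j] * A[j][i] for j in range(n))
                      for i in range(n)])
    return alpha


# Hypothetical two-state, two-symbol model, for illustration only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
alpha = forward(pi, A, B, [0, 1, 0])
likelihood = sum(alpha[-1])  # P(Y | theta), obtained by summing alpha_i(T) over i
```

Summing the last row of <code>alpha</code> gives the likelihood of the whole observed sequence under the current parameters.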

Since the values <math>\alpha_i(t)</math> shrink exponentially toward zero as <math>t</math> grows, the computation will numerically underflow for longer sequences.<ref>{{cite web|url=https://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/tutorial%20on%20hmm%20and%20applications.pdf|title=A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition|last=Rabiner|first=Lawrence|date=February 1989|publisher=Proceedings of the IEEE|access-date=29 November 2019}}</ref> However, this can be avoided in a slightly modified algorithm that scales <math>\alpha</math> in the forward procedure and <math>\beta</math> in the backward procedure below.
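
One common way to apply such scaling is to normalize <math>\alpha</math> at every time step and accumulate the logarithms of the scaling factors. The sketch below assumes this particular scheme; the name <code>forward_scaled</code> and the example model are illustrative, and other scaling conventions exist.

```python
import math


def forward_scaled(pi, A, B, obs):
    """Forward pass that rescales alpha at every step.

    Returns (alpha_hat, log_likelihood); each row of alpha_hat sums to 1.
    Assumes every observation has non-zero probability under the model.
    """
    n = len(pi)
    alpha_hat, log_likelihood = [], 0.0
    for t, y in enumerate(obs):
        if t == 0:
            row = [pi[i] * B[i][y] for i in range(n)]
        else:
            prev = alpha_hat[-1]
            row = [B[i][y] * sum(prev[j] * A[j][i] for j in range(n))
                   for i in range(n)]
        scale = sum(row)  # equals P(y_t | y_1, ..., y_{t-1}, theta)
        alpha_hat.append([v / scale for v in row])
        log_likelihood += math.log(scale)
    return alpha_hat, log_likelihood


# Hypothetical two-state, two-symbol model, for illustration only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
long_obs = [0, 1] * 2000  # 4000 observations
_, log_lik = forward_scaled(pi, A, B, long_obs)
# log_lik stays finite, although P(Y | theta) itself is smaller than the
# smallest positive double-precision number.
```

Working with the log-likelihood also gives a convenient quantity to monitor for convergence between iterations.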

====Backward procedure====
Let <math>\beta_i(t)=P(Y_{t+1}=y_{t+1},\ldots,Y_T=y_{T}\mid X_t=i,\theta)</math>, that is, the probability of the ending partial sequence <math>y_{t+1},\ldots,y_T</math> given starting state <math>i</math> at time <math>t</math>. We calculate <math>\beta_i(t)</math> recursively:
# <math>\beta_i(T)=1,</math>
# <math>\beta_i(t)=\sum_{j=1}^N \beta_j(t+1) a_{ij} b_j(y_{t+1}).</math>
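
The backward recursion can be sketched in the same style. The function name <code>backward</code> and the example model are assumptions for illustration; indices are 0-based in the code.

```python
def backward(A, B, obs):
    """Unscaled backward pass.

    A[i][j] : transition probability from state i to state j
    B[j][k] : probability that state j emits symbol k
    Returns beta, where beta[t][i] corresponds to beta_i(t+1) in the text.
    """
    n, T = len(A), len(obs)
    # Base case: beta_i(T) = 1 for every state i
    beta = [[1.0] * n for _ in range(T)]
    # Recursion, running backwards in time:
    # beta_i(t) = sum_j beta_j(t+1) * a_ij * b_j(y_{t+1})
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(beta[t + 1][j] * A[i][j] * B[j][obs[t + 1]]
                       for j in range(n))
                   for i in range(n)]
    return beta


# Hypothetical two-state, two-symbol model, for illustration only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
beta = backward(A, B, obs)
# P(Y | theta) can also be recovered from the first time step.
likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
```

The likelihood computed here from <math>\beta</math> agrees with the one obtained from the forward procedure, which is a useful sanity check for an implementation.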

====Update====
We can now calculate the temporary variables, according to Bayes' theorem:
:<math>\gamma_i(t)=P(X_t=i\mid Y,\theta) = \frac{P(X_t=i,Y\mid\theta)}{P(Y\mid\theta)} = \frac{\alpha_i(t)\beta_i(t)}{\sum_{j=1}^N \alpha_j(t)\beta_j(t)},</math>
which is the probability of being in state <math>i</math> at time <math>t</math> given the observed sequence <math>Y</math> and the parameters <math>\theta</math>.
:<math>\xi_{ij}(t)=P(X_t=i,X_{t+1}=j\mid Y,\theta) = \frac{P(X_t=i,X_{t+1}=j,Y\mid\theta)}{P(Y\mid\theta)} = \frac{\alpha_i(t) a_{ij} \beta_j(t+1) b_j(y_{t+1})}{\sum_{k=1}^N \sum_{w=1}^N \alpha_k(t) a_{kw} \beta_w(t+1) b_w(y_{t+1}) }, </math>
which is the probability of being in states <math>i</math> and <math>j</math> at times <math>t</math> and <math>t+1</math>, respectively, given the observed sequence <math>Y</math> and the parameters <math>\theta</math>.

The denominators of <math>\gamma_i(t)</math> and <math>\xi_{ij}(t)</math> are the same; they represent the probability of observing the sequence <math>Y</math> given the parameters <math>\theta</math>.
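
As a sketch, both quantities can be computed from <math>\alpha</math> and <math>\beta</math> as follows. The function name <code>posteriors</code> is illustrative, and the hard-coded <code>alpha</code> and <code>beta</code> values are those of a hypothetical two-state model with the observation sequence 0, 1, 0, included only so that the example runs on its own.

```python
def posteriors(alpha, beta, A, B, obs):
    """Compute gamma[t][i] and xi[t][i][j] from forward and backward values."""
    n, T = len(A), len(obs)
    gamma = []
    for t in range(T):
        norm = sum(alpha[t][j] * beta[t][j] for j in range(n))  # P(Y | theta)
        gamma.append([alpha[t][i] * beta[t][i] / norm for i in range(n)])
    xi = []
    for t in range(T - 1):
        raw = [[alpha[t][i] * A[i][j] * beta[t + 1][j] * B[j][obs[t + 1]]
                for j in range(n)]
               for i in range(n)]
        norm = sum(sum(row) for row in raw)  # again P(Y | theta)
        xi.append([[v / norm for v in row] for row in raw])
    return gamma, xi


# Hypothetical model and precomputed forward/backward values (assumed inputs).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
alpha = [[0.54, 0.08], [0.041, 0.168], [0.08631, 0.02262]]
beta = [[0.1635, 0.258], [0.69, 0.48], [1.0, 1.0]]
gamma, xi = posteriors(alpha, beta, A, B, obs)
```

Each row <code>gamma[t]</code> sums to one, and summing <code>xi[t][i][j]</code> over <code>j</code> recovers <code>gamma[t][i]</code>, which reflects the shared denominator.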

The parameters of the hidden Markov model <math>\theta</math> can now be updated:
*<math>\pi_i^* = \gamma_i(1),</math>
which is the expected frequency spent in state <math>i</math> at time <math>1</math>.
*<math>a_{ij}^*=\frac{\sum^{T-1}_{t=1}\xi_{ij}(t)}{\sum^{T-1}_{t=1}\gamma_i(t)},</math>
which is the expected number of transitions from state ''i'' to state ''j'' compared to the expected total number of transitions away from state ''i''. To clarify, the number of transitions away from state ''i'' does not mean transitions to a different state ''j'', but to any state including itself. This is equivalent to the number of times state ''i'' is observed in the sequence from ''t''&nbsp;=&nbsp;1 to ''t''&nbsp;=&nbsp;''T''&nbsp;−&nbsp;1.
*<math>b_i^*(v_k)=\frac{\sum^T_{t=1} 1_{y_t=v_k} \gamma_i(t)}{\sum^T_{t=1} \gamma_i(t)},</math>
where
:<math>
1_{y_t=v_k}=
\begin{cases}
1 & \text{if } y_t=v_k,\\
0 & \text{otherwise}
\end{cases}
</math>
is an indicator function, and <math>b_i^*(v_k)</math> is the expected number of times the output observations have been equal to <math>v_k</math> while in state <math>i</math> over the expected total number of times in state <math>i</math>.
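
The three update formulas translate directly into code. In the sketch below the function name <code>reestimate</code> is an assumption, and the <code>gamma</code> and <code>xi</code> values are made-up but mutually consistent posteriors for a two-state model, chosen only to make the arithmetic easy to follow.

```python
def reestimate(gamma, xi, obs, n_symbols):
    """Return updated (pi, A, B) from the posteriors of one sequence."""
    n, T = len(gamma[0]), len(obs)
    # pi_i* = gamma_i(1)
    new_pi = list(gamma[0])
    # a_ij* = sum_{t<T} xi_ij(t) / sum_{t<T} gamma_i(t)
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)]
             for i in range(n)]
    # b_i*(v_k) = sum_t [y_t == v_k] gamma_i(t) / sum_t gamma_i(t)
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(n_symbols)]
             for i in range(n)]
    return new_pi, new_A, new_B


# Hypothetical posteriors for a sequence of length 3 (illustrative values).
obs = [0, 1, 0]
gamma = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
xi = [[[0.2, 0.6], [0.1, 0.1]],
      [[0.2, 0.1], [0.4, 0.3]]]
new_pi, new_A, new_B = reestimate(gamma, xi, obs, n_symbols=2)
```

With these inputs, for instance, <code>new_A[0][0]</code> is (0.2 + 0.2) / (0.8 + 0.3), and every row of <code>new_A</code> and <code>new_B</code> sums to one.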

These steps are repeated iteratively until a desired level of convergence is reached.
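
A compact sketch of the whole iteration is given below. It assumes, as one possible stopping rule, that iteration ends when the log-likelihood improves by less than a tolerance; the names <code>em_step</code>, <code>baum_welch</code>, <code>tol</code> and <code>max_iter</code>, as well as the model and data, are illustrative. The passes are unscaled, so the sketch is only suitable for short sequences.

```python
import math


def em_step(pi, A, B, obs):
    """One iteration; returns (log P(Y | current theta), new_pi, new_A, new_B)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    # Forward pass (unscaled).
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([B[i][obs[t]] * sum(alpha[t - 1][j] * A[j][i] for j in range(n))
                      for i in range(n)])
    # Backward pass (unscaled).
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(beta[t + 1][j] * A[i][j] * B[j][obs[t + 1]] for j in range(n))
                   for i in range(n)]
    lik = sum(alpha[-1])  # P(Y | theta)
    # Posterior state and transition probabilities.
    gamma = [[alpha[t][i] * beta[t][i] / lik for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * beta[t + 1][j] * B[j][obs[t + 1]] / lik
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # Parameter updates.
    new_pi = gamma[0][:]
    new_A = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
              for j in range(n)] for i in range(n)]
    new_B = [[sum(g[i] for g, y in zip(gamma, obs) if y == k) / sum(g[i] for g in gamma)
              for k in range(m)] for i in range(n)]
    return math.log(lik), new_pi, new_A, new_B


def baum_welch(pi, A, B, obs, tol=1e-6, max_iter=200):
    """Repeat em_step until the log-likelihood gain drops below tol."""
    history = []  # log-likelihood of the parameters entering each iteration
    for _ in range(max_iter):
        log_lik, pi, A, B = em_step(pi, A, B, obs)
        converged = bool(history) and log_lik - history[-1] < tol
        history.append(log_lik)
        if converged:
            break
    return pi, A, B, history


# Hypothetical starting parameters and observations, for illustration only.
obs = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0]
pi, A, B, history = baum_welch([0.6, 0.4],
                               [[0.7, 0.3], [0.4, 0.6]],
                               [[0.9, 0.1], [0.2, 0.8]],
                               obs)
```

The recorded log-likelihoods in <code>history</code> should never decrease from one iteration to the next, which is a practical check that the updates are implemented correctly.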

'''Note:''' It is possible to over-fit a particular data set. That is, <math>P(Y\mid\theta_\text{final}) > P(Y \mid \theta_\text{true}) </math>. The algorithm also does '''not''' guarantee a global maximum.

====Multiple sequences====

The algorithm described thus far assumes a single observed sequence <math>Y = y_1, \ldots, y_T</math>. However, in many situations several sequences are observed: <math>Y_1, \ldots, Y_R</math>. In this case, the information from all of the observed sequences must be used in the update of the parameters <math>A</math>, <math>\pi</math>, and <math>B</math>. Assuming that <math>\gamma_{ir}(t)</math> and <math>\xi_{ijr}(t)</math> have been computed for each sequence <math>y_{1,r},\ldots,y_{T_r,r}</math> of length <math>T_r</math>, the parameters can be updated:
*<math>\pi_i^* = \frac{\sum_{r=1}^{R}\gamma_{ir}(1)}{R},</math>
*<math>a_{ij}^*=\frac{\sum_{r=1}^{R} \sum^{T_r-1}_{t=1}\xi_{ijr}(t)}{\sum_{r=1}^{R} \sum^{T_r-1}_{t=1}\gamma_{ir}(t)},</math>
*<math>b_i^*(v_k)=\frac{\sum_{r=1}^{R} \sum^{T_r}_{t=1} 1_{y_{tr}=v_k} \gamma_{ir}(t)}{\sum_{r=1}^{R} \sum^{T_r}_{t=1} \gamma_{ir}(t)},</math>
where
:<math>
1_{y_{tr}=v_k}=
\begin{cases}
1 & \text{if } y_{t,r}=v_k,\\
0 & \text{otherwise}
\end{cases}
</math>
is an indicator function.
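
The pooled updates can be sketched as follows. The function name <code>pooled_reestimate</code> and the posterior values are assumptions for illustration; in this sketch each sequence contributes sums over its own length, so the sequences need not be equally long.

```python
def pooled_reestimate(gammas, xis, sequences, n_symbols):
    """Update (pi, A, B) from per-sequence posteriors.

    gammas[r][t][i], xis[r][t][i][j] and sequences[r][t] refer to sequence r.
    """
    R, n = len(sequences), len(gammas[0][0])
    # pi_i* : average of gamma_ir(1) over the R sequences
    new_pi = [sum(g[0][i] for g in gammas) / R for i in range(n)]
    # a_ij* : pooled expected transitions i -> j over pooled visits to i
    new_A = [[sum(x[i][j] for xi in xis for x in xi) /
              sum(row[i] for g in gammas for row in g[:-1])
              for j in range(n)]
             for i in range(n)]
    # b_i*(v_k) : pooled expected emissions of v_k from i over pooled visits to i
    new_B = [[sum(row[i] for g, seq in zip(gammas, sequences)
                  for row, y in zip(g, seq) if y == k) /
              sum(row[i] for g in gammas for row in g)
              for k in range(n_symbols)]
             for i in range(n)]
    return new_pi, new_A, new_B


# Two hypothetical sequences with made-up, mutually consistent posteriors.
sequences = [[0, 1, 0], [1, 1]]
gammas = [[[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]],
          [[0.5, 0.5], [0.4, 0.6]]]
xis = [[[[0.2, 0.6], [0.1, 0.1]], [[0.2, 0.1], [0.4, 0.3]]],
       [[[0.3, 0.2], [0.1, 0.4]]]]
new_pi, new_A, new_B = pooled_reestimate(gammas, xis, sequences, n_symbols=2)
```

Numerators and denominators are summed over all sequences before dividing, rather than averaging per-sequence estimates; only <math>\pi</math> is a plain average, because every sequence contributes exactly one initial state.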