Expectation–maximization algorithm
== Properties ==
Although an EM iteration does not decrease the observed-data (i.e., marginal) likelihood function, no guarantee exists that the sequence converges to a [[maximum likelihood estimator]]. For [[bimodal distribution|multimodal distributions]], this means that an EM algorithm may converge to a [[local maximum]] of the observed-data likelihood function, depending on starting values. A variety of heuristic or [[metaheuristic]] approaches exist for escaping such a local maximum, such as random-restart [[hill climbing]] (starting with several different random initial estimates <math>\boldsymbol\theta^{(t)}</math>) or applying [[simulated annealing]] methods (a sketch of the random-restart strategy appears at the end of this section).

EM is especially useful when the likelihood is an [[exponential family]] (see Sundberg (2019, Ch. 8) for a comprehensive treatment):<ref>{{cite book |last1=Sundberg |first1=Rolf |title=Statistical Modelling by Exponential Families |date=2019 |publisher=Cambridge University Press |isbn=9781108701112}}</ref> the E step becomes the sum of expectations of [[sufficient statistic]]s, and the M step involves maximizing a linear function. In such a case, it is usually possible to derive [[closed-form expression]] updates for each step, using the Sundberg formula<ref>{{cite book |last1=Laird |first1=Nan |title=Encyclopedia of Statistical Sciences |chapter=Sundberg formulas |chapter-url=https://doi.org/10.1002/0471667196.ess2643.pub2 |publisher=Wiley |date=2006 |doi=10.1002/0471667196.ess2643.pub2 |isbn=0471667196}}</ref> (proved and published by Rolf Sundberg, based on unpublished results of [[Per Martin-Löf]] and [[Anders Martin-Löf]]).<ref name="Sundberg1971"/><ref name="Sundberg1976"/><ref name="Martin-Löf1966"/><ref name="Martin-Löf1970"/><ref name="Martin-Löf1974a"/><ref name="Martin-Löf1974b"/> A sketch of such closed-form updates also appears at the end of this section.

The EM method was modified to compute [[maximum a posteriori]] (MAP) estimates for [[Bayesian inference]] in the original paper by Dempster, Laird, and Rubin.

Other methods exist for finding maximum likelihood estimates, such as [[gradient descent]], [[conjugate gradient]], or variants of the [[Gauss–Newton algorithm]]. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.
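The random-restart strategy mentioned above can be made concrete with a short sketch. The following minimal illustration (not taken from the cited references; the two-component Gaussian mixture model, the helper names, and the parameter choices are assumptions made for the example) runs EM from several random initial estimates and keeps the run with the highest observed-data log-likelihood.

<syntaxhighlight lang="python">
import numpy as np

def em_gmm_1d(x, n_iter=200, rng=None):
    """EM for a two-component 1-D Gaussian mixture; returns (log-likelihood, parameters)."""
    rng = np.random.default_rng() if rng is None else rng
    # Random initial estimates: equal weights, two data points as means, pooled variance.
    w = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False)
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E step: responsibilities (posterior probability of each component).
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = (w * dens) / (w * dens).sum(axis=1, keepdims=True)
        # M step: weighted maximum-likelihood updates of weights, means, variances.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    # Observed-data log-likelihood at the final parameter values.
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    loglik = np.log((w * dens).sum(axis=1)).sum()
    return loglik, (w, mu, var)

def random_restart_em(x, n_restarts=10, seed=0):
    """Run EM from several random initialisations and keep the best local maximum found."""
    rng = np.random.default_rng(seed)
    return max((em_gmm_1d(x, rng=rng) for _ in range(n_restarts)),
               key=lambda run: run[0])

# Synthetic data from a mixture of N(-2, 1) and N(3, 1).
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
best_loglik, (w, mu, var) = random_restart_em(x)
</syntaxhighlight>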
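For the exponential-family case, the following minimal sketch (again an assumed illustration, not drawn from the cited references) shows closed-form updates for i.i.d. exponential lifetimes with right-censoring: the complete-data sufficient statistic is the total lifetime, the E step replaces each censored value by its conditional expectation <math>c + 1/\lambda</math> (memorylessness), and the M step is the closed-form complete-data maximum likelihood estimate.

<syntaxhighlight lang="python">
import numpy as np

def em_censored_exponential(obs, cens, n_iter=100):
    """EM for exponential lifetimes (rate lam); obs are observed lifetimes, cens are censoring times."""
    n = len(obs) + len(cens)
    lam = 1.0 / np.concatenate([obs, cens]).mean()  # crude starting value
    for _ in range(n_iter):
        # E step: expected complete-data sufficient statistic (total lifetime);
        # by memorylessness, E[X | X > c] = c + 1/lam for each censored point.
        expected_total = obs.sum() + (cens + 1.0 / lam).sum()
        # M step: closed-form maximiser of the expected complete-data log-likelihood.
        lam = n / expected_total
    return lam

rng = np.random.default_rng(0)
lifetimes = rng.exponential(scale=2.0, size=1000)  # true rate is 0.5
c = 3.0                                            # fixed censoring time
obs = lifetimes[lifetimes <= c]
cens = np.full(int((lifetimes > c).sum()), c)
print(em_censored_exponential(obs, cens))          # estimate should be near 0.5
</syntaxhighlight>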