==Statistical MDL learning==
Any set of data can be represented by a string of [[symbols]] from a finite (say, [[binary numeral system|binary]]) [[alphabet]].

<blockquote>[The MDL Principle] is based on the following insight: any regularity in a given set of data can be used to [[data compression|compress the data]], i.e. to describe it using fewer symbols than needed to describe the data literally. (Grünwald, 2004)<ref name="peter">{{cite journal |last1=Grunwald |first1=Peter |title=A tutorial introduction to the minimum description length principle |date=June 2004 |arxiv=math/0406077 |bibcode=2004math......6077G }}</ref></blockquote>

Based on this, in 1978, Jorma Rissanen published an MDL learning algorithm using [[Entropy (information theory)|the statistical notion of information]] rather than algorithmic information. Over the past 40 years this has developed into a rich theory of statistical and machine-learning procedures with connections to [[Bayesian model averaging|Bayesian model selection and averaging]], penalization methods such as [[Lasso (statistics)|Lasso]] and [[Ridge regression|Ridge]], and so on; Grünwald and Roos (2020) give an introduction including modern developments.<ref>{{Cite journal|last1=Grünwald|first1=Peter|last2=Roos|first2=Teemu|date=2020|title=Minimum Description Length Revisited|journal=International Journal of Mathematics for Industry|volume=11|issue=1|doi=10.1142/S2661335219300018|s2cid=201314867|doi-access=free|hdl=10138/317252|hdl-access=free}}</ref>

Rissanen started from the following idea: all statistical learning is about finding regularities in data, and the best hypothesis for describing the regularities in the data is also the one that is able to ''statistically'' compress the data the most. Like other statistical methods, MDL can be used to learn the parameters of a model from data. Usually, though, standard statistical methods assume that the general form of a model is fixed. MDL's main strength is that it can also be used for selecting the general form of a model together with its parameters. The quantity of interest (sometimes just a model, sometimes just parameters, sometimes both at the same time) is called a hypothesis.

The basic idea is then to consider the [[Lossless compression|(lossless)]] ''two-stage code'' that encodes data <math>D</math> with length <math>L(D)</math> by first encoding a hypothesis <math>H</math> in the set of considered hypotheses <math>{\cal H}</math> and then coding <math>D</math> "with the help of" <math>H</math>; in the simplest context this just means encoding the deviations of the data from the predictions made by <math>H</math>:

<math display="block"> L(D) = \min_{H \in {\cal H}} \ (\ L(H) + L(D|H) \ ) </math>

The <math>H</math> achieving this minimum is then viewed as the best explanation of the data <math>D</math>.

As a simple example, take a regression problem: the data <math>D</math> could consist of a sequence of points <math>D = (x_1,y_1), \ldots, (x_n,y_n)</math>, and the set <math>{\cal H}</math> could be the set of all polynomials from <math>X</math> to <math>Y</math>. To describe a polynomial <math>H</math> of degree (say) <math>k</math>, one would first have to discretize the parameters to some precision; one would then have to describe this precision (a natural number); next, one would have to describe the degree <math>k</math> (another natural number); and in the final step, one would have to describe the <math>k+1</math> discretized parameters. The total length of this description is <math>L(H)</math>. One would then describe the points in <math>D</math> using some fixed code for the <math>x</math>-values, followed by a code for the <math>n</math> deviations <math>y_i - H(x_i)</math>.
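As an illustration (not part of the standard presentation), such a two-part code can be sketched concretely. The Python snippet below charges a fixed number of bits per polynomial coefficient for <math>L(H)</math> and encodes the deviations with a Gaussian code discretized to a fixed precision for <math>L(D|H)</math>; the specific choices (16 bits per coefficient, unit noise variance, a precision of 0.01, a maximum degree of 10, and the synthetic cubic data) are arbitrary assumptions made only for this example.

<syntaxhighlight lang="python">
import numpy as np

def hypothesis_bits(coeffs, bits_per_param=16):
    # Crude L(H): a simple code for the degree plus a fixed number of bits
    # per (conceptually discretized) coefficient.  Illustrative only.
    degree = len(coeffs) - 1
    return 2 * np.log2(degree + 2) + len(coeffs) * bits_per_param

def data_bits(x, y, coeffs, sigma=1.0, precision=0.01):
    # Crude L(D|H): encode each deviation y_i - H(x_i) with a Gaussian code,
    # discretized to `precision` so that codelengths are finite.
    r = y - np.polyval(coeffs, x)
    per_point = (0.5 * np.log2(2 * np.pi * sigma**2)
                 + r**2 / (2 * sigma**2 * np.log(2))
                 - np.log2(precision))
    return per_point.sum()

def mdl_degree(x, y, max_degree=10):
    # Two-part MDL: choose the degree minimizing L(H) + L(D|H).
    totals = []
    for k in range(max_degree + 1):
        coeffs = np.polyfit(x, y, k)
        totals.append(hypothesis_bits(coeffs) + data_bits(x, y, coeffs))
    return int(np.argmin(totals)), totals

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = 0.5 * x**3 - x + rng.normal(scale=1.0, size=x.size)
best, _ = mdl_degree(x, y)
print("degree selected by the two-part code:", best)  # typically selects 3 here
</syntaxhighlight>

A real two-part code would also encode the precision itself and optimize over it, as described above; the sketch keeps it fixed for brevity.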
In practice, one often (but not always) uses a ''probabilistic model''. For example, one associates each polynomial <math>H</math> with the corresponding conditional distribution expressing that, given <math>X</math>, <math>Y</math> is normally distributed with mean <math>H(X)</math> and some variance <math>\sigma^2</math>, which could either be fixed or added as a free parameter. The set of hypotheses <math>{\cal H}</math> then reduces to the assumption of a model <math>Y=H(X)+\epsilon</math> with additive Gaussian noise, where <math>H</math> is a polynomial (a model that is linear in the coefficients of <math>H</math>).

Furthermore, one is often not directly interested in specific parameter values, but just, for example, in the ''degree'' of the polynomial. In that case, one sets <math>{\cal H}</math> to be <math>{\cal H} = \{ {\cal H}_0, {\cal H}_1, \ldots \}</math>, where each <math>{\cal H}_j</math> represents the hypothesis that the data are best described by a polynomial of degree ''j''. One then codes data <math>D</math> given hypothesis <math>{\cal H}_j</math> using a ''one-part code'' designed such that, whenever ''some'' hypothesis <math>H \in {\cal H}_j</math> fits the data well, the codelength <math>L(D|{\cal H}_j)</math> is short. The design of such codes is called [[universal code (data compression)|universal coding]]. There are various types of universal codes one could use, often giving similar lengths for long data sequences but differing for short ones. The 'best' such codes (in the sense that they have a minimax optimality property) are the ''normalized maximum likelihood'' (NML) or ''Shtarkov'' codes. A quite useful class of codes is that of the ''Bayesian marginal likelihood codes''. For exponential families of distributions, when Jeffreys' prior is used and the parameter space is suitably restricted, these asymptotically coincide with the NML codes; this brings MDL theory in close contact with objective Bayes model selection, in which one also sometimes adopts Jeffreys' prior, albeit for different reasons. The MDL approach to model selection "gives a selection criterion formally identical to the [[Bayesian Information Criterion|BIC]] approach"<ref>{{cite book |doi=10.1007/978-0-387-84858-7_7 |chapter=Model Assessment and Selection |title=The Elements of Statistical Learning |series=Springer Series in Statistics |year=2009 |last1=Hastie |first1=Trevor |last2=Tibshirani |first2=Robert |last3=Friedman |first3=Jerome |pages=219–259 |isbn=978-0-387-84857-0 }}</ref> for large sample sizes.

===Example of statistical MDL learning===
{{Multiple issues|section=yes|
{{confusing section|date=March 2016}}
{{more citations needed section|date=March 2016}}
}}
A coin is flipped 1000 times, and the numbers of heads and tails are recorded. Consider two model classes:
*The first is a code that represents outcomes with a 0 for heads or a 1 for tails. This code represents the hypothesis that the coin is fair. The code length according to this code is always exactly 1000 bits.
*The second consists of all codes that are efficient for a coin with some specific bias, representing the hypothesis that the coin is not fair. Say that we observe 510 heads and 490 tails. Then the code length according to the best code in the second model class is shorter than 1000 bits.

For this reason, a naive statistical method might choose the second model as a better explanation for the data. However, an MDL approach would construct a single code for the second model class as a whole, rather than just using its best element. This code could be the normalized maximum likelihood code or a Bayesian code. If such a code is used, the total codelength based on the second model class is larger than 1000 bits. Therefore, the conclusion when following an MDL approach is inevitably that there is not enough evidence to support the hypothesis of the biased coin, even though the best element of the second model class provides a better fit to the data.
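To make the comparison concrete, the following sketch computes the relevant codelengths for 510 heads in 1000 flips. As the single code for the biased-coin class it uses the Bayesian marginal likelihood code with a uniform prior on the bias; the uniform prior is an illustrative choice (the NML code gives a similar length), and the exact numbers depend on it.

<syntaxhighlight lang="python">
import math

n, heads = 1000, 510

# Model class 1: the fair-coin code always uses exactly n bits.
fair_bits = n

# Best individual code in model class 2: optimal for bias 0.51.  This is not
# a legitimate single code for the class, because the bias was chosen after
# seeing the data.
p = heads / n
best_biased_bits = -(heads * math.log2(p) + (n - heads) * math.log2(1 - p))

# A single code for model class 2: the Bayesian marginal likelihood code with
# a uniform prior on the bias.  The marginal probability of the observed
# sequence is heads! * tails! / (n + 1)!.
log_marginal = (math.lgamma(heads + 1) + math.lgamma(n - heads + 1)
                - math.lgamma(n + 2))
bayes_bits = -log_marginal / math.log(2)

print(f"fair-coin code:        {fair_bits} bits")
print(f"best biased-coin code: {best_biased_bits:.1f} bits")  # about 999.7
print(f"Bayesian mixture code: {bayes_bits:.1f} bits")        # about 1004.4
</syntaxhighlight>

The single code for the biased-coin class comes out longer than 1000 bits, so with these codes MDL retains the fair-coin hypothesis, as stated above.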
===Statistical MDL notation===
Central to MDL theory is the [[one-to-one correspondence]] between code length [[function (mathematics)|functions]] and [[probability distribution]]s (this follows from the [[Kraft–McMillan theorem|Kraft–McMillan inequality]]). For any probability distribution <math>P</math>, it is possible to construct a code <math>C</math> such that the length (in bits) of <math>C(x)</math> is equal to <math>-\log_2 P(x)</math>; this code minimizes the expected code length. Conversely, given a code <math>C</math>, one can construct a probability distribution <math>P</math> such that the same holds. (Rounding issues are ignored here.) In other words, searching for an efficient code is equivalent to searching for a good probability distribution.
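As a small example (with an arbitrary toy distribution), Shannon code lengths of roughly <math>-\log_2 P(x)</math> bits always satisfy the Kraft inequality, which is what guarantees that a corresponding prefix code exists:

<syntaxhighlight lang="python">
import math

# A toy distribution over four symbols (arbitrary example values).
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Shannon code lengths: -log2 P(x) bits per symbol, rounded up.
lengths = {s: math.ceil(-math.log2(p)) for s, p in P.items()}
print(lengths)  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}

# Kraft inequality: the lengths satisfy sum(2^-l) <= 1, so a prefix code with
# these lengths exists.  Conversely, the lengths of any prefix code define a
# (sub-)probability distribution Q(x) = 2^(-L(x)).
assert sum(2 ** -l for l in lengths.values()) <= 1
</syntaxhighlight>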
===Limitations of statistical MDL learning===
The description language of statistical MDL is not computationally universal. Therefore, it cannot, even in principle, learn models of recursive natural processes.

===Related concepts===
Statistical MDL learning is very strongly connected to [[probability theory]] and [[statistics]] through the correspondence between codes and probability distributions mentioned above. This has led some researchers to view MDL as equivalent to [[Bayesian inference]]: the code lengths of model and data in MDL correspond to [[prior probability]] and [[marginal likelihood]], respectively, in the Bayesian framework.<ref name="mackay">{{cite book |last1=MacKay |first1=David J. C. |title=Information Theory, Inference and Learning Algorithms |date=2003 |publisher=Cambridge University Press |isbn=978-0-521-64298-9 }}{{page needed|date=May 2020}}</ref>

While Bayesian machinery is often useful in constructing efficient MDL codes, the MDL framework also accommodates codes that are not Bayesian. An example is the Shtarkov ''normalized maximum likelihood code'', which plays a central role in current MDL theory but has no equivalent in Bayesian inference. Furthermore, Rissanen stresses that we should make no assumptions about the ''true'' [[Probabilistic model|data-generating process]]: in practice, a model class is typically a simplification of reality and thus does not contain any code or probability distribution that is true in any objective sense.<ref name="cwi">{{cite news |url=http://www.mdl-research.net/jorma.rissanen/ |title=Homepage of Jorma Rissanen |last=Rissanen |first=Jorma |accessdate=2010-07-03 |archive-url=https://web.archive.org/web/20151210054325/http://www.mdl-research.net/jorma.rissanen/ |archive-date=2015-12-10 |url-status=dead }}</ref>{{self-published inline|date=May 2020}}<ref name="springer">{{cite book |url=https://www.springer.com/computer/foundations/book/978-0-387-36610-4 |title=Information and Complexity in Statistical Modeling |last=Rissanen |first=J. |year=2007 |publisher=Springer |accessdate=2010-07-03}}{{page needed|date=May 2020}}</ref> In the latter reference, Rissanen bases the mathematical underpinning of MDL on the [[Kolmogorov structure function]].

According to the MDL philosophy, Bayesian methods should be dismissed if they are based on unsafe [[Prior probability|priors]] that would lead to poor results. The priors that are acceptable from an MDL point of view also tend to be favored in so-called [[Objective Bayesian probability|objective Bayesian]] analysis; there, however, the motivation is usually different.<ref name="volker">{{cite journal |last1=Nannen |first1=Volker |title=A Short Introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length (MDL) |date=May 2010 |arxiv=1005.2364 |bibcode=2010arXiv1005.2364N }}</ref>
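The correspondence with Bayesian inference mentioned above can be written out explicitly (a standard identification, stated here for a prior <math>\pi</math> over a countable set of hypotheses): taking <math>L(H) = -\log_2 \pi(H)</math> and <math>L(D|H) = -\log_2 P(D|H)</math> gives
<math display="block"> \arg\min_{H \in {\cal H}} \big( L(H) + L(D|H) \big) = \arg\max_{H \in {\cal H}} \pi(H)\,P(D|H), </math>
so that minimizing a two-part codelength with these particular codes coincides with selecting the Bayesian [[Maximum a posteriori estimation|maximum a posteriori]] hypothesis.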