Editing Minimum description length (section)

==Other systems==
Rissanen's was not the first [[information theory|information-theoretic]] approach to learning; as early as 1968 Wallace and Boulton pioneered a related concept called [[minimum message length]] (MML). The difference between MDL and MML is a source of ongoing confusion. Superficially, the methods appear mostly equivalent, but there are some significant differences, especially in interpretation:
* MML is a fully subjective Bayesian approach: it starts from the idea that one represents one's beliefs about the data-generating process in the form of a prior distribution. MDL avoids assumptions about the data-generating process.
* Both methods make use of ''two-part codes'': the first part always represents the information that one is trying to learn, such as the index of a model class ([[model selection]]) or parameter values ([[parameter estimation]]); the second part is an encoding of the data given the information in the first part. The difference between the methods is that, in the MDL literature, it is advocated that unwanted parameters should be moved to the second part of the code, where they can be represented with the data by using a so-called [[one-part code]], which is often more efficient than a ''two-part code''. In the original description of MML, all parameters are encoded in the first part, so all parameters are learned.
* Within the MML framework, each parameter is stated to exactly the precision which results in the optimal overall message length: the preceding example might arise if some parameter was originally considered "possibly useful" to a model but was subsequently found to be unable to help to explain the data (such a parameter will be assigned a code length corresponding to the (Bayesian) prior probability that the parameter would be found to be unhelpful).  In the MDL framework, the focus is more on comparing model classes than models, and it is more natural to approach the same question by comparing the class of models that explicitly include such a parameter against some other class that doesn't.  The difference lies in the machinery applied to reach the same conclusion.