=== Probability theory and statistics ===

Marcus Hutter's universal artificial intelligence builds upon [[Solomonoff's theory of inductive inference|Solomonoff's mathematical formalization of the razor]] to calculate the expected value of an action. Various papers in scholarly journals derive formal versions of Occam's razor from probability theory, apply it in [[statistical inference]], and use it to derive criteria for penalizing complexity. Papers<ref name="ReferenceC">{{Cite journal |last1=Wallace |first1=C. S. |last2=Boulton |first2=D. M. |date=1968-08-01 |title=An Information Measure for Classification |url=https://academic.oup.com/comjnl/article-lookup/doi/10.1093/comjnl/11.2.185 |journal=The Computer Journal |language=en |volume=11 |issue=2 |pages=185–194 |doi=10.1093/comjnl/11.2.185 }}</ref><ref name="auto">{{Cite journal |last=Wallace |first=C. S. |date=1999-04-01 |title=Minimum Message Length and Kolmogorov Complexity |url=https://www.csse.monash.edu/~dld/Publications/1999/WallaceDowe1999aMinimumMessageLengthAndKolmogorovComplexity.pdf |journal=The Computer Journal |language=en |volume=42 |issue=4 |pages=270–283 |doi=10.1093/comjnl/42.4.270 }}</ref> have suggested a connection between Occam's razor and [[Kolmogorov complexity]].<ref name="Volker">{{Cite web |url=http://volker.nannen.com/pdf/short_introduction_to_model_selection.pdf |title=A short introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length |last=Nannen |first=Volker |url-status=live |archive-url=https://web.archive.org/web/20100602044851/http://volker.nannen.com/pdf/short_introduction_to_model_selection.pdf |archive-date=2 June 2010 |access-date=3 July 2010}}</ref>

One of the problems with the original formulation of the razor is that it applies only to models with the same explanatory power (i.e., it only tells us to prefer the simplest of equally good models). A more general form of the razor can be derived from Bayesian model comparison, which is based on [[Bayes factor]]s and can be used to compare models that do not fit the observations equally well. These methods can sometimes optimally balance the complexity and power of a model. Generally, the exact Occam factor is intractable, but approximations such as the [[Akaike information criterion]], the [[Bayesian information criterion]], [[variational Bayesian methods]], the [[false discovery rate]], and [[Laplace's method]] are used instead. Many [[artificial intelligence]] researchers now employ such techniques, for instance through work on [[Occam learning]] or, more generally, on the [[free energy principle]].
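As a toy illustration of such complexity penalties, the sketch below compares polynomial fits using the [[Bayesian information criterion]] as a tractable stand-in for the exact Occam factor. It is a minimal sketch, not a method taken from the cited literature; the data, the choice of model class, and all names are invented for the example.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative data: 50 noisy points generated by a straight line.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)

def bic(degree):
    """BIC = k*ln(n) - 2*ln(L) for a polynomial fit of the given degree,
    assuming Gaussian residuals; lower is better."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    n, k = x.size, degree + 2          # coefficients plus the noise variance
    sigma2 = np.mean(resid ** 2)
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return k * np.log(n) - 2 * log_lik

for d in (1, 2, 9):
    print(f"degree {d}: BIC = {bic(d):.1f}")
# The degree-9 polynomial fits the sample slightly better, but its larger
# penalty k*ln(n) means the simple line wins the comparison.
</syntaxhighlight>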
Statistical versions of Occam's razor have a more rigorous formulation than those produced by philosophical discussion. In particular, they must adopt a specific definition of the term ''simplicity'', and that definition can vary. For example, in the [[Andrey Kolmogorov|Kolmogorov]]–[[Gregory Chaitin|Chaitin]] [[minimum description length]] approach, the subject must pick a [[Turing machine]] whose basic operations are ''believed'' to represent "simplicity". However, one could always choose a Turing machine with a simple operation that happened to construct one's entire theory and would hence score highly under the razor. This has led to two opposing camps: one that believes Occam's razor is objective, and one that believes it is subjective.

==== Objective razor ====

The minimum instruction set of a [[universal Turing machine]] requires approximately the same length description across different formulations, and is small compared to the [[Kolmogorov complexity]] of most practical theories. [[Marcus Hutter]] has used this consistency to define a "natural" Turing machine of small size as the proper basis for excluding arbitrarily complex instruction sets in the formulation of razors.<ref>{{Cite web |url=http://www.hutter1.net/ait.htm |title=Algorithmic Information Theory |url-status=live |archive-url=https://web.archive.org/web/20071224043538/http://www.hutter1.net/ait.htm |archive-date=24 December 2007}}</ref> Describing the program for the universal machine as the "hypothesis", and the representation of the evidence as program data, it has been formally proven under [[Zermelo–Fraenkel set theory]] that "the sum of the log universal probability of the model plus the log of the probability of the data given the model should be minimized."<ref>{{Cite journal |last1=Vitanyi |first1=P.M.B. |last2=Ming Li |date=March 2000 |title=Minimum description length induction, Bayesianism, and Kolmogorov complexity |url=https://ieeexplore.ieee.org/document/825807 |journal=IEEE Transactions on Information Theory |volume=46 |issue=2 |pages=446–464 |doi=10.1109/18.825807 |arxiv=cs/9901014 }}</ref> Interpreting this as minimizing the total length of a two-part message that encodes the model followed by the data given the model yields the [[minimum message length]] (MML) principle.<ref name="ReferenceC" /><ref name="auto" />
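In symbols (the notation here is assumed for illustration rather than taken from the cited papers), with hypothesis <math>H</math> and data <math>D</math>, MML selects the hypothesis minimizing the total two-part message length,

<math display="block">\hat{H} = \underset{H}{\operatorname{arg\,min}} \, \bigl[ -\log_2 P(H) - \log_2 P(D \mid H) \bigr],</math>

where the first term is the length in bits of the encoding of the model and the second is the length of the data encoded with the model's help.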
One possible conclusion from combining the concepts of Kolmogorov complexity and Occam's razor is that an ideal data compressor would also be a scientific explanation/formulation generator. Some attempts have been made to re-derive known laws from considerations of simplicity or compressibility.<ref name="ReferenceB" /><ref>{{Cite journal |last=Standish |first=Russell K |year=2000 |title=Why Occam's Razor |journal=Foundations of Physics Letters |volume=17 |issue=3 |pages=255–266 |arxiv=physics/0001020 |bibcode=2004FoPhL..17..255S |doi=10.1023/B:FOPL.0000032475.18334.0e |s2cid=17143230 }}</ref> According to [[Jürgen Schmidhuber]], the appropriate mathematical theory of Occam's razor already exists, namely [[Ray Solomonoff|Solomonoff's]] [[Solomonoff's theory of inductive inference|theory of optimal inductive inference]]<ref>{{Cite journal |last=Solomonoff |first=Ray |author-link=Ray Solomonoff |year=1964 |title=A formal theory of inductive inference. Part I. |journal=Information and Control |volume=7 |issue=1 |pages=1–22 |doi=10.1016/s0019-9958(64)90223-2 |doi-access=free }}</ref> and its extensions.<ref>{{Cite book |title=Artificial General Intelligence |last=Schmidhuber |first=J. |author-link=Jürgen Schmidhuber |year=2006 |editor-last=Goertzel |editor-first=B. |editor-last2=Pennachin |editor-first2=C. |pages=177–200 |chapter=The New AI: General & Sound & Relevant for Physics |arxiv=cs.AI/0302012 }}</ref> See David L. Dowe's "Foreword re C. S. Wallace"<ref>{{Cite journal |last=Dowe |first=David L. |year=2008 |title=Foreword re C. S. Wallace |journal=Computer Journal |volume=51 |issue=5 |pages=523–560 |doi=10.1093/comjnl/bxm117 |s2cid=5387092 }}</ref> for the subtle distinctions between the [[algorithmic probability]] work of Solomonoff and the MML work of [[Chris Wallace (computer scientist)|Chris Wallace]], and see Dowe's "MML, hybrid Bayesian network graphical models, statistical consistency, invariance and uniqueness"<ref>David L. Dowe (2010): "MML, hybrid Bayesian network graphical models, statistical consistency, invariance and uniqueness. A formal theory of inductive inference." ''Handbook of the Philosophy of Science''{{spaced ndash}}(HPS Volume 7) Philosophy of Statistics, Elsevier 2010, pp. 901–982. https://web.archive.org/web/20140204001435/http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.709&rep=rep1&type=pdf</ref> both for such discussions and, in its section 4, for discussions of MML and Occam's razor. For a specific example of MML as Occam's razor in the problem of decision tree induction, see Dowe and Needham's "Message Length as an Effective Ockham's Razor in Decision Tree Induction".<ref>Scott Needham and David L. Dowe (2001): "Message Length as an Effective Ockham's Razor in Decision Tree Induction." Proc. 8th International Workshop on Artificial Intelligence and Statistics (AI+STATS 2001), Key West, Florida, U.S.A., January 2001, pp. 253–260. {{Cite web |url=http://www.csse.monash.edu.au/~dld/Publications/2001/Needham+Dowe2001_Ockham.pdf |title=2001 Ockham.pdf |url-status=live |archive-url=https://web.archive.org/web/20150923211645/http://www.csse.monash.edu.au/~dld/Publications/2001/Needham+Dowe2001_Ockham.pdf |archive-date=23 September 2015 |access-date=2 September 2015}}</ref>

==== Mathematical arguments against Occam's razor ====

{{Technical|date=February 2024|section}}

The [[No free lunch theorem|no free lunch]] (NFL) theorems for inductive inference prove that Occam's razor must rely on ultimately arbitrary assumptions concerning the prior probability distribution found in our world.<ref name="Adam2019">Adam, S., and Pardalos, P. (2019), [https://www.researchgate.net/profile/Stamatios-Aggelos-Alexandropoulos-2/publication/333007007_No_Free_Lunch_Theorem_A_Review/links/5e84f65792851c2f52742c85/No-Free-Lunch-Theorem-A-Review.pdf No-free lunch Theorem: A review], in "Approximation and Optimization", Springer, pp. 57–82</ref> Specifically, suppose one is given two inductive inference algorithms, A and B, where A is a [[Bayesian inference|Bayesian]] procedure based on some prior distribution motivated by Occam's razor (e.g., a prior favoring hypotheses with smaller [[Kolmogorov complexity]]), and B is the anti-Bayes procedure, which calculates what A will predict and then predicts the exact opposite. Then there are just as many actual priors (including priors different from the Occam's razor prior assumed by A) under which B outperforms A as there are priors under which A comes out on top. In particular, the NFL theorems show that the "Occam factors" Bayesian argument for Occam's razor must make ultimately arbitrary modeling assumptions.<ref name="WOLP95">Wolpert, D.H. (1995), On the Bayesian "Occam Factors" Argument for Occam's Razor, in "Computational Learning Theory and Natural Learning Systems: Selecting Good Models", MIT Press</ref>
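The anti-Bayes construction can be made concrete with a small simulation. The sketch below is an illustrative toy, not taken from the cited papers: it enumerates every binary environment of length 10 (a uniform prior over environments), lets predictor A guess whichever next bit keeps the observed sequence more compressible (zlib output length as a crude proxy for Kolmogorov complexity), and lets B negate A.

<syntaxhighlight lang="python">
import itertools
import zlib

def predict_A(prefix):
    """Occam-flavoured predictor: guess the next bit that keeps the
    sequence more compressible (zlib output length as a crude proxy)."""
    cost = lambda b: len(zlib.compress(bytes(prefix + [b])))
    return min((0, 1), key=cost)

def predict_B(prefix):
    return 1 - predict_A(prefix)       # always the exact opposite of A

n = 10
hits_A = hits_B = 0
for env in itertools.product((0, 1), repeat=n):   # uniform over environments
    for t in range(n):
        truth = env[t]
        hits_A += predict_A(list(env[:t])) == truth
        hits_B += predict_B(list(env[:t])) == truth

total = n * 2 ** n
print(f"A: {hits_A / total:.3f}  B: {hits_B / total:.3f}")   # both 0.500
</syntaxhighlight>

Because A's guess is a deterministic function of the observed prefix, exactly half of the environments extending any prefix agree with it, so A and B each achieve exactly 50% accuracy under the uniform prior; only a non-uniform (and, per the NFL theorems, ultimately arbitrary) prior can make the Occam-based predictor come out ahead.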