== Use in machine learning ==

[[Machine learning]] techniques arise largely from statistics and also from information theory. In general, entropy is a measure of uncertainty, and the objective of machine learning is to minimize uncertainty.

[[Decision tree learning]] algorithms use relative entropy to determine the decision rules that govern the data at each node.<ref>{{Cite book|last1=Batra|first1=Mridula|last2=Agrawal|first2=Rashmi|title=Nature Inspired Computing|chapter=Comparative Analysis of Decision Tree Algorithms|date=2018|editor-last=Panigrahi|editor-first=Bijaya Ketan|editor2-last=Hoda|editor2-first=M. N.|editor3-last=Sharma|editor3-first=Vinod|editor4-last=Goel|editor4-first=Shivendra|chapter-url=https://link.springer.com/chapter/10.1007/978-981-10-6747-1_4|series=Advances in Intelligent Systems and Computing|volume=652|language=en|location=Singapore|publisher=Springer|pages=31–36|doi=10.1007/978-981-10-6747-1_4|isbn=978-981-10-6747-1|access-date=16 December 2021|archive-date=19 December 2022|archive-url=https://web.archive.org/web/20221219153239/https://link.springer.com/chapter/10.1007/978-981-10-6747-1_4|url-status=live}}</ref> The [[information gain in decision trees]] <math>IG(Y,X)</math>, which is equal to the difference between the entropy of <math>Y</math> and the conditional entropy of <math>Y</math> given <math>X</math>, quantifies the expected information, or the reduction in entropy, gained from additionally knowing the value of an attribute <math>X</math>. The information gain is used to identify which attributes of the dataset provide the most information and should therefore be used to split the nodes of the tree optimally.

[[Bayesian inference]] models often apply the [[principle of maximum entropy]] to obtain [[prior probability]] distributions.<ref>{{Cite journal|last=Jaynes|first=Edwin T.|date=September 1968|title=Prior Probabilities|url=https://ieeexplore.ieee.org/document/4082152|journal=IEEE Transactions on Systems Science and Cybernetics|volume=4|issue=3|pages=227–241|doi=10.1109/TSSC.1968.300117|issn=2168-2887|access-date=16 December 2021|archive-date=16 December 2021|archive-url=https://web.archive.org/web/20211216164659/https://ieeexplore.ieee.org/document/4082152|url-status=live}}</ref> The idea is that the distribution that best represents the current state of knowledge of a system is the one with the largest entropy, and it is therefore suitable to be the prior.

[[Classification in machine learning]] performed by [[logistic regression]] or [[artificial neural network]]s often employs a standard loss function, called [[cross-entropy]] loss, that minimizes the average cross-entropy between the ground-truth and predicted distributions.<ref>{{Cite book|last1=Rubinstein|first1=Reuven Y.|url=https://books.google.com/books?id=8KgACAAAQBAJ&dq=machine+learning+cross+entropy+loss+introduction&pg=PA1|title=The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning|last2=Kroese|first2=Dirk P.|date=2013-03-09|publisher=Springer Science & Business Media|isbn=978-1-4757-4321-0|language=en}}</ref> In general, cross-entropy is a measure of the difference between two probability distributions, similar to the KL divergence (also known as relative entropy).
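As an illustration of the quantities above, the following minimal Python sketch computes the Shannon entropy of a set of class labels, the information gain <math>IG(Y,X) = H(Y) - H(Y \mid X)</math> of a candidate split attribute, and the cross-entropy between a ground-truth and a predicted distribution. It uses only the standard library; the helper names (<code>entropy</code>, <code>information_gain</code>, <code>cross_entropy</code>) and the toy "windy"/"play" data are chosen here for exposition and do not come from the cited sources.

<syntaxhighlight lang="python">
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(Y), in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(xs, ys):
    """Information gain IG(Y, X) = H(Y) - H(Y | X) of attribute values xs
    with respect to class labels ys (two parallel sequences)."""
    total = len(ys)
    # Partition the labels according to the value of the attribute X.
    partitions = {}
    for x, y in zip(xs, ys):
        partitions.setdefault(x, []).append(y)
    # H(Y | X) is the entropy of each partition, weighted by its size.
    conditional = sum(len(part) / total * entropy(part)
                      for part in partitions.values())
    return entropy(ys) - conditional

def cross_entropy(p, q):
    """Cross-entropy H(p, q), in bits, between a ground-truth distribution p
    and a predicted distribution q (sequences of probabilities)."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Toy data: attribute "windy" versus class label "play" (illustrative only).
windy = [False, False, True, True, False, True]
play  = ["yes", "yes", "no", "no", "yes", "yes"]
print(information_gain(windy, play))        # ~0.459 bits gained by splitting on "windy"
print(cross_entropy([1.0, 0.0], [0.9, 0.1]))  # ~0.152 bits; 0 only if the prediction is exact
</syntaxhighlight>

A decision-tree learner would evaluate <code>information_gain</code> for every candidate attribute at a node and split on the attribute with the largest gain, i.e. the one yielding the greatest reduction in entropy.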