{{Short description|Technique in information theory}}
The '''information bottleneck method''' is a technique in [[information theory]] introduced by [[Naftali Tishby]], Fernando C. Pereira, and [[William Bialek]].<ref name=":0">{{cite conference |url=http://www.cs.huji.ac.il/labs/learning/Papers/allerton.pdf|title=The Information Bottleneck Method|conference=The 37th annual Allerton Conference on Communication, Control, and Computing|last1=Tishby|first1=Naftali|author-link1=Naftali Tishby|last2=Pereira|first2=Fernando C.|last3=Bialek|first3=William|author-link3=William Bialek|date=September 1999|pages=368–377}}</ref> It is designed to find the best tradeoff between [[accuracy]] and complexity ([[Data compression|compression]]) when [[random variable|summarizing]] (e.g. [[data clustering|clustering]]) a [[random variable]] '''X''', given a [[joint probability distribution]] '''p(X,Y)''' between '''X''' and an observed relevant variable '''Y'''. Its authors describe it as providing ''"a surprisingly rich framework for discussing a variety of problems in signal processing and learning"''.<ref name=":0"/> Applications include distributional clustering and [[dimension reduction]], and more recently it has been suggested as a theoretical foundation for [[deep learning]].

The method generalizes the classical notion of minimal [[sufficient statistics]] from [[parametric statistics]] to arbitrary distributions, not necessarily of exponential form. It does so by relaxing the sufficiency condition to capture some fraction of the [[mutual information]] with the relevant variable '''Y'''. Convexified and entropy-regularized formulations have also been proposed, which use symbolic continuation algorithms to avoid bifurcation-induced instabilities along the <math>\beta</math> trade-off.<ref>{{Cite arXiv |arxiv=2505.09239 |title=Stable and Convexified Information Bottleneck Optimization via Symbolic Continuation and Entropy-Regularized Trajectories |last=Alpay |first=Faruk |date=2025-05-14}}</ref>

The information bottleneck can also be viewed as a [[Rate–distortion theory|rate distortion]] problem, with a distortion function that measures how well '''Y''' is predicted from a compressed representation '''T''' compared to its direct prediction from '''X'''. This interpretation provides a general iterative algorithm for solving the information bottleneck trade-off and calculating the information curve from the distribution '''p(X,Y)'''.

Let the compressed representation be given by random variable <math>T</math>. The algorithm minimizes the following functional with respect to the conditional distribution <math>p(t|x)</math>:

: <math> \inf_{p(t|x)} \,\, \Big( I(X;T) - \beta I(T;Y) \Big),</math>

where <math>I(X;T)</math> and <math>I(T;Y)</math> are the mutual information of <math>X</math> and <math>T</math>, and of <math>T</math> and <math>Y</math>, respectively, and <math>\beta</math> is a [[Lagrange multiplier]].
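Minimizing this functional over <math>p(t|x)</math> yields a set of self-consistent equations,

: <math>p(t|x) = \frac{p(t)}{Z(x,\beta)} \exp\!\big(-\beta\, D_{\mathrm{KL}}\big[p(y|x)\,\|\,p(y|t)\big]\big), \qquad p(t) = \sum_x p(x)\,p(t|x), \qquad p(y|t) = \frac{1}{p(t)} \sum_x p(y|x)\,p(t|x)\,p(x),</math>

where <math>Z(x,\beta)</math> is a normalization factor. Iterating these three updates to convergence, in the style of the [[Blahut–Arimoto algorithm]], gives the iterative algorithm mentioned above. A minimal illustrative sketch for a finite joint distribution follows; the function name, random initialization, and fixed iteration count are implementation choices, not part of the original formulation:

<syntaxhighlight lang="python">
import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0, eps=1e-12):
    """Iterate the information bottleneck self-consistent equations.

    p_xy: joint distribution p(x, y) as a (|X|, |Y|) array summing to 1.
    Returns the soft encoder p(t|x), marginal p(t), and decoder p(y|t).
    """
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                     # marginal p(x)
    p_y_given_x = p_xy / (p_x[:, None] + eps)  # conditional p(y|x)

    # Random soft assignment p(t|x), shape (|X|, |T|), rows sum to 1.
    p_t_given_x = rng.random((len(p_x), n_clusters))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # p(t) = sum_x p(x) p(t|x)
        p_t = p_x @ p_t_given_x
        # p(y|t) = (1/p(t)) sum_x p(y|x) p(t|x) p(x)
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_t /= p_t[:, None] + eps
        # KL divergence D[p(y|x) || p(y|t)] for every (x, t) pair.
        log_ratio = np.log(p_y_given_x[:, None, :] + eps) \
                  - np.log(p_y_given_t[None, :, :] + eps)
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # p(t|x) proportional to p(t) exp(-beta * KL), renormalized over t.
        p_t_given_x = p_t[None, :] * np.exp(-beta * kl)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True) + eps

    return p_t_given_x, p_t, p_y_given_t

# Toy joint distribution over |X| = 4 and |Y| = 2.
p_xy = np.array([[0.20, 0.05],
                 [0.15, 0.10],
                 [0.05, 0.20],
                 [0.10, 0.15]])
encoder, p_t, decoder = information_bottleneck(p_xy, n_clusters=2, beta=5.0)
</syntaxhighlight>

Small <math>\beta</math> drives the encoder toward a single cluster (maximal compression), while large <math>\beta</math> makes <math>p(t|x)</math> nearly deterministic; sweeping <math>\beta</math> traces out the information curve.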