== Information theory of deep learning ==
The theory of the information bottleneck has recently been used to study deep neural networks (DNNs).<ref name=":4">{{cite arXiv |last1=Shwartz-Ziv |first1=Ravid |last2=Tishby |first2=Naftali |title=Opening the black box of deep neural networks via information |eprint=1703.00810|class=cs.LG |year=2017 }}</ref> Consider <math>X</math> and <math>Y</math> respectively as the input and output layers of a DNN, and let <math>T</math> be any hidden layer of the network. Shwartz-Ziv and Tishby proposed an information bottleneck view that expresses the tradeoff between the mutual information measures <math>I(X,T)</math> and <math>I(T,Y)</math>, which respectively quantify the amount of information that the hidden layer retains about the input and about the output. They conjectured that the training process of a DNN consists of two separate phases: 1) an initial fitting phase in which <math>I(T,Y)</math> increases, and 2) a subsequent compression phase in which <math>I(X,T)</math> decreases. Saxe et al.<ref>{{cite journal|last1=Saxe|first1=Andrew M|display-authors=etal|date=2018|title=On the information bottleneck theory of deep learning|url=https://openreview.net/pdf?id=ry_WPG-A-|journal=ICLR 2018 Conference Blind Submission|volume=2019|issue=12|page=124020|doi=10.1088/1742-5468/ab3985|bibcode=2019JSMTE..12.4020S|s2cid=49584497}}</ref> countered the claim of Shwartz-Ziv and Tishby,<ref name=":4" /> stating that this compression phenomenon is not universal in DNNs and depends on the particular activation function; in particular, they claimed that compression does not occur with ReLU activation functions. Shwartz-Ziv and Tishby disputed these claims, arguing that Saxe et al. had failed to observe compression because of weak estimates of the mutual information. Noshad et al. used a rate-optimal estimator of mutual information to explore this controversy, observing that the optimal hash-based estimator reveals the compression phenomenon in a wider range of networks with ReLU and max-pooling activations.<ref>{{cite arXiv |last1=Noshad |first1=Morteza |display-authors=etal |title=Scalable Mutual Information Estimation using Dependence Graphs |eprint=1801.09125 |date=2018|class=cs.IT }}</ref> Goldfeld et al., on the other hand, have argued that the observed compression is a result of geometric, rather than information-theoretic, phenomena,<ref>{{cite journal|last1=Goldfeld|first1=Ziv|display-authors=etal|date=2019|title=Estimating Information Flow in Deep Neural Networks|url=http://proceedings.mlr.press/v97/goldfeld19a.html|journal=ICML 2019|pages=2299–2308|arxiv=1810.05728}}</ref> a view that has also been shared by Geiger.<ref>{{cite journal|first1=Bernhard C.|last1=Geiger|title=On Information Plane Analyses of Neural Network Classifiers—A Review|journal=IEEE Transactions on Neural Networks and Learning Systems |date=2022|volume=33 |issue=12 |pages=7039–7051 |doi=10.1109/TNNLS.2021.3089037 |pmid=34191733 |arxiv=2003.09671|s2cid=214611728 }}</ref>
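The analysis above tracks each hidden layer in the "information plane" by the pair <math>(I(X,T), I(T,Y))</math> as training proceeds. The following is a minimal illustrative sketch, not the estimator used in any of the cited papers, of how one such point can be computed by binning hidden-layer activations; the bin count, the plug-in (histogram) mutual information estimate, and the treatment of each distinct input as its own symbol are simplifying assumptions made only for this example.

<syntaxhighlight lang="python">
import numpy as np

def discrete_mutual_information(a, b):
    """Plug-in estimate of I(A;B) in bits for two equal-length discrete sequences."""
    n = len(a)
    joint = {}
    for pair in zip(a, b):
        joint[pair] = joint.get(pair, 0) + 1
    pa, pb = {}, {}
    for (x, y), c in joint.items():
        pa[x] = pa.get(x, 0) + c
        pb[y] = pb.get(y, 0) + c
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * np.log2(p_xy / ((pa[x] / n) * (pb[y] / n)))
    return mi

def information_plane_point(x_ids, t_activations, y_labels, n_bins=30):
    """Return (I(X;T), I(T;Y)) after discretizing activations into n_bins equal-width bins."""
    edges = np.linspace(t_activations.min(), t_activations.max(), n_bins)
    binned = np.digitize(t_activations, edges)
    # Each sample's binned activation vector becomes one discrete symbol for T.
    t_ids = [tuple(row) for row in binned]
    return (discrete_mutual_information(x_ids, t_ids),
            discrete_mutual_information(t_ids, y_labels))

# Toy usage: 200 samples, each input treated as its own symbol, a 5-unit hidden
# layer, and binary labels; a real analysis tracks these two numbers over epochs.
rng = np.random.default_rng(0)
x_ids = np.arange(200)
t_activations = rng.normal(size=(200, 5))   # stand-in for recorded hidden-layer activations
y_labels = rng.integers(0, 2, size=200)
print(information_plane_point(x_ids, t_activations, y_labels))
</syntaxhighlight>

Repeating this computation for every layer after each training epoch yields the trajectories whose fitting and compression phases are debated in the works cited above.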