== Information theory of deep learning ==
The theory of the information bottleneck has recently been used to study deep neural networks (DNNs).<ref name=":4">{{cite arXiv |last1=Shwartz-Ziv |first1=Ravid |last2=Tishby |first2=Naftali |title=Opening the black box of deep neural networks via information |eprint=1703.00810|class=cs.LG |year=2017 }}</ref> Consider <math>X</math> and <math>Y</math> respectively as the input and output layers of a DNN, and let <math>T</math> be any hidden layer of the network. Shwartz-Ziv and Tishby proposed an information bottleneck view that expresses the tradeoff between the mutual information measures <math>I(X,T)</math> and <math>I(T,Y)</math>, which respectively quantify the amount of information that the hidden layer retains about the input and about the output. They conjectured that the training process of a DNN consists of two separate phases: 1) an initial fitting phase in which <math>I(T,Y)</math> increases, and 2) a subsequent compression phase in which <math>I(X,T)</math> decreases. Saxe et al.<ref>{{cite journal|last1=Saxe|first1=Andrew M|display-authors=etal|date=2018|title=On the information bottleneck theory of deep learning|url=https://openreview.net/pdf?id=ry_WPG-A-|journal=ICLR 2018 Conference Blind Submission|volume=2019|issue=12|page=124020|doi=10.1088/1742-5468/ab3985|bibcode=2019JSMTE..12.4020S|s2cid=49584497}}</ref> countered the claim of Shwartz-Ziv and Tishby,<ref name=":4" /> stating that this compression phenomenon is not universal in DNNs and depends on the particular activation function; in particular, they claimed that compression does not occur with ReLU activation functions. Shwartz-Ziv and Tishby disputed these claims, arguing that Saxe et al. had failed to observe compression because of weak estimates of the mutual information. Noshad et al. used a rate-optimal estimator of mutual information to explore this controversy, observing that the optimal hash-based estimator reveals the compression phenomenon in a wider range of networks with ReLU and max-pooling activations.<ref>{{cite arXiv |last1=Noshad |first1=Morteza |display-authors=etal |title=Scalable Mutual Information Estimation using Dependence Graphs |eprint=1801.09125 |date=2018|class=cs.IT }}</ref> Goldfeld et al., on the other hand, have argued that the observed compression is a result of geometric, rather than information-theoretic, phenomena,<ref>{{cite journal|last1=Goldfeld|first1=Ziv|display-authors=etal|date=2019|title=Estimating Information Flow in Deep Neural Networks|url=http://proceedings.mlr.press/v97/goldfeld19a.html|journal=ICML 2019|pages=2299–2308|arxiv=1810.05728}}</ref> a view that has also been shared by Geiger.<ref>{{cite journal|first1=Bernhard C.|last1=Geiger|title=On Information Plane Analyses of Neural Network Classifiers—A Review|journal=IEEE Transactions on Neural Networks and Learning Systems |date=2022|volume=33 |issue=12 |pages=7039–7051 |doi=10.1109/TNNLS.2021.3089037 |pmid=34191733 |arxiv=2003.09671|s2cid=214611728 }}</ref>
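The analysis above tracks each hidden layer in the "information plane" by the pair <math>(I(X,T), I(T,Y))</math> as training proceeds. The following is a minimal illustrative sketch, not the estimator used in any of the cited papers, of how one such point can be computed by binning hidden-layer activations; the bin count, the plug-in (histogram) mutual information estimate, and the treatment of each distinct input as its own symbol are simplifying assumptions made only for this example.

<syntaxhighlight lang="python">
import numpy as np

def discrete_mutual_information(a, b):
    """Plug-in estimate of I(A;B) in bits for two equal-length discrete sequences."""
    n = len(a)
    joint = {}
    for pair in zip(a, b):
        joint[pair] = joint.get(pair, 0) + 1
    pa, pb = {}, {}
    for (x, y), c in joint.items():
        pa[x] = pa.get(x, 0) + c
        pb[y] = pb.get(y, 0) + c
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * np.log2(p_xy / ((pa[x] / n) * (pb[y] / n)))
    return mi

def information_plane_point(x_ids, t_activations, y_labels, n_bins=30):
    """Return (I(X;T), I(T;Y)) after discretizing activations into n_bins equal-width bins."""
    edges = np.linspace(t_activations.min(), t_activations.max(), n_bins)
    binned = np.digitize(t_activations, edges)
    # Each sample's binned activation vector becomes one discrete symbol for T.
    t_ids = [tuple(row) for row in binned]
    return (discrete_mutual_information(x_ids, t_ids),
            discrete_mutual_information(t_ids, y_labels))

# Toy usage: 200 samples, each input treated as its own symbol, a 5-unit hidden
# layer, and binary labels; a real analysis tracks these two numbers over epochs.
rng = np.random.default_rng(0)
x_ids = np.arange(200)
t_activations = rng.normal(size=(200, 5))   # stand-in for recorded hidden-layer activations
y_labels = rng.integers(0, 2, size=200)
print(information_plane_point(x_ids, t_activations, y_labels))
</syntaxhighlight>

Repeating this computation for every layer after each training epoch yields the trajectories whose fitting and compression phases are debated in the works cited above.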