=== Clusters ===
In the following soft clustering example, the reference vector <math>Y \,</math> contains sample categories and the joint probability <math>p(X,Y) \,</math> is assumed known. A soft cluster <math>c_k \,</math> is defined by its probability distribution over the data samples <math>x_i: \,\,\, p( c_k |x_i)</math>. Tishby et al. presented<ref name=":0" /> the following iterative set of equations to determine the clusters, which is ultimately a generalization of the [[Blahut-Arimoto algorithm]], developed in [[rate distortion theory]]. The application of this type of algorithm in neural networks appears to originate in entropy arguments arising in the application of [[Boltzmann distribution|Gibbs Distributions]] in deterministic annealing.<ref name=":3">{{Cite book|publisher = ACM|date = 2000-01-01|location = New York, NY, USA|isbn = 978-1-58113-226-7|pages = 208–215|series = SIGIR '00|doi = 10.1145/345508.345578|first1 = Noam|last1 = Slonim|first2 = Naftali|last2 = Tishby| title=Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval | chapter=Document clustering using word clusters via the information bottleneck method |citeseerx = 10.1.1.21.3062|s2cid = 1373541}}</ref><ref>D. J. Miller, A. V. Rao, K. Rose, A. Gersho: "An Information-theoretic Learning Algorithm for Neural Network Classification". NIPS 1995: pp. 591–597</ref>

: <math>\begin{cases}
p(c|x)=Kp(c) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x) \,|| \, p(y| c)\Big ] \Big)\\
p(y| c)=\textstyle \sum_x p(y|x)p( c | x) p(x) \big / p(c) \\
p(c) = \textstyle \sum_x p(c | x) p(x) \\
\end{cases} </math>

The function of each line of the iteration expands as follows.

'''Line 1:''' This is a matrix-valued set of conditional probabilities

: <math>A_{i,j} = p(c_i | x_j )=Kp(c_i) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x_j) \,|| \, p(y| c_i)\Big ] \Big)</math>

The [[Kullback–Leibler divergence]] <math>D^{KL} \,</math> between the <math>Y \,</math> vectors generated by the sample data <math>x \,</math> and those generated by its reduced information proxy <math>c \,</math> is applied to assess the fidelity of the compressed vector with respect to the reference (or categorical) data <math>Y \,</math> in accordance with the fundamental bottleneck equation. <math>D^{KL}(a||b)\,</math> is the Kullback–Leibler divergence between distributions <math>a, b \,</math>

: <math>D^{KL} (a||b)= \sum_i p(a_i) \log \Big ( \frac{p(a_i)}{p(b_i)} \Big ) </math>

and <math>K \,</math> is a scalar normalization. The weighting by the negative exponential of the divergence means that prior cluster probabilities are downweighted in line 1 when the Kullback–Leibler divergence is large; thus successful clusters grow in probability while unsuccessful ones decay.

'''Line 2:''' Second matrix-valued set of conditional probabilities. By definition

: <math>\begin{align}
p(y_i|c_k) & = \sum_j p(y_i|x_j)p(x_j|c_k) \\
& =\sum_j p(y_i|x_j)p(x_j, c_k ) \big / p(c_k) \\
& =\sum_j p(y_i|x_j)p(c_k | x_j) p(x_j) \big / p(c_k) \\
\end{align}</math>

where the Bayes identities <math>p(a,b)=p(a|b)p(b)=p(b|a)p(a) \,</math> are used.

'''Line 3:''' This line finds the marginal distribution of the clusters <math>c \,</math>

: <math>p(c_i) =\sum_j p(c_i , x_j) = \sum_j p(c_i | x_j) p(x_j)</math>

This is a standard result.
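As an illustration of how the three update lines translate into matrix operations, the following is a minimal NumPy sketch of a single pass of the iteration. The function name <code>ib_iteration</code>, the array layout (rows indexing <math>y</math>, columns indexing <math>x</math> or <math>c</math>) and the small <code>eps</code> guard against division by zero are illustrative choices, not part of the original formulation.

<syntaxhighlight lang="python">
import numpy as np

def ib_iteration(p_y_given_x, p_x, p_c, p_y_given_c, beta):
    """One pass of the three update equations (a sketch; shapes are assumptions).

    p_y_given_x : (|Y|, |X|) column-stochastic matrix p(y|x)
    p_x         : (|X|,) marginal sample distribution
    p_c         : (|C|,) current cluster marginal
    p_y_given_c : (|Y|, |C|) current cluster-conditional distribution p(y|c)
    beta        : trade-off parameter
    """
    eps = 1e-12  # guard against log(0) and division by zero

    # Line 1: p(c|x) proportional to p(c) * exp(-beta * D_KL[p(y|x) || p(y|c)])
    # D[i, j] = KL divergence between column x_j of p(y|x) and column c_i of p(y|c)
    log_ratio = np.log((p_y_given_x[:, None, :] + eps) / (p_y_given_c[:, :, None] + eps))
    D = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=0)   # shape (|C|, |X|)
    p_c_given_x = p_c[:, None] * np.exp(-beta * D)
    p_c_given_x /= p_c_given_x.sum(axis=0, keepdims=True)      # K normalizes each column

    # Line 3: p(c) = sum_x p(c|x) p(x)
    p_c_new = p_c_given_x @ p_x

    # Line 2: p(y|c) = sum_x p(y|x) p(c|x) p(x) / p(c)
    p_y_given_c_new = (p_y_given_x * p_x) @ p_c_given_x.T / (p_c_new + eps)

    return p_c_given_x, p_c_new, p_y_given_c_new
</syntaxhighlight>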
Further inputs to the algorithm are the marginal sample distribution <math>p(x) \,</math>, which has already been determined by the dominant eigenvector of <math>P \,</math>, and the matrix-valued Kullback–Leibler divergence function

: <math>D_{i,j}^{KL}=D^{KL} \Big[ p(y|x_j) \,|| \, p(y| c_i)\Big ]</math>

derived from the sample spacings and transition probabilities. The matrix <math>p(y_i | c_j) \,</math> can be initialized randomly or with a reasonable guess, while the matrix <math>p(c_i | x_j) \,</math> needs no prior values. Although the algorithm converges, multiple local minima may exist that would need to be resolved.<ref name=":1">{{cite conference |url = https://papers.nips.cc/paper/1896-data-clustering-by-markovian-relaxation-and-the-information-bottleneck-method.pdf|title = Data clustering by Markovian Relaxation and the Information Bottleneck Method|conference = Neural Information Processing Systems (NIPS) 2000|last1 = Tishby|first1 = Naftali|author-link1 = Naftali Tishby|last2 = Slonim|first2 = N|pages = 640–646}}</ref>
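A possible driver loop, reusing the <code>ib_iteration</code> sketch above, might look as follows. The random column-stochastic initialization of <math>p(y|c)</math>, the uniform start for <math>p(c)</math>, and the stopping tolerance and iteration cap are assumptions made for illustration; as noted above, different random starts may settle into different local minima.

<syntaxhighlight lang="python">
import numpy as np

def ib_cluster(p_y_given_x, p_x, n_clusters, beta, tol=1e-6, max_iter=500, seed=0):
    """Iterate the update equations until p(c|x) stops changing (a sketch)."""
    rng = np.random.default_rng(seed)
    n_y = p_y_given_x.shape[0]

    # p(y|c): random column-stochastic initial guess; p(c): uniform start
    p_y_given_c = rng.random((n_y, n_clusters))
    p_y_given_c /= p_y_given_c.sum(axis=0, keepdims=True)
    p_c = np.full(n_clusters, 1.0 / n_clusters)

    p_c_given_x = None
    for _ in range(max_iter):
        new_p_c_given_x, p_c, p_y_given_c = ib_iteration(
            p_y_given_x, p_x, p_c, p_y_given_c, beta
        )
        # Stop once the assignment matrix p(c|x) has effectively converged
        if p_c_given_x is not None and np.max(np.abs(new_p_c_given_x - p_c_given_x)) < tol:
            p_c_given_x = new_p_c_given_x
            break
        p_c_given_x = new_p_c_given_x

    return p_c_given_x, p_c, p_y_given_c
</syntaxhighlight>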