== Gaussian bottleneck ==
The Gaussian bottleneck,<ref>{{Cite journal|url = http://www.jmlr.org/papers/volume6/chechik05a/chechik05a.pdf|title = Information Bottleneck for Gaussian Variables|last1 = Chechik|first1 = Gal|date = 1 January 2005|journal = Journal of Machine Learning Research|issue = 6|publication-date = 1 May 2005|pages = 165–188|last2 = Globerson|first2 = Amir|last3 = Tishby|first3 = Naftali|last4 = Weiss|first4 = Yair|editor-last = Dayan|editor-first = Peter}}</ref> namely applying the information bottleneck approach to Gaussian variables, leads to solutions related to [[canonical correlation]] analysis. Assume <math>X, Y \,</math> are jointly multivariate zero-mean normal vectors with covariances <math>\Sigma_{XX}, \,\, \Sigma_{YY}</math> and <math>T\,</math> is a compressed version of <math>X\,</math> that must maintain a given value of mutual information with <math>Y\,</math>. It can be shown that the optimum <math>T\,</math> is a normal vector consisting of linear combinations of the elements of <math>X, \,\, T=AX \,</math>, where the matrix <math>A \,</math> has orthogonal rows.

The projection matrix <math>A\,</math> in fact contains <math>M\,</math> rows selected from the weighted left [[eigenvectors]] of the [[singular value decomposition]] of the (generally asymmetric) matrix

: <math>\Omega = \Sigma_{X|Y} \Sigma_{XX}^{-1} = I - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T \Sigma_{XX}^{-1}.\,</math>

Define the singular value decomposition

: <math>\Omega = U\Lambda V^T\text{ with }\Lambda = \operatorname{Diag} \big ( \lambda_1 \le \lambda_2 \le \cdots \le \lambda_N \big ) \,</math>

and the critical values

: <math>\beta_i^C \underset {\lambda_i < 1}{=} (1 - \lambda_i )^{-1}. \, </math>

Then the number <math>M \,</math> of active eigenvectors in the projection, or order of approximation, is given by

: <math>\beta_{M-1}^C < \beta \le \beta_M^C,</math>

and we finally obtain

: <math>A=[ w_1 U_1 , \dots , w_M U_M ]^T,</math>

in which the weights are given by

: <math>w_i = \sqrt{\left(\beta (1- \lambda_i )-1\right)/\lambda_i r_i},</math>

where <math>r_i = U_i^T \Sigma_{XX} U_i.\,</math>
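The construction above can be summarised in a short numerical sketch, assuming the covariance blocks of the jointly Gaussian pair <math>(X, Y)</math> are known or have been estimated from data; the function name <code>gaussian_ib_projection</code> and its interface are hypothetical, chosen only for illustration.

<syntaxhighlight lang="python">
import numpy as np

def gaussian_ib_projection(Sxx, Syy, Sxy, beta):
    """Illustrative sketch of the Gaussian information bottleneck projection.

    Sxx, Syy, Sxy : covariance blocks of the jointly Gaussian (X, Y)
    beta          : trade-off parameter of the bottleneck functional
    Returns the projection matrix A with T = A X.
    """
    # Omega = Sigma_{X|Y} Sigma_{XX}^{-1} = I - Sxy Syy^{-1} Sxy^T Sxx^{-1}
    n = Sxx.shape[0]
    Omega = np.eye(n) - Sxy @ np.linalg.solve(Syy, Sxy.T) @ np.linalg.inv(Sxx)

    # Left singular vectors and singular values of the (generally asymmetric)
    # Omega, ordered so that lambda_1 <= lambda_2 <= ... <= lambda_N.
    U, lam, Vt = np.linalg.svd(Omega)
    order = np.argsort(lam)
    U, lam = U[:, order], lam[order]

    rows = []
    for i in range(n):
        # Critical value beta_i^C = (1 - lambda_i)^{-1}; the i-th vector only
        # contributes once beta exceeds this threshold (its weight is then real).
        if lam[i] >= 1 or beta <= 1.0 / (1.0 - lam[i]):
            continue
        r_i = U[:, i] @ Sxx @ U[:, i]
        w_i = np.sqrt((beta * (1.0 - lam[i]) - 1.0) / (lam[i] * r_i))
        rows.append(w_i * U[:, i])

    # A = [w_1 U_1, ..., w_M U_M]^T (empty when beta is below the first critical value)
    return np.array(rows) if rows else np.zeros((0, n))
</syntaxhighlight>

For a value of <math>\beta</math> below the first critical value the returned matrix is empty, reflecting the fact that no eigenvector is yet active.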
Applying the Gaussian information bottleneck to [[time series]] (processes) yields solutions related to optimal predictive coding. This procedure is formally equivalent to '''linear''' Slow Feature Analysis.<ref>{{Cite journal|title = Predictive Coding and the Slowness Principle: An Information-Theoretic Approach|journal = Neural Computation|date = 2007-12-17|issn = 0899-7667|pages = 1026–1041|volume = 20|issue = 4|doi = 10.1162/neco.2008.01-07-455|pmid = 18085988|first1 = Felix|last1 = Creutzig|first2 = Henning|last2 = Sprekeler|author1-link=Felix Creutzig|citeseerx = 10.1.1.169.6917|s2cid = 2138951}}</ref> Optimal temporal structures in linear dynamic systems can be revealed in the so-called past-future information bottleneck, an application of the bottleneck method to non-Gaussian sampled data.<ref>{{Cite journal|title = Past-future information bottleneck in dynamical systems|journal = Physical Review E|date = 2009-04-27|pages = 041925|volume = 79|issue = 4|doi = 10.1103/PhysRevE.79.041925|pmid = 19518274|first1 = Felix|last1 = Creutzig|first2 = Amir|last2 = Globerson|first3 = Naftali|last3 = Tishby|bibcode = 2009PhRvE..79d1925C}}</ref> The concept, as treated by Creutzig, Tishby et al., is not without complication, as two independent phases make up the exercise: firstly, estimation of the unknown parent probability densities from which the data samples are drawn, and secondly, the use of these densities within the information-theoretic framework of the bottleneck.

=== Density estimation ===
{{Main|Density estimation}}

Since the bottleneck method is framed in probabilistic rather than statistical terms, the underlying probability density at the sample points <math>X = \{x_i\} \,</math> must be estimated. This is a well-known problem with multiple solutions described by [[Bernard Silverman|Silverman]].<ref name=":2" /> In the present method, joint sample probabilities are found by use of a [[Stochastic matrix|Markov transition matrix]] method, and this has some mathematical synergy with the bottleneck method itself.

An arbitrary monotonically increasing function <math>f \,</math> of the separation between sample pairs defines the [[distance matrix]] <math>d_{i,j}=f \Big ( \Big| x_i - x_j \Big | \Big )</math>. Transition probabilities between sample pairs are then computed as <math>P_{i,j}=\exp (- \lambda d_{i,j} ) \,</math> for some <math>\lambda > 0 \,</math>. Treating samples as states, and a normalised version of <math>P \,</math> as a Markov state transition probability matrix, the vector of probabilities of the 'states' after <math>t \,</math> steps, conditioned on the initial state <math>p(0) \,</math>, is <math>p(t)=P^t p(0) \,</math>. The equilibrium probability vector <math>p(\infty ) \,</math> is given, in the usual way, by the dominant eigenvector of the matrix <math>P \,</math>, which is independent of the initialising vector <math>p(0) \,</math>. This Markov transition method establishes a probability at the sample points which is claimed to be proportional to the probability density there. Other interpretations of the use of the eigenvalues of the distance matrix <math>d \,</math> are discussed in Silverman's ''Density Estimation for Statistics and Data Analysis''.<ref name=":2">{{cite book|last = Silverman|first = Bernie|title = Density Estimation for Statistics and Data Analysis|series = Monographs on Statistics and Applied Probability|publisher = Chapman & Hall|year = 1986|isbn = 978-0412246203|author-link = Bernie Silverman|bibcode = 1986desd.book.....S|url-access = registration|url = https://archive.org/details/densityestimatio00silv_0}}</ref>
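The Markov-relaxation estimate can be illustrated with a brief sketch; the function name <code>markov_sample_probabilities</code>, the Euclidean choice of <math>f</math> and the default value of <math>\lambda</math> are assumptions made for the example rather than prescriptions of the method.

<syntaxhighlight lang="python">
import numpy as np

def markov_sample_probabilities(X, lam=1.0):
    """Sketch: estimate probabilities at the sample points by Markovian relaxation.

    X   : (n_samples, n_features) array of data points x_i
    lam : decay rate lambda > 0 of the transition kernel
    Returns the equilibrium probability vector p(infinity) over the samples.
    """
    # Distance matrix d_{i,j} = f(|x_i - x_j|); here f is taken to be the
    # identity applied to the Euclidean distance (an assumption of this sketch).
    diff = X[:, None, :] - X[None, :, :]
    d = np.linalg.norm(diff, axis=-1)

    # Transition weights P_{i,j} = exp(-lambda * d_{i,j}), column-normalised so
    # that P is a stochastic matrix and p(t) = P^t p(0) remains a probability vector.
    P = np.exp(-lam * d)
    P /= P.sum(axis=0, keepdims=True)

    # The equilibrium p(infinity) is the dominant eigenvector of P (eigenvalue 1),
    # independent of the initialising vector p(0).
    eigvals, eigvecs = np.linalg.eig(P)
    k = np.argmin(np.abs(eigvals - 1.0))
    p_inf = np.real(eigvecs[:, k])
    return p_inf / p_inf.sum()
</syntaxhighlight>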
=== Clusters ===
In the following soft clustering example, the reference vector <math>Y \,</math> contains sample categories and the joint probability <math>p(X,Y) \,</math> is assumed known. A soft cluster <math>c_k \,</math> is defined by its probability distribution over the data samples <math>x_i: \,\,\, p( c_k |x_i)</math>. Tishby et al. presented<ref name=":0" /> the following iterative set of equations to determine the clusters, which is ultimately a generalization of the [[Blahut-Arimoto algorithm]] developed in [[rate distortion theory]]. The application of this type of algorithm in neural networks appears to originate in entropy arguments arising in the application of [[Boltzmann distribution|Gibbs Distributions]] in deterministic annealing.<ref name=":3">{{Cite book|publisher = ACM|date = 2000-01-01|location = New York, NY, USA|isbn = 978-1-58113-226-7|pages = 208–215|series = SIGIR '00|doi = 10.1145/345508.345578|first1 = Noam|last1 = Slonim|first2 = Naftali|last2 = Tishby| title=Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval | chapter=Document clustering using word clusters via the information bottleneck method |citeseerx = 10.1.1.21.3062|s2cid = 1373541}}</ref><ref>D. J. Miller, A. V. Rao, K. Rose, A. Gersho: "An Information-theoretic Learning Algorithm for Neural Network Classification". NIPS 1995: pp. 591–597</ref>

: <math>\begin{cases} p(c|x)=Kp(c) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x) \,|| \, p(y| c)\Big ] \Big)\\ p(y| c)=\textstyle \sum_x p(y|x)p( c | x) p(x) \big / p(c) \\ p(c) = \textstyle \sum_x p(c | x) p(x) \\ \end{cases} </math>

The function of each line of the iteration expands as follows.

'''Line 1:''' This is a matrix-valued set of conditional probabilities

: <math>A_{i,j} = p(c_i | x_j )=Kp(c_i) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x_j) \,|| \, p(y| c_i)\Big ] \Big).</math>

The [[Kullback–Leibler divergence]] <math>D^{KL} \,</math> between the <math>Y \,</math> vectors generated by the sample data <math>x \,</math> and those generated by its reduced information proxy <math>c \,</math> is applied to assess the fidelity of the compressed vector with respect to the reference (or categorical) data <math>Y \,</math> in accordance with the fundamental bottleneck equation. <math>D^{KL}(a||b)\,</math> is the Kullback–Leibler divergence between distributions <math>a, b \,</math>,

: <math>D^{KL} (a||b)= \sum_i a_i \log \Big ( \frac{a_i}{b_i} \Big ), </math>

and <math>K \,</math> is a scalar normalization. The weighting by the negative exponential of the divergence means that prior cluster probabilities are downweighted in line 1 when the Kullback–Leibler divergence is large; thus successful clusters grow in probability while unsuccessful ones decay.

'''Line 2:''' This is a second matrix-valued set of conditional probabilities.
By definition,

: <math>\begin{align} p(y_i|c_k) & = \sum_j p(y_i|x_j)p(x_j|c_k) \\ & =\sum_j p(y_i|x_j)p(x_j, c_k ) \big / p(c_k) \\ & =\sum_j p(y_i|x_j)p(c_k | x_j) p(x_j) \big / p(c_k), \\ \end{align}</math>

where the Bayes identities <math>p(a,b)=p(a|b)p(b)=p(b|a)p(a) \,</math> are used.

'''Line 3:''' This line finds the marginal distribution of the clusters <math>c \,</math>,

: <math>p(c_i) =\sum_j p(c_i , x_j) = \sum_j p(c_i | x_j) p(x_j).</math>

This is a standard result.

Further inputs to the algorithm are the marginal sample distribution <math>p(x) \,</math>, which has already been determined by the dominant eigenvector of <math>P \,</math>, and the matrix-valued Kullback–Leibler divergence function

: <math>D_{i,j}^{KL}=D^{KL} \Big[ p(y|x_j) \,|| \, p(y| c_i)\Big ]</math>

derived from the sample spacings and transition probabilities.

The matrix <math>p(y_i | c_j) \,</math> can be initialized randomly or with a reasonable guess, while the matrix <math>p(c_i | x_j) \,</math> needs no prior values. Although the algorithm converges, multiple minima may exist that would need to be resolved.<ref name=":1">{{cite conference |url = https://papers.nips.cc/paper/1896-data-clustering-by-markovian-relaxation-and-the-information-bottleneck-method.pdf|title = Data clustering by Markovian Relaxation and the Information Bottleneck Method|conference = Neural Information Processing Systems (NIPS) 2000|last1 = Tishby|first1 = Naftali|author-link1 = Naftali Tishby|last2 = Slonim|first2 = N|pages = 640–646}}</ref>
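A minimal sketch of the three update lines in Python/NumPy is shown below, assuming that the conditional matrix <math>p(y|x)</math>, the marginal <math>p(x)</math>, the number of clusters and the trade-off parameter <math>\beta</math> are supplied; the function name <code>ib_clusters</code>, the random initialisation and the fixed iteration count are choices made for the example.

<syntaxhighlight lang="python">
import numpy as np

def ib_clusters(p_y_given_x, p_x, n_clusters, beta, n_iter=200, seed=0):
    """Sketch of the iterative information bottleneck clustering updates.

    p_y_given_x : (n_y, n_x) matrix whose column j is p(y | x_j)
    p_x         : (n_x,) marginal sample distribution p(x)
    beta        : bottleneck trade-off parameter
    Returns p(c|x), p(y|c) and p(c) after n_iter sweeps of the three update lines.
    """
    rng = np.random.default_rng(seed)
    n_y, n_x = p_y_given_x.shape
    eps = 1e-12

    # Random initial soft assignments p(c|x); each column sums to one.
    p_c_given_x = rng.random((n_clusters, n_x))
    p_c_given_x /= p_c_given_x.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # Line 3: cluster marginals p(c) = sum_x p(c|x) p(x)
        p_c = p_c_given_x @ p_x

        # Line 2: p(y|c) = sum_x p(y|x) p(c|x) p(x) / p(c)
        p_y_given_c = (p_y_given_x * p_x) @ p_c_given_x.T / (p_c + eps)

        # Line 1: D[c, x] = KL[ p(y|x) || p(y|c) ], then
        # p(c|x) proportional to p(c) exp(-beta * D[c, x]); K is the column normaliser.
        log_ratio = (np.log(p_y_given_x[:, None, :] + eps)
                     - np.log(p_y_given_c[:, :, None] + eps))
        D = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=0)
        p_c_given_x = p_c[:, None] * np.exp(-beta * D)
        p_c_given_x /= p_c_given_x.sum(axis=0, keepdims=True)

    return p_c_given_x, p_y_given_c, p_c
</syntaxhighlight>

In practice <math>\beta</math> is often increased gradually in the spirit of deterministic annealing, and the iteration is restarted from several initialisations, since, as noted above, multiple minima may exist.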