== Gaussian bottleneck ==
The Gaussian bottleneck,<ref>{{Cite journal|url = http://www.jmlr.org/papers/volume6/chechik05a/chechik05a.pdf|title = Information Bottleneck for Gaussian Variables|last1 = Chechik|first1 = Gal|date = 1 January 2005|journal = Journal of Machine Learning Research|issue = 6|publication-date = 1 May 2005|pages = 165–188|last2 = Globerson|first2 = Amir|last3 = Tishby|first3 = Naftali|last4 = Weiss|first4 = Yair|editor-last = Dayan|editor-first = Peter}}</ref> namely applying the information bottleneck approach to Gaussian variables, leads to solutions related to [[canonical correlation]] analysis. Assume <math>X, Y \,</math> are jointly multivariate zero-mean normal vectors with covariances <math>\Sigma_{XX}, \,\, \Sigma_{YY}</math> and <math>T\,</math> is a compressed version of <math>X\,</math> that must maintain a given value of mutual information with <math>Y\,</math>. It can be shown that the optimum <math>T\,</math> is a normal vector consisting of linear combinations of the elements of <math>X, \,\, T=AX \,</math>, where the matrix <math>A \,</math> has orthogonal rows.

The projection matrix <math>A\,</math> in fact contains <math>M\,</math> rows selected from the weighted left [[eigenvectors]] of the [[singular value decomposition]] of the (generally asymmetric) matrix

: <math>\Omega = \Sigma_{X|Y} \Sigma_{XX}^{-1} = I - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T \Sigma_{XX}^{-1}.\,</math>

Define the singular value decomposition

: <math>\Omega = U\Lambda V^T\text{ with }\Lambda = \operatorname{Diag} \big ( \lambda_1 \le \lambda_2 \le \cdots \le \lambda_N \big ) \,</math>

and the critical values

: <math>\beta_i^C \underset {\lambda_i < 1}{=} (1 - \lambda_i )^{-1}. \, </math>

Then the number <math>M \,</math> of active eigenvectors in the projection, or order of approximation, is given by

: <math>\beta_{M-1}^C < \beta \le \beta_M^C,</math>

and we finally obtain

: <math>A=[ w_1 U_1 , \dots , w_M U_M ]^T,</math>

in which the weights are given by

: <math>w_i = \sqrt{\left(\beta (1- \lambda_i )-1\right)/\lambda_i r_i},</math>

where <math>r_i = U_i^T \Sigma_{XX} U_i.\,</math>
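The construction above can be summarised in a short numerical sketch, assuming the covariance blocks of the jointly Gaussian pair <math>(X, Y)</math> are known or have been estimated from data; the function name <code>gaussian_ib_projection</code> and its interface are hypothetical, chosen only for illustration.

<syntaxhighlight lang="python">
import numpy as np

def gaussian_ib_projection(Sxx, Syy, Sxy, beta):
    """Illustrative sketch of the Gaussian information bottleneck projection.

    Sxx, Syy, Sxy : covariance blocks of the jointly Gaussian (X, Y)
    beta          : trade-off parameter of the bottleneck functional
    Returns the projection matrix A with T = A X.
    """
    # Omega = Sigma_{X|Y} Sigma_{XX}^{-1} = I - Sxy Syy^{-1} Sxy^T Sxx^{-1}
    n = Sxx.shape[0]
    Omega = np.eye(n) - Sxy @ np.linalg.solve(Syy, Sxy.T) @ np.linalg.inv(Sxx)

    # Left singular vectors and singular values of the (generally asymmetric)
    # Omega, ordered so that lambda_1 <= lambda_2 <= ... <= lambda_N.
    U, lam, Vt = np.linalg.svd(Omega)
    order = np.argsort(lam)
    U, lam = U[:, order], lam[order]

    rows = []
    for i in range(n):
        # Critical value beta_i^C = (1 - lambda_i)^{-1}; the i-th vector only
        # contributes once beta exceeds this threshold (its weight is then real).
        if lam[i] >= 1 or beta <= 1.0 / (1.0 - lam[i]):
            continue
        r_i = U[:, i] @ Sxx @ U[:, i]
        w_i = np.sqrt((beta * (1.0 - lam[i]) - 1.0) / (lam[i] * r_i))
        rows.append(w_i * U[:, i])

    # A = [w_1 U_1, ..., w_M U_M]^T (empty when beta is below the first critical value)
    return np.array(rows) if rows else np.zeros((0, n))
</syntaxhighlight>

For a value of <math>\beta</math> below the first critical value the returned matrix is empty, reflecting the fact that no eigenvector is yet active.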
Applying the Gaussian information bottleneck to [[time series]] (processes) yields solutions related to optimal predictive coding. This procedure is formally equivalent to '''linear''' Slow Feature Analysis.<ref>{{Cite journal|title = Predictive Coding and the Slowness Principle: An Information-Theoretic Approach|journal = Neural Computation|date = 2007-12-17|issn = 0899-7667|pages = 1026–1041|volume = 20|issue = 4|doi = 10.1162/neco.2008.01-07-455|pmid = 18085988|first1 = Felix|last1 = Creutzig|first2 = Henning|last2 = Sprekeler|author1-link=Felix Creutzig|citeseerx = 10.1.1.169.6917|s2cid = 2138951}}</ref> Optimal temporal structures in linear dynamic systems can be revealed in the so-called past-future information bottleneck, an application of the bottleneck method to non-Gaussian sampled data.<ref>{{Cite journal|title = Past-future information bottleneck in dynamical systems|journal = Physical Review E|date = 2009-04-27|pages = 041925|volume = 79|issue = 4|doi = 10.1103/PhysRevE.79.041925|pmid = 19518274|first1 = Felix|last1 = Creutzig|first2 = Amir|last2 = Globerson|first3 = Naftali|last3 = Tishby|bibcode = 2009PhRvE..79d1925C}}</ref> The concept, as treated by Creutzig, Tishby et al., is not without complication, as two independent phases make up the exercise: firstly, estimation of the unknown parent probability densities from which the data samples are drawn, and secondly, the use of these densities within the information-theoretic framework of the bottleneck.

=== Density estimation ===
{{Main|Density estimation}}

Since the bottleneck method is framed in probabilistic rather than statistical terms, the underlying probability density at the sample points <math>X = \{x_i\} \,</math> must be estimated. This is a well-known problem with multiple solutions described by [[Bernard Silverman|Silverman]].<ref name=":2" /> In the present method, joint sample probabilities are found by use of a [[Stochastic matrix|Markov transition matrix]] method, and this has some mathematical synergy with the bottleneck method itself.

An arbitrary monotonically increasing function <math>f \,</math> of the separation between sample pairs defines the [[distance matrix]] <math>d_{i,j}=f \Big ( \Big| x_i - x_j \Big | \Big )</math>. Transition probabilities between sample pairs are then computed as <math>P_{i,j}=\exp (- \lambda d_{i,j} ) \,</math> for some <math>\lambda > 0 \,</math>. Treating samples as states, and a normalised version of <math>P \,</math> as a Markov state transition probability matrix, the vector of probabilities of the 'states' after <math>t \,</math> steps, conditioned on the initial state <math>p(0) \,</math>, is <math>p(t)=P^t p(0) \,</math>. The equilibrium probability vector <math>p(\infty ) \,</math> is given, in the usual way, by the dominant eigenvector of the matrix <math>P \,</math>, which is independent of the initialising vector <math>p(0) \,</math>. This Markov transition method establishes a probability at the sample points which is claimed to be proportional to the probability density there. Other interpretations of the use of the eigenvalues of the distance matrix <math>d \,</math> are discussed in Silverman's ''Density Estimation for Statistics and Data Analysis''.<ref name=":2">{{cite book|last = Silverman|first = Bernie|title = Density Estimation for Statistics and Data Analysis|series = Monographs on Statistics and Applied Probability|publisher = Chapman & Hall|year = 1986|isbn = 978-0412246203|author-link = Bernie Silverman|bibcode = 1986desd.book.....S|url-access = registration|url = https://archive.org/details/densityestimatio00silv_0}}</ref>
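The Markov-relaxation estimate can be illustrated with a brief sketch; the function name <code>markov_sample_probabilities</code>, the Euclidean choice of <math>f</math> and the default value of <math>\lambda</math> are assumptions made for the example rather than prescriptions of the method.

<syntaxhighlight lang="python">
import numpy as np

def markov_sample_probabilities(X, lam=1.0):
    """Sketch: estimate probabilities at the sample points by Markovian relaxation.

    X   : (n_samples, n_features) array of data points x_i
    lam : decay rate lambda > 0 of the transition kernel
    Returns the equilibrium probability vector p(infinity) over the samples.
    """
    # Distance matrix d_{i,j} = f(|x_i - x_j|); here f is taken to be the
    # identity applied to the Euclidean distance (an assumption of this sketch).
    diff = X[:, None, :] - X[None, :, :]
    d = np.linalg.norm(diff, axis=-1)

    # Transition weights P_{i,j} = exp(-lambda * d_{i,j}), column-normalised so
    # that P is a stochastic matrix and p(t) = P^t p(0) remains a probability vector.
    P = np.exp(-lam * d)
    P /= P.sum(axis=0, keepdims=True)

    # The equilibrium p(infinity) is the dominant eigenvector of P (eigenvalue 1),
    # independent of the initialising vector p(0).
    eigvals, eigvecs = np.linalg.eig(P)
    k = np.argmin(np.abs(eigvals - 1.0))
    p_inf = np.real(eigvecs[:, k])
    return p_inf / p_inf.sum()
</syntaxhighlight>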
=== Clusters ===
In the following soft clustering example, the reference vector <math>Y \,</math> contains sample categories and the joint probability <math>p(X,Y) \,</math> is assumed known. A soft cluster <math>c_k \,</math> is defined by its probability distribution over the data samples <math>x_i: \,\,\, p( c_k |x_i)</math>. Tishby et al. presented<ref name=":0" /> the following iterative set of equations to determine the clusters, which is ultimately a generalization of the [[Blahut-Arimoto algorithm]] developed in [[rate distortion theory]]. The application of this type of algorithm in neural networks appears to originate in entropy arguments arising in the application of [[Boltzmann distribution|Gibbs Distributions]] in deterministic annealing.<ref name=":3">{{Cite book|publisher = ACM|date = 2000-01-01|location = New York, NY, USA|isbn = 978-1-58113-226-7|pages = 208–215|series = SIGIR '00|doi = 10.1145/345508.345578|first1 = Noam|last1 = Slonim|first2 = Naftali|last2 = Tishby| title=Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval | chapter=Document clustering using word clusters via the information bottleneck method |citeseerx = 10.1.1.21.3062|s2cid = 1373541}}</ref><ref>D. J. Miller, A. V. Rao, K. Rose, A. Gersho: "An Information-theoretic Learning Algorithm for Neural Network Classification". NIPS 1995: pp. 591–597</ref>

: <math>\begin{cases} p(c|x)=Kp(c) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x) \,|| \, p(y| c)\Big ] \Big)\\ p(y| c)=\textstyle \sum_x p(y|x)p( c | x) p(x) \big / p(c) \\ p(c) = \textstyle \sum_x p(c | x) p(x) \\ \end{cases} </math>

The function of each line of the iteration expands as follows.

'''Line 1:''' This is a matrix-valued set of conditional probabilities

: <math>A_{i,j} = p(c_i | x_j )=Kp(c_i) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x_j) \,|| \, p(y| c_i)\Big ] \Big).</math>

The [[Kullback–Leibler divergence]] <math>D^{KL} \,</math> between the <math>Y \,</math> vectors generated by the sample data <math>x \,</math> and those generated by its reduced information proxy <math>c \,</math> is applied to assess the fidelity of the compressed vector with respect to the reference (or categorical) data <math>Y \,</math> in accordance with the fundamental bottleneck equation. <math>D^{KL}(a||b)\,</math> is the Kullback–Leibler divergence between distributions <math>a, b \,</math>,

: <math>D^{KL} (a||b)= \sum_i a_i \log \Big ( \frac{a_i}{b_i} \Big ), </math>

and <math>K \,</math> is a scalar normalization. The weighting by the negative exponential of the divergence means that prior cluster probabilities are downweighted in line 1 when the Kullback–Leibler divergence is large; thus successful clusters grow in probability while unsuccessful ones decay.

'''Line 2:''' This is a second matrix-valued set of conditional probabilities.
By definition,

: <math>\begin{align} p(y_i|c_k) & = \sum_j p(y_i|x_j)p(x_j|c_k) \\ & =\sum_j p(y_i|x_j)p(x_j, c_k ) \big / p(c_k) \\ & =\sum_j p(y_i|x_j)p(c_k | x_j) p(x_j) \big / p(c_k), \\ \end{align}</math>

where the Bayes identities <math>p(a,b)=p(a|b)p(b)=p(b|a)p(a) \,</math> are used.

'''Line 3:''' This line finds the marginal distribution of the clusters <math>c \,</math>,

: <math>p(c_i) =\sum_j p(c_i , x_j) = \sum_j p(c_i | x_j) p(x_j).</math>

This is a standard result.

Further inputs to the algorithm are the marginal sample distribution <math>p(x) \,</math>, which has already been determined by the dominant eigenvector of <math>P \,</math>, and the matrix-valued Kullback–Leibler divergence function

: <math>D_{i,j}^{KL}=D^{KL} \Big[ p(y|x_j) \,|| \, p(y| c_i)\Big ]</math>

derived from the sample spacings and transition probabilities.

The matrix <math>p(y_i | c_j) \,</math> can be initialized randomly or with a reasonable guess, while the matrix <math>p(c_i | x_j) \,</math> needs no prior values. Although the algorithm converges, multiple minima may exist that would need to be resolved.<ref name=":1">{{cite conference |url = https://papers.nips.cc/paper/1896-data-clustering-by-markovian-relaxation-and-the-information-bottleneck-method.pdf|title = Data clustering by Markovian Relaxation and the Information Bottleneck Method|conference = Neural Information Processing Systems (NIPS) 2000|last1 = Tishby|first1 = Naftali|author-link1 = Naftali Tishby|last2 = Slonim|first2 = N|pages = 640–646}}</ref>
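A minimal sketch of the three update lines in Python/NumPy is shown below, assuming that the conditional matrix <math>p(y|x)</math>, the marginal <math>p(x)</math>, the number of clusters and the trade-off parameter <math>\beta</math> are supplied; the function name <code>ib_clusters</code>, the random initialisation and the fixed iteration count are choices made for the example.

<syntaxhighlight lang="python">
import numpy as np

def ib_clusters(p_y_given_x, p_x, n_clusters, beta, n_iter=200, seed=0):
    """Sketch of the iterative information bottleneck clustering updates.

    p_y_given_x : (n_y, n_x) matrix whose column j is p(y | x_j)
    p_x         : (n_x,) marginal sample distribution p(x)
    beta        : bottleneck trade-off parameter
    Returns p(c|x), p(y|c) and p(c) after n_iter sweeps of the three update lines.
    """
    rng = np.random.default_rng(seed)
    n_y, n_x = p_y_given_x.shape
    eps = 1e-12

    # Random initial soft assignments p(c|x); each column sums to one.
    p_c_given_x = rng.random((n_clusters, n_x))
    p_c_given_x /= p_c_given_x.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # Line 3: cluster marginals p(c) = sum_x p(c|x) p(x)
        p_c = p_c_given_x @ p_x

        # Line 2: p(y|c) = sum_x p(y|x) p(c|x) p(x) / p(c)
        p_y_given_c = (p_y_given_x * p_x) @ p_c_given_x.T / (p_c + eps)

        # Line 1: D[c, x] = KL[ p(y|x) || p(y|c) ], then
        # p(c|x) proportional to p(c) exp(-beta * D[c, x]); K is the column normaliser.
        log_ratio = (np.log(p_y_given_x[:, None, :] + eps)
                     - np.log(p_y_given_c[:, :, None] + eps))
        D = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=0)
        p_c_given_x = p_c[:, None] * np.exp(-beta * D)
        p_c_given_x /= p_c_given_x.sum(axis=0, keepdims=True)

    return p_c_given_x, p_y_given_c, p_c
</syntaxhighlight>

In practice <math>\beta</math> is often increased gradually in the spirit of deterministic annealing, and the iteration is restarted from several initialisations, since, as noted above, multiple minima may exist.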