=== Clusters ===
In the following soft clustering example, the reference vector <math>Y \,</math> contains sample categories and the joint probability <math>p(X,Y) \,</math> is assumed known. A soft cluster <math>c_k \,</math> is defined by its probability distribution over the data samples <math>x_i: \,\,\, p( c_k |x_i)</math>. Tishby et al. presented<ref name=":0" /> the following iterative set of equations to determine the clusters, which is ultimately a generalization of the [[Blahut-Arimoto algorithm]], developed in [[rate distortion theory]]. The application of this type of algorithm in neural networks appears to originate in entropy arguments arising in the application of [[Boltzmann distribution|Gibbs Distributions]] in deterministic annealing.<ref name=":3">{{Cite book|publisher = ACM|date = 2000-01-01|location = New York, NY, USA|isbn = 978-1-58113-226-7|pages = 208–215|series = SIGIR '00|doi = 10.1145/345508.345578|first1 = Noam|last1 = Slonim|first2 = Naftali|last2 = Tishby| title=Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval | chapter=Document clustering using word clusters via the information bottleneck method |citeseerx = 10.1.1.21.3062|s2cid = 1373541}}</ref><ref>D. J. Miller, A. V. Rao, K. Rose, A. Gersho: "An Information-theoretic Learning Algorithm for Neural Network Classification". NIPS 1995: pp. 591–597</ref>

: <math>\begin{cases}
p(c|x)=Kp(c) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x) \,|| \, p(y| c)\Big ] \Big)\\
p(y| c)=\textstyle \sum_x p(y|x)p( c | x) p(x) \big / p(c) \\
p(c) = \textstyle \sum_x p(c | x) p(x) \\
\end{cases} </math>

The function of each line of the iteration expands as follows.

'''Line 1:''' This is a matrix-valued set of conditional probabilities

: <math>A_{i,j} = p(c_i | x_j )=Kp(c_i) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x_j) \,|| \, p(y| c_i)\Big ] \Big)</math>

The [[Kullback–Leibler divergence]] <math>D^{KL} \,</math> between the <math>Y \,</math> vectors generated by the sample data <math>x \,</math> and those generated by its reduced information proxy <math>c \,</math> is applied to assess the fidelity of the compressed vector with respect to the reference (or categorical) data <math>Y \,</math> in accordance with the fundamental bottleneck equation. <math>D^{KL}(a||b)\,</math> is the Kullback–Leibler divergence between distributions <math>a, b \,</math>

: <math>D^{KL} (a||b)= \sum_i p(a_i) \log \Big ( \frac{p(a_i)}{p(b_i)} \Big ) </math>

and <math>K \,</math> is a scalar normalization. The weighting by the negative exponential of the divergence means that prior cluster probabilities are downweighted in line 1 when the Kullback–Leibler divergence is large; thus successful clusters grow in probability while unsuccessful ones decay.

'''Line 2:''' Second matrix-valued set of conditional probabilities. By definition

: <math>\begin{align}
p(y_i|c_k) & = \sum_j p(y_i|x_j)p(x_j|c_k) \\
& =\sum_j p(y_i|x_j)p(x_j, c_k ) \big / p(c_k) \\
& =\sum_j p(y_i|x_j)p(c_k | x_j) p(x_j) \big / p(c_k) \\
\end{align}</math>

where the Bayes identities <math>p(a,b)=p(a|b)p(b)=p(b|a)p(a) \,</math> are used.

'''Line 3:''' This line finds the marginal distribution of the clusters <math>c \,</math>

: <math>p(c_i) =\sum_j p(c_i , x_j) = \sum_j p(c_i | x_j) p(x_j)</math>

This is a standard result.
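As an illustration of how the three update lines translate into matrix operations, the following is a minimal NumPy sketch of a single pass of the iteration. The function name <code>ib_iteration</code>, the array layout (rows indexing <math>y</math>, columns indexing <math>x</math> or <math>c</math>) and the small <code>eps</code> guard against division by zero are illustrative choices, not part of the original formulation.

<syntaxhighlight lang="python">
import numpy as np

def ib_iteration(p_y_given_x, p_x, p_c, p_y_given_c, beta):
    """One pass of the three update equations (a sketch; shapes are assumptions).

    p_y_given_x : (|Y|, |X|) column-stochastic matrix p(y|x)
    p_x         : (|X|,) marginal sample distribution
    p_c         : (|C|,) current cluster marginal
    p_y_given_c : (|Y|, |C|) current cluster-conditional distribution p(y|c)
    beta        : trade-off parameter
    """
    eps = 1e-12  # guard against log(0) and division by zero

    # Line 1: p(c|x) proportional to p(c) * exp(-beta * D_KL[p(y|x) || p(y|c)])
    # D[i, j] = KL divergence between column x_j of p(y|x) and column c_i of p(y|c)
    log_ratio = np.log((p_y_given_x[:, None, :] + eps) / (p_y_given_c[:, :, None] + eps))
    D = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=0)   # shape (|C|, |X|)
    p_c_given_x = p_c[:, None] * np.exp(-beta * D)
    p_c_given_x /= p_c_given_x.sum(axis=0, keepdims=True)      # K normalizes each column

    # Line 3: p(c) = sum_x p(c|x) p(x)
    p_c_new = p_c_given_x @ p_x

    # Line 2: p(y|c) = sum_x p(y|x) p(c|x) p(x) / p(c)
    p_y_given_c_new = (p_y_given_x * p_x) @ p_c_given_x.T / (p_c_new + eps)

    return p_c_given_x, p_c_new, p_y_given_c_new
</syntaxhighlight>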
Further inputs to the algorithm are the marginal sample distribution <math>p(x) \,</math>, which has already been determined by the dominant eigenvector of <math>P \,</math>, and the matrix-valued Kullback–Leibler divergence function

: <math>D_{i,j}^{KL}=D^{KL} \Big[ p(y|x_j) \,|| \, p(y| c_i)\Big ]</math>

derived from the sample spacings and transition probabilities. The matrix <math>p(y_i | c_j) \,</math> can be initialized randomly or with a reasonable guess, while the matrix <math>p(c_i | x_j) \,</math> needs no prior values. Although the algorithm converges, multiple local minima may exist that would need to be resolved.<ref name=":1">{{cite conference |url = https://papers.nips.cc/paper/1896-data-clustering-by-markovian-relaxation-and-the-information-bottleneck-method.pdf|title = Data clustering by Markovian Relaxation and the Information Bottleneck Method|conference = Neural Information Processing Systems (NIPS) 2000|last1 = Tishby|first1 = Naftali|author-link1 = Naftali Tishby|last2 = Slonim|first2 = N|pages = 640–646}}</ref>
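A possible driver loop, reusing the <code>ib_iteration</code> sketch above, might look as follows. The random column-stochastic initialization of <math>p(y|c)</math>, the uniform start for <math>p(c)</math>, and the stopping tolerance and iteration cap are assumptions made for illustration; as noted above, different random starts may settle into different local minima.

<syntaxhighlight lang="python">
import numpy as np

def ib_cluster(p_y_given_x, p_x, n_clusters, beta, tol=1e-6, max_iter=500, seed=0):
    """Iterate the update equations until p(c|x) stops changing (a sketch)."""
    rng = np.random.default_rng(seed)
    n_y = p_y_given_x.shape[0]

    # p(y|c): random column-stochastic initial guess; p(c): uniform start
    p_y_given_c = rng.random((n_y, n_clusters))
    p_y_given_c /= p_y_given_c.sum(axis=0, keepdims=True)
    p_c = np.full(n_clusters, 1.0 / n_clusters)

    p_c_given_x = None
    for _ in range(max_iter):
        new_p_c_given_x, p_c, p_y_given_c = ib_iteration(
            p_y_given_x, p_x, p_c, p_y_given_c, beta
        )
        # Stop once the assignment matrix p(c|x) has effectively converged
        if p_c_given_x is not None and np.max(np.abs(new_p_c_given_x - p_c_given_x)) < tol:
            p_c_given_x = new_p_c_given_x
            break
        p_c_given_x = new_p_c_given_x

    return p_c_given_x, p_c, p_y_given_c
</syntaxhighlight>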