==Defining decision contours==

To categorize a new sample <math>x'</math> external to the training set <math>X</math>, the previous distance metric finds the transition probabilities between <math>x'</math> and all samples in <math>X</math>:

: <math>\tilde p(x_i )= p(x_i | x')= K \exp \Big (-\lambda f \big ( \big| x_i - x' \big| \big ) \Big )</math>

with <math>K</math> a normalization constant. Second, apply the last two lines of the three-line algorithm to get the cluster and conditional category probabilities:

: <math>\begin{align}
& \tilde p(c_i ) = p(c_i | x' ) = \sum_j p(c_i | x_j)p(x_j | x') =\sum_j p(c_i | x_j) \tilde p(x_j)\\
& p(y_i | c_j) = \sum_k p(y_i | x_k) p(c_j | x_k)p(x_k | x') / p(c_j | x' ) = \sum_k p(y_i | x_k) p(c_j | x_k) \tilde p(x_k) / \tilde p(c_j) \\
\end{align}</math>

Finally,

: <math>p(y_i | x')= \sum_j p(y_i | c_j) p(c_j | x') = \sum_j p(y_i | c_j) \tilde p(c_j)</math>

Parameter <math>\beta</math> must be kept under close supervision since, as it is increased from zero, increasing numbers of features in the category probability space snap into focus at certain critical thresholds.

===An example===
The following case examines clustering in a four-quadrant multiplier with random inputs <math>u, v</math> and two categories of output, <math>\pm 1</math>, generated by <math>y=\operatorname{sign}(uv)</math>. This function has two spatially separated clusters for each category and so demonstrates that the method can handle such distributions.

Twenty samples are taken, uniformly distributed on the square <math>[-1,1]^2</math>. The number of clusters used beyond the number of categories, two in this case, has little effect on performance, and the results are shown for two clusters using parameters <math>\lambda = 3,\, \beta = 2.5</math>. The distance function is <math>d_{i,j} = \big| x_i - x_j \big|^2</math> where <math>x_i = (u_i,v_i)^T</math>, while the conditional distribution <math>p(y|x)</math> is a 2 × 20 matrix

: <math>\begin{align}
& \Pr(y_i=1) = 1\text{ if }\operatorname{sign}(u_iv_i)=1 \\
& \Pr(y_i= -1) = 1\text{ if }\operatorname{sign}(u_iv_i)= -1
\end{align}</math>

and zero elsewhere.

The summation in line 2 of the algorithm incorporates only two values representing the training values of +1 or −1, but nevertheless works well. The figure shows the locations of the twenty samples, with '0' representing ''Y'' = 1 and 'x' representing ''Y'' = −1. The contour at the unity likelihood ratio level,

: <math>L= \frac{\Pr(1)}{\Pr(-1)} = 1,</math>

is shown as a new sample <math>x'</math> is scanned over the square. Theoretically the contour should align with the <math>u=0</math> and <math>v=0</math> coordinates, but for such small sample numbers it has instead followed the spurious clusterings of the sample points.

[[Image:BottleCateg 1.jpg|thumb|Decision contours]]
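The three equations above reduce to a handful of matrix products. The following NumPy sketch is illustrative rather than canonical: the function and array names are invented for this example, the squared-distance choice of <math>f</math> and the default <math>\lambda = 3</math> simply match the example above, and the soft memberships <code>p_c_given_x</code> are assumed to have been produced already by the three-line algorithm.

<syntaxhighlight lang="python">
import numpy as np

def classify(x_new, X, p_c_given_x, p_y_given_x, lam=3.0, f=lambda d: d ** 2):
    """Category probabilities p(y | x') for a sample x' outside the training set.

    X            : (n, 2) training samples
    p_c_given_x  : (n, n_c) soft memberships p(c_j | x_i), assumed already
                   computed by the three-line algorithm
    p_y_given_x  : (n_y, n) training conditionals p(y_i | x_k)
    """
    # transition probabilities p~(x_i) = K exp(-lambda f(|x_i - x'|)),
    # with K chosen so the weights sum to one
    w = np.exp(-lam * f(np.linalg.norm(X - x_new, axis=1)))
    p_x = w / w.sum()

    # p~(c_j) = sum_i p(c_j | x_i) p~(x_i)
    p_c = p_c_given_x.T @ p_x

    # p(y_i | c_j) = sum_k p(y_i | x_k) p(c_j | x_k) p~(x_k) / p~(c_j)
    p_y_given_c = (p_y_given_x * p_x) @ p_c_given_x / p_c

    # p(y_i | x') = sum_j p(y_i | c_j) p~(c_j)
    return p_y_given_c @ p_c
</syntaxhighlight>

Under the same assumptions, the four-quadrant experiment can be set up along these lines (the uniform memberships below are only a stand-in for the converged two-cluster solution):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# twenty samples uniform on [-1, 1]^2; category y = sign(u v)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
y = np.sign(X[:, 0] * X[:, 1])

# the 2 x 20 conditional distribution p(y | x):
# row 0 is Pr(y = +1 | x_k), row 1 is Pr(y = -1 | x_k)
p_y_given_x = np.vstack([(y == 1).astype(float), (y == -1).astype(float)])

# stand-in memberships; in practice p(c | x) comes from the three-line
# algorithm run with lambda = 3, beta = 2.5 and two clusters
p_c_given_x = np.full((20, 2), 0.5)

# scan a new sample over the square and record the likelihood ratio
# L = Pr(+1) / Pr(-1); the decision contour is the level set L = 1
grid = np.linspace(-1.0, 1.0, 101)
L = np.empty((101, 101))
for a, u in enumerate(grid):
    for b, v in enumerate(grid):
        p_plus, p_minus = classify(np.array([u, v]), X, p_c_given_x, p_y_given_x)
        L[a, b] = p_plus / max(p_minus, 1e-12)
</syntaxhighlight>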
===Neural network/fuzzy logic analogies===
This algorithm is somewhat analogous to a neural network with a single hidden layer. The internal nodes are represented by the clusters <math>c_j</math>, and the first and second layers of network weights are the conditional probabilities <math>p(c_j | x_i)</math> and <math>p(y_k | c_j)</math> respectively. However, unlike a standard neural network, the algorithm relies entirely on probabilities as inputs rather than the sample values themselves, while internal and output values are all conditional probability density distributions. Nonlinear functions are encapsulated in the distance metric <math>f(.)</math> (or ''influence functions/radial basis functions'') and transition probabilities instead of [[sigmoid function]]s.

The Blahut-Arimoto three-line algorithm converges rapidly, often in tens of iterations, and by varying <math>\beta</math>, <math>\lambda</math>, <math>f</math> and the cardinality of the clusters, various levels of focus on features can be achieved. The statistical soft clustering definition <math>p(c_i | x_j)</math> has some overlap with the verbal fuzzy membership concept of [[fuzzy logic]].
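Restated in code, the analogy treats the two conditional probability tables as weight matrices; the names <code>W1</code> and <code>W2</code> below are illustrative, and <code>p_x</code> is the vector of transition probabilities <math>\tilde p(x_i)</math> computed from the new sample:

<syntaxhighlight lang="python">
import numpy as np

# the two "layers" of weights in the analogy:
#   W1[j, i] = p(c_j | x_i)   input-to-hidden (cluster) weights
#   W2[k, j] = p(y_k | c_j)   hidden-to-output (category) weights
def forward(p_x, W1, W2):
    hidden = W1 @ p_x    # p~(c_j): "activations" of the hidden cluster nodes
    return W2 @ hidden   # p(y_k | x'): output category probabilities
</syntaxhighlight>

The nonlinearity sits upstream of this pass, in the exponential distance kernel that maps <math>x'</math> to <math>\tilde p(x_i)</math>, rather than in a per-node sigmoid.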