==Defining decision contours==

To categorize a new sample <math>x'</math> external to the training set <math>X</math>, the previous distance metric finds the transition probabilities between <math>x'</math> and all samples in <math>X</math>:

: <math>\tilde p(x_i )= p(x_i | x')= K \exp \Big (-\lambda f \big ( \big| x_i - x' \big| \big ) \Big )</math>

with <math>K</math> a normalization constant. Second, apply the last two lines of the three-line algorithm to get the cluster and conditional category probabilities:

: <math>\begin{align}
& \tilde p(c_i ) = p(c_i | x' ) = \sum_j p(c_i | x_j)p(x_j | x') =\sum_j p(c_i | x_j) \tilde p(x_j)\\
& p(y_i | c_j) = \sum_k p(y_i | x_k) p(c_j | x_k)p(x_k | x') / p(c_j | x' ) = \sum_k p(y_i | x_k) p(c_j | x_k) \tilde p(x_k) / \tilde p(c_j) \\
\end{align}</math>

Finally,

: <math>p(y_i | x')= \sum_j p(y_i | c_j) p(c_j | x') = \sum_j p(y_i | c_j) \tilde p(c_j)</math>

Parameter <math>\beta</math> must be kept under close supervision since, as it is increased from zero, increasing numbers of features in the category probability space snap into focus at certain critical thresholds.

===An example===
The following case examines clustering in a four-quadrant multiplier with random inputs <math>u, v</math> and two categories of output, <math>\pm 1</math>, generated by <math>y=\operatorname{sign}(uv)</math>. This function has two spatially separated clusters for each category and so demonstrates that the method can handle such distributions.

Twenty samples are taken, uniformly distributed on the square <math>[-1,1]^2</math>. The number of clusters used beyond the number of categories, two in this case, has little effect on performance, and the results are shown for two clusters using parameters <math>\lambda = 3,\, \beta = 2.5</math>. The distance function is <math>d_{i,j} = \big| x_i - x_j \big|^2</math> where <math>x_i = (u_i,v_i)^T</math>, while the conditional distribution <math>p(y|x)</math> is a 2 × 20 matrix

: <math>\begin{align}
& \Pr(y_i=1) = 1\text{ if }\operatorname{sign}(u_iv_i)=1 \\
& \Pr(y_i= -1) = 1\text{ if }\operatorname{sign}(u_iv_i)= -1
\end{align}</math>

and zero elsewhere.

The summation in line 2 of the algorithm incorporates only two values representing the training values of +1 or −1, but nevertheless works well. The figure shows the locations of the twenty samples, with '0' representing ''Y'' = 1 and 'x' representing ''Y'' = −1. The contour at the unity likelihood ratio level,

: <math>L= \frac{\Pr(1)}{\Pr(-1)} = 1,</math>

is shown as a new sample <math>x'</math> is scanned over the square. Theoretically the contour should align with the <math>u=0</math> and <math>v=0</math> coordinates, but for such small sample numbers it has instead followed the spurious clusterings of the sample points.

[[Image:BottleCateg 1.jpg|thumb|Decision contours]]
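The three equations above reduce to a handful of matrix products. The following NumPy sketch is illustrative rather than canonical: the function and array names are invented for this example, the squared-distance choice of <math>f</math> and the default <math>\lambda = 3</math> simply match the example above, and the soft memberships <code>p_c_given_x</code> are assumed to have been produced already by the three-line algorithm.

<syntaxhighlight lang="python">
import numpy as np

def classify(x_new, X, p_c_given_x, p_y_given_x, lam=3.0, f=lambda d: d ** 2):
    """Category probabilities p(y | x') for a sample x' outside the training set.

    X            : (n, 2) training samples
    p_c_given_x  : (n, n_c) soft memberships p(c_j | x_i), assumed already
                   computed by the three-line algorithm
    p_y_given_x  : (n_y, n) training conditionals p(y_i | x_k)
    """
    # transition probabilities p~(x_i) = K exp(-lambda f(|x_i - x'|)),
    # with K chosen so the weights sum to one
    w = np.exp(-lam * f(np.linalg.norm(X - x_new, axis=1)))
    p_x = w / w.sum()

    # p~(c_j) = sum_i p(c_j | x_i) p~(x_i)
    p_c = p_c_given_x.T @ p_x

    # p(y_i | c_j) = sum_k p(y_i | x_k) p(c_j | x_k) p~(x_k) / p~(c_j)
    p_y_given_c = (p_y_given_x * p_x) @ p_c_given_x / p_c

    # p(y_i | x') = sum_j p(y_i | c_j) p~(c_j)
    return p_y_given_c @ p_c
</syntaxhighlight>

Under the same assumptions, the four-quadrant experiment can be set up along these lines (the uniform memberships below are only a stand-in for the converged two-cluster solution):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# twenty samples uniform on [-1, 1]^2; category y = sign(u v)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
y = np.sign(X[:, 0] * X[:, 1])

# the 2 x 20 conditional distribution p(y | x):
# row 0 is Pr(y = +1 | x_k), row 1 is Pr(y = -1 | x_k)
p_y_given_x = np.vstack([(y == 1).astype(float), (y == -1).astype(float)])

# stand-in memberships; in practice p(c | x) comes from the three-line
# algorithm run with lambda = 3, beta = 2.5 and two clusters
p_c_given_x = np.full((20, 2), 0.5)

# scan a new sample over the square and record the likelihood ratio
# L = Pr(+1) / Pr(-1); the decision contour is the level set L = 1
grid = np.linspace(-1.0, 1.0, 101)
L = np.empty((101, 101))
for a, u in enumerate(grid):
    for b, v in enumerate(grid):
        p_plus, p_minus = classify(np.array([u, v]), X, p_c_given_x, p_y_given_x)
        L[a, b] = p_plus / max(p_minus, 1e-12)
</syntaxhighlight>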
===Neural network/fuzzy logic analogies===
This algorithm is somewhat analogous to a neural network with a single hidden layer. The internal nodes are represented by the clusters <math>c_j</math>, and the first and second layers of network weights are the conditional probabilities <math>p(c_j | x_i)</math> and <math>p(y_k | c_j)</math> respectively. However, unlike a standard neural network, the algorithm relies entirely on probabilities as inputs rather than the sample values themselves, while internal and output values are all conditional probability density distributions. Nonlinear functions are encapsulated in the distance metric <math>f(.)</math> (or ''influence functions/radial basis functions'') and transition probabilities instead of [[sigmoid function]]s.

The Blahut-Arimoto three-line algorithm converges rapidly, often in tens of iterations, and by varying <math>\beta</math>, <math>\lambda</math>, <math>f</math> and the cardinality of the clusters, various levels of focus on features can be achieved. The statistical soft clustering definition <math>p(c_i | x_j)</math> has some overlap with the verbal fuzzy membership concept of [[fuzzy logic]].
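Restated in code, the analogy treats the two conditional probability tables as weight matrices; the names <code>W1</code> and <code>W2</code> below are illustrative, and <code>p_x</code> is the vector of transition probabilities <math>\tilde p(x_i)</math> computed from the new sample:

<syntaxhighlight lang="python">
import numpy as np

# the two "layers" of weights in the analogy:
#   W1[j, i] = p(c_j | x_i)   input-to-hidden (cluster) weights
#   W2[k, j] = p(y_k | c_j)   hidden-to-output (category) weights
def forward(p_x, W1, W2):
    hidden = W1 @ p_x    # p~(c_j): "activations" of the hidden cluster nodes
    return W2 @ hidden   # p(y_k | x'): output category probabilities
</syntaxhighlight>

The nonlinearity sits upstream of this pass, in the exponential distance kernel that maps <math>x'</math> to <math>\tilde p(x_i)</math>, rather than in a per-node sigmoid.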