==Training==
The units in the Boltzmann machine are divided into 'visible' units, V, and 'hidden' units, H. The visible units are those that receive information from the 'environment', i.e. the [[training set]] is a set of binary vectors over the set V. The distribution over the training set is denoted <math>P^{+}(V)</math>.

The distribution over global states converges as the Boltzmann machine reaches [[thermal equilibrium]]. We denote this distribution, after we [[Marginal distribution|marginalize]] it over the hidden units, as <math>P^{-}(V)</math>.

Our goal is to approximate the "real" distribution <math>P^{+}(V)</math> using the <math>P^{-}(V)</math> produced by the machine. The similarity of the two distributions is measured by the [[Kullback–Leibler divergence]], <math>G</math>:

:<math>G = \sum_{v}{P^{+}(v)\ln\left({\frac{P^{+}(v)}{P^{-}(v)}}\right)}</math>

where the sum is over all the possible states of <math>V</math>. <math>G</math> is a function of the weights, since they determine the energy of a state, and the energy determines <math>P^{-}(v)</math>, as given by the Boltzmann distribution. A [[gradient descent]] algorithm over <math>G</math> changes a given weight, <math>w_{ij}</math>, by subtracting the [[partial derivative]] of <math>G</math> with respect to the weight.

Boltzmann machine training involves two alternating phases. One is the "positive" phase, where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to <math>P^{+}</math>). The other is the "negative" phase, where the network is allowed to run freely, i.e. only the input nodes have their state determined by external data, while the output nodes are allowed to float. The gradient with respect to a given weight, <math>w_{ij}</math>, is given by the equation:<ref name=":0" />

:<math>\frac{\partial{G}}{\partial{w_{ij}}} = -\frac{1}{R}[p_{ij}^{+}-p_{ij}^{-}]</math>

where:

* <math>p_{ij}^{+}</math> is the probability that units ''i'' and ''j'' are both on when the machine is at equilibrium in the positive phase.
* <math>p_{ij}^{-}</math> is the probability that units ''i'' and ''j'' are both on when the machine is at equilibrium in the negative phase.
* <math>R</math> denotes the [[learning rate]].

This result follows from the fact that at [[thermal equilibrium]] the probability <math>P^{-}(s)</math> of any global state <math>s</math> when the network is free-running is given by the Boltzmann distribution.

This learning rule is biologically plausible because the only information needed to change a weight is "local" information. That is, the connection ([[synapse]], biologically) does not need information about anything other than the two neurons it connects. This is more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as [[backpropagation]].

The training of a Boltzmann machine does not use the [[expectation–maximization algorithm|EM algorithm]], which is heavily used in [[machine learning]]. Minimizing the [[Kullback–Leibler divergence|KL divergence]] is equivalent to maximizing the log-likelihood of the data, so the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete-data likelihood in the M-step.
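Since gradient descent subtracts this partial derivative, the resulting change in each weight follows the difference between the two co-activation statistics (written here only as a proportionality, leaving the step size unspecified):

:<math>\Delta w_{ij} \propto p_{ij}^{+} - p_{ij}^{-}</math>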
Training the biases is similar, but uses only single-node activity:

:<math>\frac{\partial{G}}{\partial{\theta_{i}}} = -\frac{1}{R}[p_{i}^{+}-p_{i}^{-}]</math>
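The procedure can be illustrated with a short, self-contained sketch of the two-phase update. The network size, training vectors, sampling schedule, and hyperparameters below (as well as the helper names <code>gibbs_step</code> and <code>sample_statistics</code>) are illustrative assumptions rather than part of the formulation above; pairwise and single-unit statistics are estimated by [[Gibbs sampling]] with the visible units clamped (positive phase) and with the whole network running freely (negative phase), and the weights and biases are then updated in the direction of the differences <math>p_{ij}^{+} - p_{ij}^{-}</math> and <math>p_{i}^{+} - p_{i}^{-}</math>.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(state, W, theta, clamped=None):
    """One Gibbs-sampling sweep; units listed in `clamped` keep their value."""
    for i in rng.permutation(len(state)):
        if clamped is not None and i in clamped:
            continue
        p_on = sigmoid(W[i] @ state + theta[i])
        state[i] = 1.0 if rng.random() < p_on else 0.0
    return state

def sample_statistics(W, theta, n_units, visible_idx, data=None,
                      n_samples=50, burn_in=20):
    """Estimate <s_i s_j> and <s_i>, clamped to `data` (positive phase)
    or free-running when `data` is None (negative phase)."""
    pair = np.zeros((n_units, n_units))
    single = np.zeros(n_units)
    for _ in range(n_samples):
        state = rng.integers(0, 2, n_units).astype(float)
        clamped = None
        if data is not None:                      # positive phase: clamp visibles
            state[visible_idx] = data[rng.integers(len(data))]
            clamped = set(visible_idx.tolist())
        for _ in range(burn_in):                  # approach thermal equilibrium
            gibbs_step(state, W, theta, clamped)
        pair += np.outer(state, state)
        single += state
    return pair / n_samples, single / n_samples

# Toy setup: 3 visible + 2 hidden units, illustrative training set and step size.
visible_idx = np.array([0, 1, 2])
n_units = 5
W = np.zeros((n_units, n_units))
theta = np.zeros(n_units)
data = np.array([[1, 1, 0], [0, 1, 1]], dtype=float)
lr = 0.1

for epoch in range(100):
    p_pos, s_pos = sample_statistics(W, theta, n_units, visible_idx, data)
    p_neg, s_neg = sample_statistics(W, theta, n_units, visible_idx, None)
    dW = lr * (p_pos - p_neg)        # follows the sign of p+_ij - p-_ij
    np.fill_diagonal(dW, 0.0)        # no self-connections
    W += dW                          # statistics are symmetric, so W stays symmetric
    theta += lr * (s_pos - s_neg)    # bias update from single-unit statistics
</syntaxhighlight>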