==Training==
The units in the Boltzmann machine are divided into 'visible' units, V, and 'hidden' units, H. The visible units are those that receive information from the 'environment', i.e. the [[training set]] is a set of binary vectors over the set V. The distribution over the training set is denoted <math>P^{+}(V)</math>.

The distribution over global states converges as the Boltzmann machine reaches [[thermal equilibrium]]. We denote this distribution, after we [[Marginal distribution|marginalize]] it over the hidden units, as <math>P^{-}(V)</math>.

Our goal is to approximate the "real" distribution <math>P^{+}(V)</math> using the <math>P^{-}(V)</math> produced by the machine. The similarity of the two distributions is measured by the [[Kullback–Leibler divergence]], <math>G</math>:

:<math>G = \sum_{v}{P^{+}(v)\ln\left({\frac{P^{+}(v)}{P^{-}(v)}}\right)}</math>

where the sum is over all the possible states of <math>V</math>. <math>G</math> is a function of the weights, since they determine the energy of a state, and the energy determines <math>P^{-}(v)</math>, as given by the Boltzmann distribution. A [[gradient descent]] algorithm over <math>G</math> changes a given weight, <math>w_{ij}</math>, by subtracting the [[partial derivative]] of <math>G</math> with respect to the weight.

Boltzmann machine training involves two alternating phases. One is the "positive" phase, where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to <math>P^{+}</math>). The other is the "negative" phase, where the network is allowed to run freely, i.e. only the input nodes have their state determined by external data, while the output nodes are allowed to float. The gradient with respect to a given weight, <math>w_{ij}</math>, is given by the equation:<ref name=":0" />

:<math>\frac{\partial{G}}{\partial{w_{ij}}} = -\frac{1}{R}[p_{ij}^{+}-p_{ij}^{-}]</math>

where:

* <math>p_{ij}^{+}</math> is the probability that units ''i'' and ''j'' are both on when the machine is at equilibrium in the positive phase.
* <math>p_{ij}^{-}</math> is the probability that units ''i'' and ''j'' are both on when the machine is at equilibrium in the negative phase.
* <math>R</math> denotes the [[learning rate]].

This result follows from the fact that at [[thermal equilibrium]] the probability <math>P^{-}(s)</math> of any global state <math>s</math> when the network is free-running is given by the Boltzmann distribution.

This learning rule is biologically plausible because the only information needed to change a weight is "local" information. That is, the connection ([[synapse]], biologically) does not need information about anything other than the two neurons it connects. This is more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as [[backpropagation]].

The training of a Boltzmann machine does not use the [[expectation–maximization algorithm|EM algorithm]], which is heavily used in [[machine learning]]. Minimizing the [[Kullback–Leibler divergence|KL divergence]] is equivalent to maximizing the log-likelihood of the data, so the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete-data likelihood in the M-step.
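Since gradient descent subtracts this partial derivative, the resulting change in each weight follows the difference between the two co-activation statistics (written here only as a proportionality, leaving the step size unspecified):

:<math>\Delta w_{ij} \propto p_{ij}^{+} - p_{ij}^{-}</math>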
Training the biases is similar, but uses only single-node activity:

:<math>\frac{\partial{G}}{\partial{\theta_{i}}} = -\frac{1}{R}[p_{i}^{+}-p_{i}^{-}]</math>
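The procedure can be illustrated with a short, self-contained sketch of the two-phase update. The network size, training vectors, sampling schedule, and hyperparameters below (as well as the helper names <code>gibbs_step</code> and <code>sample_statistics</code>) are illustrative assumptions rather than part of the formulation above; pairwise and single-unit statistics are estimated by [[Gibbs sampling]] with the visible units clamped (positive phase) and with the whole network running freely (negative phase), and the weights and biases are then updated in the direction of the differences <math>p_{ij}^{+} - p_{ij}^{-}</math> and <math>p_{i}^{+} - p_{i}^{-}</math>.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(state, W, theta, clamped=None):
    """One Gibbs-sampling sweep; units listed in `clamped` keep their value."""
    for i in rng.permutation(len(state)):
        if clamped is not None and i in clamped:
            continue
        p_on = sigmoid(W[i] @ state + theta[i])
        state[i] = 1.0 if rng.random() < p_on else 0.0
    return state

def sample_statistics(W, theta, n_units, visible_idx, data=None,
                      n_samples=50, burn_in=20):
    """Estimate <s_i s_j> and <s_i>, clamped to `data` (positive phase)
    or free-running when `data` is None (negative phase)."""
    pair = np.zeros((n_units, n_units))
    single = np.zeros(n_units)
    for _ in range(n_samples):
        state = rng.integers(0, 2, n_units).astype(float)
        clamped = None
        if data is not None:                      # positive phase: clamp visibles
            state[visible_idx] = data[rng.integers(len(data))]
            clamped = set(visible_idx.tolist())
        for _ in range(burn_in):                  # approach thermal equilibrium
            gibbs_step(state, W, theta, clamped)
        pair += np.outer(state, state)
        single += state
    return pair / n_samples, single / n_samples

# Toy setup: 3 visible + 2 hidden units, illustrative training set and step size.
visible_idx = np.array([0, 1, 2])
n_units = 5
W = np.zeros((n_units, n_units))
theta = np.zeros(n_units)
data = np.array([[1, 1, 0], [0, 1, 1]], dtype=float)
lr = 0.1

for epoch in range(100):
    p_pos, s_pos = sample_statistics(W, theta, n_units, visible_idx, data)
    p_neg, s_neg = sample_statistics(W, theta, n_units, visible_idx, None)
    dW = lr * (p_pos - p_neg)        # follows the sign of p+_ij - p-_ij
    np.fill_diagonal(dW, 0.0)        # no self-connections
    W += dW                          # statistics are symmetric, so W stays symmetric
    theta += lr * (s_pos - s_neg)    # bias update from single-unit statistics
</syntaxhighlight>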