{{short description|Type of stochastic recurrent neural network}} [[File:Boltzmannexamplev1.png|thumb|alt=A graphical representation of an example Boltzmann machine.| A graphical representation of an example Boltzmann machine. Each undirected edge represents dependency. In this example there are 3 hidden units and 4 visible units. This is not a restricted Boltzmann machine.]] A '''Boltzmann machine''' (also called '''Sherrington–Kirkpatrick model with external field''' or '''stochastic Ising model'''), named after [[Ludwig Boltzmann]], is a [[spin glass|spin-glass]] model with an external field, i.e., a [[Spin glass#Sherrington–Kirkpatrick model|Sherrington–Kirkpatrick model]],<ref>{{citation|title=Solvable Model of a Spin-Glass|number=35|year=1975|author1= Sherrington, David|author2=Kirkpatrick, Scott|journal=Physical Review Letters|volume=35|pages=1792–1796|doi=10.1103/PhysRevLett.35.1792|bibcode=1975PhRvL..35.1792S}}</ref> that is a stochastic [[Ising model]]. It is a [[statistical physics]] technique applied in the context of [[cognitive science]].<ref name=":0">{{cite journal |last=Ackley |first=David H. |author2=Hinton, Geoffrey E. |author3=Sejnowski, Terrence J. |year=1985 |title=A Learning Algorithm for Boltzmann Machines |url=http://learning.cs.toronto.edu/~hinton/absps/cogscibm.pdf |journal=[[Cognitive Science (journal)|Cognitive Science]] |volume=9 |issue=1 |pages=147–169 |doi=10.1207/s15516709cog0901_7 |archive-url=https://web.archive.org/web/20110718022336/http://learning.cs.toronto.edu/~hinton/absps/cogscibm.pdf |archive-date=18 July 2011 |doi-access=free}}</ref> It is also classified as a [[Markov random field]].<ref>{{Cite journal|last=Hinton|first=Geoffrey E.|date=2007-05-24|title=Boltzmann machine|journal=Scholarpedia|language=en|volume=2|issue=5|page=1668|doi=10.4249/scholarpedia.1668|bibcode=2007SchpJ...2.1668H|issn=1941-6016|doi-access=free}}</ref> Boltzmann machines are theoretically intriguing because of the locality and [[Hebbian]] nature of their training algorithm (being trained by Hebb's rule), and because of their [[Parallelism (computing)|parallelism]] and the resemblance of their dynamics to simple [[physical process]]es. Boltzmann machines with unconstrained connectivity have not been proven useful for practical problems in [[machine learning]] or [[inference]], but if the connectivity is properly constrained, the learning can be made efficient enough to be useful for practical problems.<ref>{{cite book|title=International Neural Network Conference|first=Thomas R.|last=Osborn|date=1 January 1990|publisher=Springer Netherlands|pages=[https://archive.org/details/innc90parisinter0001inte/page/785 785]|doi=10.1007/978-94-009-0643-3_76|chapter=Fast Teaching of Boltzmann Machines with Local Inhibition|isbn=978-0-7923-0831-7|chapter-url=https://archive.org/details/innc90parisinter0001inte/page/785}}</ref> They are named after the [[Boltzmann distribution]] in [[statistical mechanics]], which is used in their [[sampling function]]. They were heavily popularized and promoted by [[Geoffrey Hinton]], [[Terry Sejnowski]] and [[Yann LeCun]] in cognitive sciences communities, particularly in [[machine learning]],<ref name=":0" /> as part of "[[energy-based model]]s" (EBM), because [[Hamiltonian function|Hamiltonians]] of [[spin glasses]] as energy are used as a starting point to define the learning task.<ref>{{citation|title=On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models|number=34|year=2020|author1=Nijkamp, E. |author2=Hill, M. 
E|author3= Han, T. |journal=Proceedings of the AAAI Conference on Artificial Intelligence|volume=4|pages=5272–5280|doi=10.1609/aaai.v34i04.5973|url=https://ojs.aaai.org/index.php/AAAI/article/view/5973|doi-access=free|arxiv=1903.12370}}</ref> ==Structure== [[File:Boltzmannexamplev2.png|thumb|right|alt=A graphical representation of an example Boltzmann machine with weight labels.| A graphical representation of a Boltzmann machine with a few weights labeled. Each undirected edge represents dependency and is weighted with weight <math>w_{ij}</math>. In this example there are 3 hidden units (blue) and 4 visible units (white). This is not a restricted Boltzmann machine.]] A Boltzmann machine, like a [[Spin glass#Sherrington–Kirkpatrick model|Sherrington–Kirkpatrick model]], is a network of units with a total "energy" ([[Hamiltonian function|Hamiltonian]]) defined for the overall network. Its units produce [[Binary number|binary]] results. Boltzmann machine units are [[stochastic]]. The global energy <math>E</math> in a Boltzmann machine is identical in form to that of [[Hopfield network]]s and [[Ising model]]s: :<math>E = -\left(\sum_{i<j} w_{ij} \, s_i \, s_j + \sum_i \theta_i \, s_i \right)</math> where: * <math>w_{ij}</math> is the connection strength between unit <math>j</math> and unit <math>i</math>. * <math>s_i</math> is the state, <math>s_i \in \{0,1\}</math>, of unit <math>i</math>. * <math>\theta_i</math> is the bias of unit <math>i</math> in the global energy function. (<math>-\theta_i</math> is the activation threshold for the unit.) Often the weights <math>w_{ij}</math> are represented as a symmetric matrix <math>W=[w_{ij}]</math> with zeros along the diagonal. ==Unit state probability== The difference in the global energy that results from a single unit <math>i</math> equaling 0 (off) versus 1 (on), written <math>\Delta E_i</math>, assuming a symmetric matrix of weights, is given by: :<math>\Delta E_i = \sum_{j>i} w_{ij} \, s_j + \sum_{j<i} w_{ji} \, s_j + \theta_i</math> This can be expressed as the difference of energies of two states: :<math>\Delta E_i = E_\text{i=off} - E_\text{i=on}</math> Substituting the energy of each state with its relative probability according to the [[Boltzmann factor]] (the property of a [[Boltzmann distribution]] that the energy of a state is proportional to the negative log probability of that state) yields: :<math> \Delta E_{i} = -k_{B} T \ln(p_\text{i=off}) - (-k_{B} T \ln(p_\text{i=on})), </math> where <math>k_{B}</math> is the [[Boltzmann constant]] and is absorbed into the artificial notion of temperature <math>T</math>. Noting that the probabilities of the unit being ''on'' or ''off'' sum to <math>1</math> allows for the simplification: :<math> -\frac{\Delta E_{i}}{k_{B}T} = -\ln(p_{i=\text{on}}) + \ln(p_{i=\text{off}}) = \ln\Big(\frac{1 - p_{i=\text{on}}}{p_{i=\text{on}}}\Big) = \ln(p_{i=\text{on}}^{-1} - 1), </math> whence the probability that the <math>i</math>-th unit is ''on'' is given by :<math>p_{i=\text{on}} = \frac{1}{1+\exp\Big(-\frac{\Delta E_{i}}{k_{B}T}\Big)},</math> where the [[scalar (physics)|scalar]] <math>T</math> is referred to as the [[temperature]] of the system. This relation is the source of the [[logistic function]] found in probability expressions in variants of the Boltzmann machine. ==Equilibrium state== The network runs by repeatedly choosing a unit and resetting its state.
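As an illustration of this stochastic update rule (a minimal sketch, not part of the standard formulation; the names <code>W</code>, <code>theta</code>, <code>s</code> and <code>T</code> are hypothetical, and the Boltzmann constant is absorbed into <code>T</code>), a single sweep of unit updates can be written in Python as follows:

<syntaxhighlight lang="python">
import numpy as np

def gibbs_sweep(W, theta, s, T=1.0, rng=None):
    """One sweep of stochastic unit updates for a Boltzmann machine.

    W     : symmetric weight matrix with zero diagonal, shape (n, n)
    theta : bias vector, shape (n,)
    s     : current binary state vector with entries in {0, 1}, shape (n,)
    T     : temperature (with the Boltzmann constant absorbed into it)
    """
    if rng is None:
        rng = np.random.default_rng()
    for i in rng.permutation(len(s)):              # repeatedly choose a unit...
        delta_E = W[i] @ s + theta[i]              # energy gap Delta E_i (w_ii = 0)
        p_on = 1.0 / (1.0 + np.exp(-delta_E / T))  # logistic probability
        s[i] = 1 if rng.random() < p_on else 0     # ...and reset its state
    return s
</syntaxhighlight>

Repeating such sweeps while gradually lowering <code>T</code> corresponds to the annealing procedure described next.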
After running for long enough at a certain temperature, the probability of a global state of the network depends only upon that global state's energy, according to a [[Boltzmann distribution]], and not on the initial state from which the process was started. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at [[thermal equilibrium]]", meaning that the probability distribution of global states has converged. The network is run starting from a high temperature, and its temperature is gradually lowered until the network reaches [[thermal equilibrium]] at a lower temperature. It then may converge to a distribution where the energy level fluctuates around the global minimum. This process is called [[simulated annealing]]. To train the network so that it converges to global states according to an external distribution over these states, the weights must be set so that the global states with the highest probabilities get the lowest energies. This is done by training. ==Training== The units in the Boltzmann machine are divided into 'visible' units, V, and 'hidden' units, H. The visible units are those that receive information from the 'environment', i.e. the [[training set]] is a set of binary vectors over the set V. The distribution over the training set is denoted <math>P^{+}(V)</math>. The distribution over global states converges as the Boltzmann machine reaches [[thermal equilibrium]]. We denote this distribution, after we [[Marginal distribution|marginalize]] it over the hidden units, as <math>P^{-}(V)</math>. Our goal is to approximate the "real" distribution <math>P^{+}(V)</math> using the <math>P^{-}(V)</math> produced by the machine. The similarity of the two distributions is measured by the [[Kullback–Leibler divergence]], <math>G</math>: :<math>G = \sum_{v}{P^{+}(v)\ln\left({\frac{P^{+}(v)}{P^{-}(v)}}\right)}</math> where the sum is over all the possible states of <math>V</math>. <math>G</math> is a function of the weights, since they determine the energy of a state, and the energy determines <math>P^{-}(v)</math>, as promised by the Boltzmann distribution. A [[gradient descent]] algorithm over <math>G</math> changes a given weight, <math>w_{ij}</math>, by subtracting the [[partial derivative]] of <math>G</math> with respect to the weight. Boltzmann machine training involves two alternating phases. One is the "positive" phase where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to <math>P^{+}</math>). The other is the "negative" phase where the network is allowed to run freely, i.e. only the input nodes have their state determined by external data, but the output nodes are allowed to float. The gradient with respect to a given weight, <math>w_{ij}</math>, is given by the equation:<ref name=":0" /> :<math>\frac{\partial{G}}{\partial{w_{ij}}} = -\frac{1}{R}[p_{ij}^{+}-p_{ij}^{-}]</math> where: * <math>p_{ij}^{+}</math> is the probability that units ''i'' and ''j'' are both on when the machine is at equilibrium on the positive phase. * <math>p_{ij}^{-}</math> is the probability that units ''i'' and ''j'' are both on when the machine is at equilibrium on the negative phase. * <math>R</math> denotes the [[learning rate]]. This result follows from the fact that at [[thermal equilibrium]] the probability <math>P^{-}(s)</math> of any global state <math>s</math> when the network is free-running is given by the Boltzmann distribution.
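As an illustrative sketch of how the two phases translate into a weight update (the function and variable names are hypothetical, and the constant <math>1/R</math> is folded into the step size <code>lr</code>), the pairwise statistics <math>p_{ij}^{+}</math> and <math>p_{ij}^{-}</math> can be estimated from equilibrium samples and the weights moved against the gradient of <math>G</math>:

<syntaxhighlight lang="python">
import numpy as np

def estimate_pair_stats(samples):
    """Estimate p_ij, the probability that units i and j are both on,
    from an array of sampled binary state vectors of shape (num_samples, n)."""
    samples = np.asarray(samples, dtype=float)
    return samples.T @ samples / len(samples)

def training_step(W, theta, clamped_samples, free_samples, lr=0.01):
    """One gradient-descent step on G.

    clamped_samples : equilibrium states (visible and hidden units) collected
                      while the visible units were clamped to training vectors
                      (the "positive" phase).
    free_samples    : equilibrium states collected while the network ran freely
                      (the "negative" phase).
    """
    clamped_samples = np.asarray(clamped_samples, dtype=float)
    free_samples = np.asarray(free_samples, dtype=float)
    p_plus = estimate_pair_stats(clamped_samples)    # p_ij^+
    p_minus = estimate_pair_stats(free_samples)      # p_ij^-
    W = W + lr * (p_plus - p_minus)                  # move against dG/dw_ij
    np.fill_diagonal(W, 0.0)                         # keep units unconnected to themselves
    theta = theta + lr * (clamped_samples.mean(axis=0) - free_samples.mean(axis=0))
    return W, theta
</syntaxhighlight>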
This learning rule is biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection ([[synapse]], biologically) does not need information about anything other than the two neurons it connects. This is more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as [[backpropagation]]. The training of a Boltzmann machine does not use the [[expectation–maximization algorithm|EM algorithm]], which is heavily used in [[machine learning]]. Minimizing the [[Kullback–Leibler divergence|KL-divergence]] is equivalent to maximizing the log-likelihood of the data. Therefore, the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete data likelihood during the M-step. Training the biases is similar, but uses only single node activity: :<math>\frac{\partial{G}}{\partial{\theta_{i}}} = -\frac{1}{R}[p_{i}^{+}-p_{i}^{-}]</math> ==Problems== Theoretically the Boltzmann machine is a rather general computational medium. For instance, if trained on photographs, the machine would theoretically model the distribution of photographs, and could use that model to, for example, [[Inpainting|complete]] a partial photograph. Unfortunately, Boltzmann machines experience a serious practical problem, namely that learning appears to stop working correctly when the machine is scaled up to anything larger than a trivial size.{{Citation needed|date=January 2013}} This is due to important effects, specifically: * the time required to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths{{Citation needed|date=August 2015}} * connection strengths are more plastic when the connected units have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to follow a [[random walk]] until the activities saturate. ==Types== ===Restricted Boltzmann machine=== [[File:Restricted Boltzmann machine.svg|thumb|right|alt=Graphical representation of an example restricted Boltzmann machine |Graphical representation of a restricted Boltzmann machine. The four blue units represent hidden units, and the three red units represent visible states. In restricted Boltzmann machines there are only connections (dependencies) between hidden and visible units, and none between units of the same type (no hidden-hidden, nor visible-visible connections).]] {{Main|Restricted Boltzmann machine}} Although learning is impractical in general Boltzmann machines, it can be made quite efficient in a restricted Boltzmann machine (RBM), which does not allow intralayer connections within the hidden units or within the visible units, i.e. there are no visible–visible or hidden–hidden connections. After training one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to train many layers of hidden units efficiently and is one of the most common [[deep learning]] strategies. As each new layer is added the generative model improves.
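The efficiency gain comes from the bipartite structure: with no intralayer connections, all hidden units are conditionally independent given the visible units and can be sampled in one parallel step, and vice versa. A minimal illustrative sketch of this block update (hypothetical names; <code>W</code> here is the visible-to-hidden weight matrix) is:

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b_h, rng):
    """Sample every hidden unit in parallel given the visible vector v.
    W has shape (n_visible, n_hidden); b_h is the hidden-bias vector."""
    p_h = sigmoid(v @ W + b_h)           # hidden units are conditionally independent
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_visible(h, W, b_v, rng):
    """Sample every visible unit in parallel given the hidden vector h."""
    p_v = sigmoid(h @ W.T + b_v)
    return (rng.random(p_v.shape) < p_v).astype(float), p_v
</syntaxhighlight>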
An extension to the restricted Boltzmann machine allows using real valued data rather than binary data.<ref>{{Citation|title=Recent Developments in Deep Learning| date=22 March 2010 |url=https://www.youtube.com/watch?v=VdIURAu1-aU |archive-url=https://ghostarchive.org/varchive/youtube/20211222/VdIURAu1-aU |archive-date=2021-12-22 |url-status=live|language=en|access-date=2020-02-17}}{{cbignore}}</ref> One example of a practical RBM application is in speech recognition.<ref>{{cite journal |url=http://research.microsoft.com/pubs/144412/DBN4LVCSR-TransASLP.pdf |title=Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition |journal=Microsoft Research |volume=20 |year=2011|last1=Yu |first1=Dong |last2=Dahl |first2=George |last3=Acero |first3=Alex |last4=Deng |first4=Li }}</ref> ===Deep Boltzmann machine=== A deep Boltzmann machine (DBM) is a type of binary pairwise [[Markov random field]] ([[Graph (discrete mathematics)#Undirected graph|undirected]] probabilistic [[graphical model]]) with multiple layers of [[latent variable|hidden]] [[random variables]]. It is a network of symmetrically coupled stochastic [[binary variable|binary units]]. It comprises a set of visible units <math>\boldsymbol{\nu} \in \{0,1\}^D</math> and layers of hidden units <math>\boldsymbol{h}^{(1)} \in \{0,1\}^{F_1}, \boldsymbol{h}^{(2)} \in \{0,1\}^{F_2}, \ldots, \boldsymbol{h}^{(L)} \in \{0,1\}^{F_L}</math>. No connection links units of the same layer (like [[restricted Boltzmann machine|RBM]]). For the {{tooltip|2=Deep Boltzmann machine|DBM}}, the probability assigned to vector {{mvar|'''ν'''}} is : <math>p(\boldsymbol{\nu}) = \frac{1}{Z}\sum_h e^{\sum_{ij}W_{ij}^{(1)}\nu_i h_j^{(1)} + \sum_{jl}W_{jl}^{(2)}h_j^{(1)}h_l^{(2)}+\sum_{lm}W_{lm}^{(3)}h_l^{(2)}h_m^{(3)}},</math> where <math>\boldsymbol{h} = \{\boldsymbol{h}^{(1)}, \boldsymbol{h}^{(2)}, \boldsymbol{h}^{(3)} \}</math> are the set of hidden units, and <math>\theta = \{\boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{W}^{(3)} \} </math> are the model parameters, representing visible-hidden and hidden-hidden interactions.<ref name="ref12">{{cite journal|last1=Hinton|first1=Geoffrey|last2=Salakhutdinov|first2=Ruslan|date=2012|title=A better way to pretrain deep Boltzmann machines|url=http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2012_1178.pdf|journal=Advances in Neural|volume=3|pages=1–9|access-date=2017-08-18|archive-url=https://web.archive.org/web/20170813152400/http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2012_1178.pdf|archive-date=2017-08-13}}</ref> In a [[Deep belief network|DBN]] only the top two layers form a restricted Boltzmann machine (which is an undirected [[graphical model]]), while lower layers form a directed generative model. In a DBM all layers are symmetric and undirected. Like [[deep belief network|DBNs]], DBMs can learn complex and abstract internal representations of the input in tasks such as [[Object recognition|object]] or [[speech recognition]], using limited, labeled data to fine-tune the representations built using a large set of unlabeled sensory input data. 
However, unlike DBNs and deep [[convolutional neural networks]], they pursue the inference and training procedure in both directions, bottom-up and top-down, which allow the DBM to better unveil the representations of the input structures.<ref name="ref32">{{cite conference |last1=Hinton|first1=Geoffrey|last2=Salakhutdinov|first2=Ruslan|date=2009|title=Efficient Learning of Deep Boltzmann Machines |book-title=Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics |url=http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS09_SalakhutdinovH.pdf|volume=3|pages=448–455|access-date=2017-08-18|archive-url=https://web.archive.org/web/20151106235714/http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS09_SalakhutdinovH.pdf|archive-date=2015-11-06}}</ref><ref name="ref42">{{cite web |last1=Bengio|first1=Yoshua|last2=LeCun|first2=Yann|date=2007|title=Scaling Learning Algorithms towards AI |website=Université de Montréal |url=http://www.iro.umontreal.ca/~lisa/bib/pub_subject/language/pointeurs/bengio+lecun-chapter2007.pdf |type=Preprint}}</ref><ref name="ref22">{{cite conference |last1=Larochelle|first1=Hugo|last2=Salakhutdinov|first2=Ruslan|date=2010|title=Efficient Learning of Deep Boltzmann Machines |book-title=Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics |url=http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_SalakhutdinovL10.pdf|pages=693–700|access-date=2017-08-18|archive-url=https://web.archive.org/web/20170814001329/http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_SalakhutdinovL10.pdf|archive-date=2017-08-14}}</ref> However, the slow speed of DBMs limits their performance and functionality. Because exact maximum likelihood learning is intractable for DBMs, only approximate maximum likelihood learning is possible. Another option is to use mean-field inference to estimate data-dependent expectations and approximate the expected sufficient statistics by using [[Markov chain Monte Carlo]] (MCMC).<ref name="ref12" /> This approximate inference, which must be done for each test input, is about 25 to 50 times slower than a single bottom-up pass in DBMs. This makes joint optimization impractical for large data sets, and restricts the use of DBMs for tasks such as feature representation. ===Spike-and-slab RBMs=== The need for deep learning with [[real number|real-valued]] inputs, as in [[Gaussian]] RBMs, led to the spike-and-slab [[restricted Boltzmann machine|RBM]] (''ss''[[restricted Boltzmann machine|RBM]]), which models continuous-valued inputs with [[binary variable|binary]] [[latent variable]]s.<ref name="ref30">{{cite journal|last1=Courville|first1=Aaron|last2=Bergstra|first2=James|last3=Bengio|first3=Yoshua|date=2011|title=A Spike and Slab Restricted Boltzmann Machine|url=http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2011_CourvilleBB11.pdf|journal=JMLR: Workshop and Conference Proceeding|volume=15|pages=233–241|access-date=2019-08-25|archive-url=https://web.archive.org/web/20160304112418/http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2011_CourvilleBB11.pdf|archive-date=2016-03-04}}</ref> Similar to basic [[restricted Boltzmann machine|RBMs]] and its variants, a spike-and-slab RBM is a [[bipartite graph]], while like G[[restricted Boltzmann machine|RBMs]], the visible units (input) are real-valued. The difference is in the hidden layer, where each hidden unit has a binary spike variable and a real-valued slab variable. 
A spike is a discrete [[probability mass]] at zero, while a slab is a [[Probability density|density]] over continuous domain;<ref name="ref322">{{cite conference|last1=Courville|first1=Aaron|last2=Bergstra|first2=James|last3=Bengio|first3=Yoshua|date=2011|title=Proceedings of the 28th International Conference on Machine Learning|volume=10|pages=1–8|chapter=Unsupervised Models of Images by Spike-and-Slab RBMs|chapter-url=http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Courville_591.pdf|conference=|access-date=2019-08-25|archive-date=2016-03-04|archive-url=https://web.archive.org/web/20160304054551/http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Courville_591.pdf}}</ref> their mixture forms a [[Prior probability|prior]].<ref name="ref31">{{cite journal|last1=Mitchell|first1=T|last2=Beauchamp|first2=J|date=1988|title=Bayesian Variable Selection in Linear Regression|journal=Journal of the American Statistical Association|volume=83|issue=404|pages=1023–1032|doi=10.1080/01621459.1988.10478694}}</ref> An extension of ss[[restricted Boltzmann machine|RBM]] called μ-ss[[restricted Boltzmann machine|RBM]] provides extra modeling capacity using additional terms in the [[energy function]]. One of these terms enables the model to form a [[conditional probability distribution|conditional distribution]] of the spike variables by [[marginalizing out]] the slab variables given an observation. ===In mathematics=== {{Main|Gibbs measure|Log-linear model}} In more general mathematical setting, the Boltzmann distribution is also known as the [[Gibbs measure]]. In [[statistics]] and [[machine learning]] it is called a [[log-linear model]]. In [[deep learning]] the Boltzmann distribution is used in the sampling distribution of [[stochastic neural network]]s such as the Boltzmann machine. ==History== The Boltzmann machine is based on the Sherrington–Kirkpatrick [[spin glass]] model by [[David Sherrington (physicist)|David Sherrington]] and [[Scott Kirkpatrick]].<ref>{{Cite journal|last1=Sherrington|first1=David|last2=Kirkpatrick|first2=Scott|date=1975-12-29|title=Solvable Model of a Spin-Glass|journal=Physical Review Letters|volume=35|issue=26|pages=1792–1796|doi=10.1103/physrevlett.35.1792|bibcode=1975PhRvL..35.1792S|issn=0031-9007}}</ref> The seminal publication by [[John Hopfield]] (1982) applied methods of statistical mechanics, mainly the recently developed (1970s) theory of spin glasses, to study [[Hopfield network|associative memory]] (later named the "Hopfield network").<ref>{{Cite journal |last=Hopfield, J. J. |year=1982 |title=Neural networks and physical systems with emergent collective computational abilities |journal=Proceedings of the National Academy of Sciences of the United States of America |publisher=[s.n.] 
|volume=79 |issue=8 |pages=2554–8 |bibcode=1982PNAS...79.2554H |doi=10.1073/pnas.79.8.2554 |oclc=848771572 |pmc=346238 |pmid=6953413 |doi-access=free}}</ref> The original contribution in applying such energy-based models in cognitive science appeared in papers by [[Geoffrey Hinton]] and [[Terry Sejnowski]].<ref>{{Cite conference|url=http://digitalcollections.library.cmu.edu/awweb/awarchive?type=file&item=360445|title=Analyzing Cooperative Computation|last1=Hinton|first1=Geoffery|date=May 1983|access-date=17 February 2020|first2=Terrence J.|last2=Sejnowski|conference=5th Annual Congress of the Cognitive Science Society|location=Rochester, New York}}{{Dead link|date=June 2024 |bot=InternetArchiveBot |fix-attempted=yes }}</ref><ref>{{cite conference|first1= Geoffrey E. |last1=Hinton |first2= Terrence J. |last2=Sejnowski|title=Optimal Perceptual Inference |conference=IEEE Conference on Computer Vision and Pattern Recognition (CVPR)|pages= 448–453|publisher= IEEE Computer Society|location= Washington, D.C.|date=June 1983}}</ref><ref>Fahlman SE, Hinton GE, Sejnowski TJ. ''[https://www.cs.toronto.edu/~fritz/absps/fahlmanBM.pdf Massively parallel architectures for Al: NETL, Thistle, and Boltzmann machines.]'' In: Genesereth MR, editor. ''AAAI-83.'' Washington, DC: AAAI; 1983. pp. 109–113</ref> In a 1995 interview, Hinton stated that in 1983 February or March, he was going to give a talk on [[simulated annealing]] in Hopfield networks, so he had to design a learning algorithm for the talk, resulting in the Boltzmann machine learning algorithm.<ref>Chapter 16. Rosenfeld, Edward, and James A. Anderson, eds. 2000. ''Talking Nets: An Oral History of Neural Networks''. Reprint edition. The MIT Press.</ref> The idea of applying the Ising model with annealed [[Gibbs sampling]] was used in [[Douglas Hofstadter]]'s [[Copycat (software)|Copycat]] project (1984).<ref>{{Cite book|last=Hofstadter, D. R.|title=The Copycat Project: An Experiment in Nondeterminism and Creative Analogies.|date=January 1984|publisher=Defense Technical Information Center|oclc=227617764}}</ref><ref>{{Cite book|editor-last=Caianiello|editor-first=Eduardo R.|title=Physics of cognitive processes|date=1988|publisher=World Scientific|isbn=9971-5-0255-0|oclc=750950619|last=Hofstadter |first=Douglas R. |chapter=A Non-Deterministic Approach to Analogy, Involving the Ising Model of Ferromagnetism|location=Teaneck, New Jersey}}</ref> The explicit analogy drawn with statistical mechanics in the Boltzmann machine formulation led to the use of terminology borrowed from physics (e.g., "energy"), which became standard in the field. The widespread adoption of this terminology may have been encouraged by the fact that its use led to the adoption of a variety of concepts and methods from statistical mechanics. The various proposals to use simulated annealing for inference were apparently independent. Similar ideas (with a change of sign in the energy function) are found in [[Paul Smolensky]]'s "Harmony Theory".<ref>Smolensky, Paul. "Information processing in dynamical systems: Foundations of harmony theory." (1986): 194-281.</ref> Ising models can be generalized to [[Markov random field]]s, which find widespread application in [[linguistics]], [[robotics]], [[computer vision]] and [[artificial intelligence]]. 
In 2024, Hopfield and Hinton were awarded the [[Nobel Prize in Physics]] for their foundational contributions to [[machine learning]], such as the Boltzmann machine.<ref>{{Cite web |last=Johnston |first=Hamish |date=2024-10-08 |title=John Hopfield and Geoffrey Hinton share the 2024 Nobel Prize for Physics |url=https://physicsworld.com/a/john-hopfield-and-geoffrey-hinton-share-the-2024-nobel-prize-for-physics/ |access-date=2024-10-18 |website=Physics World |language=en-GB}}</ref> ==See also== * [[Restricted Boltzmann machine]] * [[Helmholtz machine]] * [[Markov random field]] (MRF) * [[Ising model]] (Lenz–Ising model) * [[Hopfield network]] ==References== {{Reflist}} ==Further reading== * {{cite journal |last1=Hinton |first1=G. E. |author-link1=Geoffrey Hinton |last2=Sejnowski |first2=T. J. |author-link2=Terry Sejnowski |year=1986 |title=Learning and Relearning in Boltzmann Machines |editor=D. E. Rumelhart |editor2=J. L. McClelland |journal=Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations |pages=282–317 |url=http://www.cs.toronto.edu/~hinton/absps/pdp7.pdf |archive-url=https://web.archive.org/web/20100705230134/http://learning.cs.toronto.edu/~hinton/absps/pdp7.pdf |archive-date=2010-07-05 }} * {{cite journal |doi=10.1162/089976602760128018 |last1=Hinton |first1=G. E. |author-link1=Geoffrey Hinton |year=2002 |title=Training Products of Experts by Minimizing Contrastive Divergence |journal=[[Neural Computation (journal)|Neural Computation]] |volume=14 |issue=8 |pages=1771–1800 |url=http://www.cs.toronto.edu/~hinton/absps/nccd.pdf |pmid=12180402 |citeseerx=10.1.1.35.8613 |s2cid=207596505 }} * {{cite journal |doi=10.1162/neco.2006.18.7.1527 |last1=Hinton |first1=G. E. |author-link1=Geoffrey Hinton |last2=Osindero |first2=S. |last3=Teh |first3=Y. |year=2006 |title=A fast learning algorithm for deep belief nets |journal=[[Neural Computation (journal)|Neural Computation]] |volume=18 |issue=7 |pages=1527–1554 |url=http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf |pmid=16764513 |citeseerx=10.1.1.76.1541 |s2cid=2309950 }} * [https://www.forbes.com/sites/tomtaulli/2020/02/02/coronavirus-can-ai-artificial-intelligence-make-a-difference/?sh=1eca51e55817 Kothari P (2020)] * {{Cite web |last=Montufar|first=Guido|year=2018|title=Restricted Boltzmann Machines: Introduction and Review |type=Preprint |url=https://www.mis.mpg.de/preprints/2018/preprint2018_87.pdf|language=en|website=[[MPI MiS]] |access-date=1 August 2023}} ==External links== *[https://www.scholarpedia.org/article/Boltzmann_Machine Scholarpedia article by Hinton about Boltzmann machines] *[https://www.youtube.com/watch?v=AyzOUbkUf3M Talk at Google by Geoffrey Hinton] {{Statistical mechanics topics}} {{Authority control}} [[Category:Neural network architectures]] [[Category:Ludwig Boltzmann|Machine]] [[Category:Mathematical physics]]