== Applications ==
MCMC methods are primarily used for calculating [[Numerical analysis|numerical approximations]] of [[Multiple integral|multi-dimensional integrals]], for example in [[Bayesian statistics]], [[computational physics]],<ref>{{Cite journal|last1=Kasim|first1=M.F.|last2=Bott|first2=A.F.A.|last3=Tzeferacos|first3=P.|last4=Lamb|first4=D.Q.|last5=Gregori|first5=G.|last6=Vinko|first6=S.M.|date=September 2019|title=Retrieving fields from proton radiography without source profiles|journal=Physical Review E|volume=100|issue=3|page=033208|doi=10.1103/PhysRevE.100.033208|pmid=31639953|arxiv=1905.12934|bibcode=2019PhRvE.100c3208K|s2cid=170078861}}</ref> [[computational biology]]<ref>{{Cite journal|last1=Gupta|first1=Ankur|last2=Rawlings|first2=James B.|date=April 2014|title=Comparison of Parameter Estimation Methods in Stochastic Chemical Kinetic Models: Examples in Systems Biology|journal=AIChE Journal|volume=60|issue=4|pages=1253–1268|doi=10.1002/aic.14409|pmc=4946376|pmid=27429455|bibcode=2014AIChE..60.1253G}}</ref> and [[computational linguistics]].<ref>See Gill 2008.</ref><ref>See Robert & Casella 2004.</ref>

=== Bayesian statistics ===
In Bayesian statistics, Markov chain Monte Carlo methods are typically used to calculate [[Moment (mathematics)|moments]] and [[credible interval]]s of [[posterior probability]] distributions. The use of MCMC methods makes it possible to compute large [[Bayesian network#Hierarchical models|hierarchical models]] that require integrations over hundreds to thousands of unknown parameters.<ref>{{cite book|last1=Banerjee|first1=Sudipto|last2=Carlin|first2=Bradley P.|last3=Gelfand|first3=Alan P.|title=Hierarchical Modeling and Analysis for Spatial Data|publisher=CRC Press|isbn=978-1-4398-1917-3|page=xix|edition=Second|date=2014-09-12}}</ref>

=== Statistical physics ===
Many contemporary research problems in statistical physics can be addressed by approximate solutions using Monte Carlo simulation, which provides valuable insight into the properties of complex systems. Monte Carlo methods are fundamental in computational physics, physical chemistry, and related disciplines, with broad applications including medical physics, where they are employed to model radiation transport for radiation dosimetry calculations.<ref>{{Cite journal |last1=Jia |first1=Xun |last2=Ziegenhein |first2=Peter |last3=Jiang |first3=Steve B. |date=2014-02-21 |title=GPU-based high-performance computing for radiation therapy |journal=Physics in Medicine and Biology |volume=59 |issue=4 |pages=R151–182 |doi=10.1088/0031-9155/59/4/R151 |issn=1361-6560 |pmc=4003902 |pmid=24486639 |bibcode=2014PMB....59R.151J}}</ref><ref>{{Cite journal |last=Rogers |first=D. W. O. |date=July 2006 |title=REVIEW: Fifty years of Monte Carlo simulations for medical physics |url=https://ui.adsabs.harvard.edu/abs/2006PMB....51R.287R/abstract |journal=Physics in Medicine and Biology |language=en |volume=51 |issue=13 |pages=R287–R301 |doi=10.1088/0031-9155/51/13/R17 |pmid=16790908 |bibcode=2006PMB....51R.287R |issn=0031-9155}}</ref> Instead of exhaustively analyzing all possible system states, the Monte Carlo method examines a randomly drawn subset of them to form a representative sample, from which accurate approximations of the system's characteristic properties are obtained. As the number of sampled states increases, the approximation error decreases.
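As a concrete illustration of these ideas, the following minimal sketch draws samples from a one-dimensional target with the random-walk Metropolis algorithm and reports a posterior mean and 95% credible interval. It assumes only Python with NumPy; the standard-normal target is a hypothetical stand-in for a posterior density, not an example taken from the cited sources.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalized log-density of a hypothetical 1-D posterior;
    # here a standard normal stands in for p(theta | data).
    return -0.5 * x**2

n_steps, step_size = 50_000, 1.0
samples = np.empty(n_steps)
x = 0.0  # initial state of the chain

for i in range(n_steps):
    proposal = x + step_size * rng.normal()        # symmetric random-walk proposal
    log_accept = log_target(proposal) - log_target(x)
    if np.log(rng.uniform()) < log_accept:         # Metropolis acceptance rule
        x = proposal
    samples[i] = x

samples = samples[5_000:]                          # discard burn-in
print("posterior mean ~", samples.mean())
print("95% credible interval ~", np.percentile(samples, [2.5, 97.5]))
</syntaxhighlight>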
=== Complex distribution sampling ===
[[File:Wikipedia logo langevin dynamics.gif|thumb|A simulation of sampling from a Wikipedia-logo-like distribution via Langevin dynamics and score matching.]]
Langevin dynamics is typically used in complex distribution sampling and generative modeling,<ref>{{Cite journal |last=Hinton |first=Geoffrey E. |date=2002-08-01 |title=Training Products of Experts by Minimizing Contrastive Divergence |url=https://ieeexplore.ieee.org/document/6789337 |journal=Neural Computation |volume=14 |issue=8 |pages=1771–1800 |doi=10.1162/089976602760128018 |pmid=12180402 |issn=0899-7667}}</ref><ref name=":0">{{Citation |last1=Song |first1=Yang |last2=Ermon |first2=Stefano |title=Generative modeling by estimating gradients of the data distribution |date=2019-12-08 |work=Proceedings of the 33rd International Conference on Neural Information Processing Systems |issue=1067 |pages=11918–11930 |url=https://dl.acm.org/doi/10.5555/3454287.3455354 |access-date=2025-04-28 |place=Red Hook, NY, USA |publisher=Curran Associates Inc.}}</ref> via an MCMC procedure. Specifically, given the probability density function <math>p(x)</math>, its log gradient <math>\nabla_x \log p(x)</math> is used as the score function, and the chain is initialized from a prior distribution <math>x_0 \sim p_0</math>. A chain is then built by the update
:<math>x_{i+1} = x_i + \epsilon \nabla_x \log p(x_i) + \sqrt{2\epsilon}\, z_i, \qquad z_i \sim \mathcal{N}(0, I),</math>
for <math>i = 0, \dots, K</math>. As <math>\epsilon \rightarrow 0</math> and <math>K \rightarrow \infty</math>, <math>x_K</math> converges to a sample from the target distribution <math>p(x)</math>. Hence, for a complex distribution whose probability density function is known but difficult to sample from directly, Langevin dynamics provides an alternative. In most cases, however, especially in generative modeling, neither the exact probability density function of the target distribution nor its score function <math>\nabla_x \log p(x)</math> is known. In this case, score matching methods<ref>{{Cite journal |last=Hyvärinen |first=Aapo |date=2005 |title=Estimation of Non-Normalized Statistical Models by Score Matching |url=https://jmlr.org/papers/v6/hyvarinen05a.html |journal=Journal of Machine Learning Research |volume=6 |issue=24 |pages=695–709 |issn=1533-7928}}</ref><ref name=":1">{{Cite journal |last=Vincent |first=Pascal |date=July 2011 |title=A Connection Between Score Matching and Denoising Autoencoders |url=https://ieeexplore.ieee.org/document/6795935 |journal=Neural Computation |volume=23 |issue=7 |pages=1661–1674 |doi=10.1162/NECO_a_00142 |pmid=21492012 |issn=0899-7667}}</ref><ref name=":2">{{Cite journal |last1=Song |first1=Yang |last2=Garg |first2=Sahaj |last3=Shi |first3=Jiaxin |last4=Ermon |first4=Stefano |date=2020-08-06 |title=Sliced Score Matching: A Scalable Approach to Density and Score Estimation |url=https://proceedings.mlr.press/v115/song20a |journal=Proceedings of the 35th Uncertainty in Artificial Intelligence Conference |language=en |publisher=PMLR |pages=574–584}}</ref> provide feasible solutions, minimizing the [[Fisher information metric|Fisher divergence]] between a parameterized score-based model <math>s_\theta(x)</math> and the true score function without access to the ground-truth data score. The score-based model can be fitted on a training dataset by [[stochastic gradient descent]].
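The update rule above translates directly into code. The following sketch runs unadjusted Langevin dynamics with an analytically known score; the two-component Gaussian mixture target and all parameter values are illustrative assumptions, not taken from the cited papers (Python with NumPy assumed).

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def score(x, mu1=-2.0, mu2=2.0):
    # Analytic score (gradient of log p) of an equal-weight mixture of
    # two unit-variance Gaussians centered at mu1 and mu2.
    w1 = np.exp(-0.5 * (x - mu1)**2)
    w2 = np.exp(-0.5 * (x - mu2)**2)
    return (-(x - mu1) * w1 - (x - mu2) * w2) / (w1 + w2)

eps, K = 1e-2, 10_000
x = rng.normal(size=1000)            # x_0 ~ p_0: a standard-normal prior

for _ in range(K):
    z = rng.normal(size=x.shape)     # z_i ~ N(0, I)
    x = x + eps * score(x) + np.sqrt(2 * eps) * z   # Langevin update

# x now approximates samples from the bimodal target
print("sample mean:", x.mean(), "sample std:", x.std())
</syntaxhighlight>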
In real cases, however, the training data occupy only a small region of the target distribution's support, and the estimated score function is inaccurate in low-density regions with few available data examples. To overcome this challenge, denoising score matching<ref name=":0" /><ref name=":1" /><ref name=":4">{{Cite journal |last1=Song |first1=Yang |last2=Ermon |first2=Stefano |date=2020-12-06 |title=Improved techniques for training score-based generative models |url=https://dl.acm.org/doi/abs/10.5555/3495724.3496767 |journal=Proceedings of the 34th International Conference on Neural Information Processing Systems |series=NIPS '20 |location=Red Hook, NY, USA |publisher=Curran Associates Inc. |pages=12438–12448 |isbn=978-1-7138-2954-6}}</ref> methods perturb the available data examples with noise of different scales, which improves the coverage of low-density regions, and use the perturbed examples as the training dataset for the score-based model. The choice of noise scales is delicate: noise that is too large corrupts the original data, while noise that is too small fails to spread the data into the low-density regions. Carefully crafted noise schedules<ref name=":0" /><ref name=":2" /><ref name=":4" /> are therefore applied for higher-quality generation.
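For concreteness, the following sketch illustrates denoising score matching at a single noise scale: a data point <math>x</math> is perturbed to <math>\tilde{x} = x + \sigma z</math>, and the model score is regressed onto the score of the Gaussian perturbation kernel, <math>-(\tilde{x} - x)/\sigma^2</math>. The one-parameter linear score model and the Gaussian toy data are hypothetical choices for illustration, not constructions from the cited papers (Python with NumPy assumed).

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a hypothetical target: zero-mean Gaussian with variance 4.
x_data = 2.0 * rng.normal(size=5000)

def score_model(x, theta):
    # Hypothetical one-parameter score model s_theta(x) = theta * x;
    # the true score of N(0, v) is -x / v, so after perturbation the
    # optimum is near theta = -1 / (v + sigma**2).
    return theta * x

def dsm_loss_grad(theta, x, sigma):
    # Denoising score matching: perturb x with Gaussian noise of scale
    # sigma and regress the model score onto -(x_tilde - x) / sigma**2,
    # the score of the perturbation kernel.
    z = rng.normal(size=x.shape)
    x_tilde = x + sigma * z
    target = -(x_tilde - x) / sigma**2
    resid = score_model(x_tilde, theta) - target
    loss = np.mean(resid**2)
    grad = np.mean(2 * resid * x_tilde)   # d(loss)/d(theta)
    return loss, grad

theta, lr, sigma = 0.0, 1e-3, 0.5
for step in range(2000):                  # plain stochastic gradient descent
    loss, grad = dsm_loss_grad(theta, x_data, sigma)
    theta -= lr * grad

print("fitted theta:", theta, "expected ~", -1.0 / (4.0 + 0.5**2))
</syntaxhighlight>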