=== Complex Distribution Sampling ===
[[File:Wikipedia logo langevin dynamics.gif|thumb|A simulation of sampling from a Wikipedia-logo-like distribution via Langevin dynamics and score matching.]]
Langevin dynamics are typically used in complex distribution sampling and generative modeling,<ref>{{Cite journal |last=Hinton |first=Geoffrey E. |date=2002-08-01 |title=Training Products of Experts by Minimizing Contrastive Divergence |url=https://ieeexplore.ieee.org/document/6789337 |journal=Neural Computation |volume=14 |issue=8 |pages=1771–1800 |doi=10.1162/089976602760128018 |pmid=12180402 |issn=0899-7667}}</ref><ref name=":0">{{Citation |last1=Song |first1=Yang |title=Generative modeling by estimating gradients of the data distribution |date=2019-12-08 |work=Proceedings of the 33rd International Conference on Neural Information Processing Systems |issue=1067 |pages=11918–11930 |url=https://dl.acm.org/doi/10.5555/3454287.3455354 |access-date=2025-04-28 |place=Red Hook, NY, USA |publisher=Curran Associates Inc. |last2=Ermon |first2=Stefano}}</ref> via an MCMC procedure. Specifically, given the probability density function <math>p(x)</math>, we use its log gradient <math>\nabla_x \log p(x)</math> as the score function and start from a prior distribution <math>x_0 \sim p_0</math>. A chain is then built by
:<math> x_{i+1} = x_i + \epsilon \nabla_x \log p(x_i) + \sqrt{2 \epsilon}\, z_i, \quad z_i \sim \mathcal{N}(0, I) </math>
for <math>i=0, \dots, K-1</math>. As <math>\epsilon \rightarrow 0</math> and <math>K \rightarrow \infty</math>, <math>x_K</math> converges to a sample from the target distribution <math>p(x)</math>. For a complex distribution whose probability density function is known but difficult to sample from directly, Langevin dynamics can be applied as an alternative. In most cases, however, and especially in generative modeling, neither the exact probability density function of the target distribution nor its score function <math>\nabla_x \log p(x)</math> is known. In this case, score matching methods<ref>{{Cite journal |last=Hyvärinen |first=Aapo |date=2005 |title=Estimation of Non-Normalized Statistical Models by Score Matching |url=https://jmlr.org/papers/v6/hyvarinen05a.html |journal=Journal of Machine Learning Research |volume=6 |issue=24 |pages=695–709 |issn=1533-7928}}</ref><ref name=":1">{{Cite journal |last=Vincent |first=Pascal |date=July 2011 |title=A Connection Between Score Matching and Denoising Autoencoders |url=https://ieeexplore.ieee.org/document/6795935 |journal=Neural Computation |volume=23 |issue=7 |pages=1661–1674 |doi=10.1162/NECO_a_00142 |pmid=21492012 |issn=0899-7667}}</ref><ref name=":2">{{Cite journal |last1=Song |first1=Yang |last2=Garg |first2=Sahaj |last3=Shi |first3=Jiaxin |last4=Ermon |first4=Stefano |date=2020-08-06 |title=Sliced Score Matching: A Scalable Approach to Density and Score Estimation |url=https://proceedings.mlr.press/v115/song20a |journal=Proceedings of the 35th Uncertainty in Artificial Intelligence Conference |language=en |publisher=PMLR |pages=574–584}}</ref> provide feasible solutions, minimizing the [[Fisher information metric|Fisher divergence]] between a parameterized score-based model <math>s_\theta(x)</math> and the true score function without knowing the ground-truth data score. The score model can be estimated on a training dataset by [[stochastic gradient descent]].
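As a rough illustration of the update rule above (not taken from the cited works), the following Python sketch assumes the score function is available in closed form; the helper name <code>langevin_sample</code> and its parameters are illustrative only.
<syntaxhighlight lang="python">
import numpy as np

def langevin_sample(score_fn, x0, eps=1e-3, n_steps=1000, rng=None):
    """Draw an approximate sample by iterating the Langevin update
    x_{i+1} = x_i + eps * score(x_i) + sqrt(2 * eps) * z_i."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + eps * score_fn(x) + np.sqrt(2 * eps) * z
    return x

# Example: a 2-D standard Gaussian has score grad_x log p(x) = -x,
# so the chain should settle near the origin with unit spread.
score = lambda x: -x
sample = langevin_sample(score, x0=np.array([5.0, -5.0]), eps=1e-2, n_steps=5000)
print(sample)
</syntaxhighlight>
With a small step size <math>\epsilon</math> and many steps, the final iterate behaves like a draw from the target density; in practice a Metropolis correction (as in the Metropolis-adjusted Langevin algorithm) can be added to remove the discretization bias.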
In real cases, however, the training data occupy only a small region of the target distribution, so the estimated score function is inaccurate in low-density regions with few available data examples. To overcome this challenge, denoising score matching<ref name=":0" /><ref name=":1" /><ref name=":4">{{Cite journal |last1=Song |first1=Yang |last2=Ermon |first2=Stefano |date=2020-12-06 |title=Improved techniques for training score-based generative models |url=https://dl.acm.org/doi/abs/10.5555/3495724.3496767 |journal=Proceedings of the 34th International Conference on Neural Information Processing Systems |series=NIPS '20 |location=Red Hook, NY, USA |publisher=Curran Associates Inc. |pages=12438–12448 |isbn=978-1-7138-2954-6}}</ref> methods perturb the available data examples with noise of different scales, which improves the coverage of low-density regions, and use the perturbed examples as the training dataset for the score-based model. The choice of noise scales is delicate: noise that is too large corrupts the original data, while noise that is too small fails to spread the data into the low-density regions. Carefully crafted noise schedules<ref name=":0" /><ref name=":2" /><ref name=":4" /> are therefore applied for higher-quality generation.
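A minimal sketch of this objective is given below, assuming a trainable callable <code>score_model(x_noisy, sigma)</code> and a fixed 1-D tensor of noise scales; the function name and signature are hypothetical, and the <math>\sigma^2</math> weighting follows the multi-scale scheme described in the cited works.
<syntaxhighlight lang="python">
import torch

def dsm_loss(score_model, x, sigmas):
    """Denoising score matching loss over several noise scales.

    `score_model(x_noisy, sigma)` is assumed to return the estimated score
    at noise level sigma; `sigmas` is a 1-D tensor of noise scales.
    """
    # Pick one noise scale per training example.
    idx = torch.randint(len(sigmas), (x.shape[0],), device=x.device)
    sigma = sigmas[idx].view(-1, *([1] * (x.dim() - 1)))   # broadcastable shape
    z = torch.randn_like(x)
    x_noisy = x + sigma * z              # perturb the data with Gaussian noise
    target = -z / sigma                  # score of the Gaussian perturbation kernel
    pred = score_model(x_noisy, sigma)
    per_sample = (pred - target).pow(2).reshape(x.shape[0], -1).sum(dim=1)
    # Weight each scale by sigma^2 so all noise levels contribute comparably.
    return (sigmas[idx] ** 2 * per_sample).mean()
</syntaxhighlight>
Minimizing this loss over the dataset trains the score model to approximate the score of the noise-perturbed data distribution at every scale, after which annealed Langevin dynamics can be run from coarse to fine noise levels to generate samples.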