== Convergence ==
Usually it is not hard to construct a Markov chain with the desired properties. The more difficult problem is to determine (1) when to start collecting statistics and (2) how many steps are needed to converge to the stationary distribution within an acceptable error.<ref name="Colwes and Carlin, 1996">{{cite journal|last1=Cowles|first1=M.K.|last2=Carlin|first2=B.P.|title=Markov chain Monte Carlo convergence diagnostics: a comparative review|journal=Journal of the American Statistical Association|date=1996|volume=91|issue=434|pages=883–904|doi=10.1080/01621459.1996.10476956|citeseerx=10.1.1.53.3445}}</ref><ref>{{Cite journal |last=Roy |first=Vivekananda |date=2020-03-07 |title=Convergence Diagnostics for Markov Chain Monte Carlo |url=https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-031219-041300 |journal=Annual Review of Statistics and Its Application |language=en |volume=7 |issue=1 |pages=387–412 |doi=10.1146/annurev-statistics-031219-041300 |arxiv=1909.11827 |bibcode=2020AnRSA...7..387R |issn=2326-8298}}</ref> Fortunately, there are a variety of practical diagnostics to empirically assess convergence.

=== Total Variation Distance ===
Formally, let <math>\pi</math> denote the stationary distribution and <math>P^t(x, \cdot)</math> the distribution of the Markov chain after <math>t</math> steps starting from state <math>x</math>. Theoretically, convergence can be quantified by measuring the [[Total variation distance of probability measures|total variation distance]]:
:<math> d_{\text{TV}}(P^t(x,\cdot), \pi) = \sup_{A} |P^t(x,A) - \pi(A)| </math>
A chain is said to mix rapidly if <math>d_{\text{TV}}(P^t(x,\cdot),\pi) \leq \epsilon</math> for all <math>x \in \mathcal{X}</math> within a small number of steps <math>t</math>, for a pre-defined tolerance <math>\epsilon > 0</math>. In other words, the stationary distribution is reached quickly starting from an arbitrary position, and the minimum such <math>t</math> is known as the [[Markov chain mixing time|mixing time]]. In practice, however, the total variation distance is generally intractable to compute, especially in high-dimensional problems or when the stationary distribution is only known up to a normalizing constant (as in most Bayesian applications).

=== Gelman-Rubin Diagnostics ===
The [[Gelman-Rubin statistic]], also known as the '''potential scale reduction factor (PSRF)''', evaluates MCMC convergence by sampling multiple independent Markov chains and comparing within-chain and between-chain variances.<ref name="Gelman and Rubin, 1992">{{cite journal |last1=Gelman |first1=A. |last2=Rubin |first2=D.B. |date=1992 |title=Inference from iterative simulation using multiple sequences (with discussion) |url=http://www.stat.duke.edu/~scs/Courses/Stat376/Papers/ConvergeDiagnostics/GelmanRubinStatSci1992.pdf |journal=Statistical Science |volume=7 |issue=4 |pages=457–511 |bibcode=1992StaSc...7..457G |doi=10.1214/ss/1177011136 |doi-access=free}}</ref> If all chains have converged to the same stationary distribution, the between-chain and within-chain variances should be similar, and the PSRF should be close to 1. In practice, a PSRF below 1.1 is often taken as evidence of convergence; higher values suggest that the chains are still exploring different parts of the target distribution.
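As an illustration, a minimal sketch of the PSRF computation for a scalar quantity might look as follows in Python with NumPy; the function name and the synthetic example chains are purely illustrative and are not taken from any particular MCMC library.

<syntaxhighlight lang="python">
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (PSRF) for a set of chains.

    ``chains`` is an (m, n) array: m independent chains of length n,
    all tracking the same scalar quantity.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)                          # per-chain means
    grand_mean = chain_means.mean()
    B = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()                      # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n                          # pooled variance estimate
    return np.sqrt(var_hat / W)                                # PSRF; close to 1 indicates convergence

# Four synthetic "chains" drawn from the same distribution: the PSRF should be near 1.
rng = np.random.default_rng(0)
print(gelman_rubin(rng.normal(size=(4, 1000))))
</syntaxhighlight>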
=== Geweke Diagnostics ===
The Geweke diagnostic examines whether the distribution of samples in the early part of the Markov chain is statistically indistinguishable from the distribution in a later part.<ref>{{Citation |last=Geweke |first=John |title=Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments |date=1992-08-13 |work=Bayesian Statistics 4 |pages=169–194 |editor-last=Bernardo |editor-first=J M |url=https://academic.oup.com/book/54041/chapter/422209572 |access-date=2025-04-29 |publisher=Oxford University Press |language=en |doi=10.1093/oso/9780198522669.003.0010 |isbn=978-0-19-852266-9 |editor2-last=Berger |editor2-first=J O |editor3-last=Dawid |editor3-first=P |editor4-last=Smith |editor4-first=A F M}}</ref> Given a sequence of correlated MCMC samples <math>\{X_1, X_2, \dots, X_n\}</math>, the diagnostic splits the chain into an early segment consisting of the first <math>n_A</math> samples, typically chosen as <math>n_A = 0.1n</math> (i.e., the first 10% of the chain), and a late segment consisting of the last <math>n_B</math> samples, typically chosen as <math>n_B = 0.5n</math> (i.e., the last 50% of the chain). Denote the sample means of these segments as:
:<math> \bar{X}_A = \dfrac{1}{n_A}\sum_{i=1}^{n_A} X_i,\;\;\; \bar{X}_B = \dfrac{1}{n_B}\sum_{i=n-n_B+1}^n X_i </math>
Since MCMC samples are autocorrelated, a simple comparison of sample means is insufficient. Instead, the difference in means is standardized using an estimator of the spectral density at zero frequency, which accounts for the long-range dependence in the chain. The test statistic is computed as:
:<math> Z = \dfrac{\bar{X}_A - \bar{X}_B}{\sqrt{\hat{S}(0)/n_A + \hat{S}(0)/n_B}} </math>
where <math>\hat{S}(0)</math> is an estimate of the long-run variance (i.e., the spectral density at frequency zero), commonly obtained using [[Newey–West estimator|Newey-West estimators]] or batch means. Under the null hypothesis of convergence, the statistic <math>Z</math> approximately follows a standard normal distribution, <math>Z \sim \mathcal{N}(0,1)</math>. If <math>|Z| > 1.96</math>, the null hypothesis is rejected at the 5% significance level, suggesting that the chain has not yet reached stationarity.
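A minimal sketch of the Geweke statistic in Python with NumPy is shown below; the batch-means estimator of <math>\hat{S}(0)</math> and all function names are illustrative simplifications, not the estimator used by any specific package.

<syntaxhighlight lang="python">
import numpy as np

def long_run_variance(x):
    """Crude estimate of the spectral density at frequency zero via batch means."""
    x = np.asarray(x, dtype=float)
    batch_size = max(1, int(np.sqrt(len(x))))
    n_batches = len(x) // batch_size
    batch_means = x[:n_batches * batch_size].reshape(n_batches, batch_size).mean(axis=1)
    return batch_size * batch_means.var(ddof=1)

def geweke_z(x, first=0.1, last=0.5):
    """Geweke Z-score comparing the first 10% and last 50% of a chain."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    a = x[:int(first * n)]           # early segment
    b = x[n - int(last * n):]        # late segment
    var_a = long_run_variance(a) / len(a)
    var_b = long_run_variance(b) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(var_a + var_b)

# |Z| > 1.96 suggests the chain has not yet reached stationarity (5% level).
rng = np.random.default_rng(0)
print(geweke_z(rng.normal(size=5000)))
</syntaxhighlight>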
=== Heidelberger-Welch Diagnostics ===
The Heidelberger-Welch diagnostic is grounded in [[Spectral theory|spectral analysis]] and [[Brownian motion|Brownian motion theory]], and is particularly useful in the early stages of simulation for determining an appropriate burn-in and stopping time.<ref>{{Cite journal |last1=Heidelberger |first1=Philip |last2=Welch |first2=Peter D. |date=1981-04-01 |title=A spectral method for confidence interval generation and run length control in simulations |url=https://dl.acm.org/doi/10.1145/358598.358630 |journal=Commun. ACM |volume=24 |issue=4 |pages=233–245 |doi=10.1145/358598.358630 |issn=0001-0782}}</ref><ref>{{Cite journal |last1=Heidelberger |first1=Philip |last2=Welch |first2=Peter D. |date=1983-12-01 |title=Simulation Run Length Control in the Presence of an Initial Transient |url=https://pubsonline.informs.org/doi/10.1287/opre.31.6.1109 |journal=Operations Research |volume=31 |issue=6 |pages=1109–1144 |doi=10.1287/opre.31.6.1109 |issn=0030-364X}}</ref> The diagnostic consists of two components: a '''stationarity test''' that assesses whether the Markov chain has reached a steady state, and a '''half-width test''' that determines whether the estimated expectation is within a user-specified precision.

==== Stationarity Test ====
Let <math>\{X_t\}_{t=1}^n</math> be the output of an MCMC simulation, <math>g</math> a scalar function of interest, and <math>g_1, g_2, \dots, g_n</math> the evaluations of <math>g</math> over the chain. Define the standardized cumulative sum process:
:<math> B_n(t) = \dfrac{\sum_{i=1}^{\text{round}(nt)} g_i - \text{round}(nt)\, \bar{g}_n}{\sqrt{n\hat{S}(0)}},\;\;\; t\in[0,1] </math>
where <math>\bar{g}_n = \frac{1}{n}\sum_{i=1}^n g_i</math> is the sample mean and <math>\hat{S}(0)</math> is an estimate of the spectral density at frequency zero. Under the null hypothesis of convergence, the process <math>B_n(t)</math> converges in distribution to a [[Brownian bridge]]. The following [[Cramér–von Mises criterion|Cramér-von Mises statistic]] is used to test for stationarity:
:<math> C_n = \int_0^1 B_n(t)^2 \, dt. </math>
This statistic is compared against known critical values of the Brownian bridge distribution. If the null hypothesis is rejected, the first 10% of the samples are discarded and the test is repeated on the remaining chain, until either stationarity is accepted or 50% of the chain has been discarded.

==== Half-Width Test (Precision Check) ====
Once stationarity is accepted, the second part of the diagnostic checks whether the Monte Carlo estimator is accurate enough for practical use. Assuming the central limit theorem holds, the confidence interval for the mean <math>\mathbb{E}_\pi[g(X)]</math> is given by
:<math> \bar{g}_n \pm t_{\alpha/2,\nu} \cdot \dfrac{\hat{\sigma}_n}{\sqrt{n}} </math>
where <math>\hat{\sigma}_n^2</math> is an estimate of the variance of <math>g(X)</math>, <math>t_{\alpha/2,\nu}</math> is the [[Student's t-distribution|Student's <math>t</math>]] critical value at confidence level <math>1 - \alpha</math> with <math>\nu</math> degrees of freedom, and <math>n</math> is the number of samples used. The '''half-width''' of this interval is defined as
:<math> t_{\alpha/2,\nu} \cdot \dfrac{\hat{\sigma}_n}{\sqrt{n}} </math>
If the half-width is smaller than a user-defined tolerance (e.g., 0.05), the chain is considered long enough to estimate the expectation reliably; otherwise, the simulation should be extended.
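The half-width check itself reduces to a short computation. The sketch below (Python with NumPy and SciPy) uses a naive i.i.d. standard error for brevity, whereas the actual diagnostic uses a spectral estimate of the variance to account for autocorrelation; names and defaults are illustrative.

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

def half_width_test(g, tol=0.05, alpha=0.05):
    """Check whether the (1 - alpha) confidence interval for the mean of g
    has a half-width below the tolerance ``tol``.

    Note: the standard error below treats the samples as independent; the
    Heidelberger-Welch diagnostic instead uses a spectral (long-run) variance
    estimate to account for autocorrelation in the chain.
    """
    g = np.asarray(g, dtype=float)
    n = len(g)
    se = g.std(ddof=1) / np.sqrt(n)
    half_width = stats.t.ppf(1 - alpha / 2, df=n - 1) * se
    return g.mean(), half_width, half_width < tol

rng = np.random.default_rng(0)
mean, hw, precise_enough = half_width_test(rng.normal(size=10_000))
print(mean, hw, precise_enough)   # extend the simulation if precise_enough is False
</syntaxhighlight>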
=== Raftery-Lewis Diagnostics ===
The Raftery-Lewis diagnostic is specifically designed to assess how many iterations are needed to estimate quantiles or tail probabilities of the target distribution with a desired accuracy and confidence.<ref>{{Cite journal |last1=Raftery |first1=Adrian E. |last2=Lewis |first2=Steven M. |date=1992-11-01 |title=[Practical Markov Chain Monte Carlo]: Comment: One Long Run with Diagnostics: Implementation Strategies for Markov Chain Monte Carlo |url=https://projecteuclid.org/journals/statistical-science/volume-7/issue-4/Practical-Markov-Chain-Monte-Carlo--Comment--One-Long/10.1214/ss/1177011143.full |journal=Statistical Science |volume=7 |issue=4 |doi=10.1214/ss/1177011143 |issn=0883-4237}}</ref> Unlike the Gelman-Rubin or Geweke diagnostics, which assess convergence of the chain to the entire distribution, the Raftery-Lewis diagnostic is goal-oriented: it estimates the number of samples required to estimate a specific quantile of interest within a desired margin of error.

Let <math>q</math> denote the desired quantile (e.g., 0.025) of a real-valued function <math>g(X)</math>; in other words, the goal is to find <math>u</math> such that <math>P(g(X) \leq u) = q</math>. Suppose we wish to estimate this quantile such that the estimate falls within a margin <math>\varepsilon</math> of the true value with probability <math>1 - \alpha</math>. That is, we want
:<math>P(|\hat{q} - q| < \varepsilon) \geq 1 - \alpha</math>
The diagnostic proceeds by converting the output of the MCMC chain into a binary sequence:
:<math> W_n = \mathbb{I}(g(X_n) \leq u), \;\;\; n=1,2,\dots </math>
where <math>\mathbb{I}(\cdot)</math> is the indicator function. The sequence <math>\{W_n\}</math> is treated as a realization from a two-state Markov chain. While this may not be strictly true, it is often a good approximation in practice. From the empirical transitions in the binary sequence, the Raftery-Lewis method estimates:
* The minimum number of iterations <math>n_{\text{min}}</math> required to achieve the desired precision and confidence for estimating the quantile, obtained from asymptotic theory for Bernoulli processes (see the sketch below):
:<math> n_{\text{min}} = \bigg\{\Phi^{-1}\bigg(1-\dfrac{\alpha}{2}\bigg)\bigg\}^2 \dfrac{q(1-q)}{\varepsilon^2} </math>
where <math>\Phi^{-1}(\cdot)</math> is the standard normal quantile function.
* The burn-in period <math>n_{\text{burn}}</math>, calculated via an eigenvalue analysis of the estimated transition matrix, giving the number of initial iterations needed for the Markov chain to forget its initial state.
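The formula for <math>n_{\text{min}}</math> follows directly from the normal approximation to a Bernoulli proportion. A minimal sketch in Python with SciPy is given below; the function name and default arguments (<math>q=0.025</math>, <math>\varepsilon=0.005</math>, 95% confidence) are illustrative, and the full diagnostic additionally fits the two-state Markov chain to obtain the burn-in period and dependence factor.

<syntaxhighlight lang="python">
import math
from scipy import stats

def raftery_lewis_nmin(q=0.025, eps=0.005, alpha=0.05):
    """Minimum number of iterations needed to estimate the q-quantile to
    within +/- eps with probability 1 - alpha, assuming independent samples."""
    z = stats.norm.ppf(1 - alpha / 2)               # standard normal quantile
    return math.ceil(z ** 2 * q * (1 - q) / eps ** 2)

print(raftery_lewis_nmin())   # 3746 with the default arguments above
</syntaxhighlight>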