== Gibbs sampler in Bayesian inference and its relation to information theory ==

Let <math>y</math> denote observations generated from the sampling distribution <math>f(y|\theta)</math> and <math>\pi(\theta)</math> be a prior supported on the parameter space <math>\Theta</math>. Then one of the central goals of [[Bayesian statistics]] is to approximate the posterior density

:<math>\pi(\theta|y) = \frac{f(y|\theta) \cdot \pi(\theta)}{m(y)},</math>

where the marginal likelihood <math> m(y) = \int_{\Theta} f(y|\theta) \cdot \pi(\theta) d\theta </math> is assumed to be finite for all <math>y</math>.

To explain the Gibbs sampler, we additionally assume that the parameter space <math>\Theta</math> is decomposed as

:<math>\Theta = \prod_{i=1}^{K}\Theta_{i} = \Theta_1 \times \cdots \times \Theta_i \times \cdots \times \Theta_K, \quad\quad (K>1),</math>

where <math>\times</math> represents the [[Cartesian product]]. Each component parameter space <math>\Theta_i</math> can be a set of scalar components, subvectors, or matrices. Write <math>\theta_{-i} \in \Theta_{-i}</math> for the collection of all components of <math>\theta</math> other than <math>\theta_i</math>. The essential ingredients of the Gibbs sampler are the <math>i</math>-th full conditional posterior distributions for each <math>i=1,\cdots, K</math>:

:<math>\pi(\theta_i|\theta_{-i},y)=\pi(\theta_i|\theta_1, \cdots, \theta_{i-1},\theta_{i+1},\cdots, \theta_{K},y)</math>.

[[File:Gibbs sampler picture.jpg|thumb|500px|A pictorial description of the Gibbs sampling algorithm<ref name="Lee2008">{{Cite journal |last=Lee |first=Se Yoon |title=Gibbs sampler and coordinate ascent variational inference: A set-theoretical review |journal=Communications in Statistics - Theory and Methods |year=2021 |volume=51 |issue=6 |pages=1549–1568 |doi=10.1080/03610926.2021.1921214 |arxiv=2008.01006 |s2cid=220935477}}</ref>]]
[[File:Gibbs sampling info eq.jpg|thumb|500px|Schematic description of the information equality associated with the Gibbs sampler at the i-th step within a cycle<ref name="Lee2008" />]]

The following algorithm details a generic Gibbs sampler:

:<math>\text{Initialize: pick an arbitrary starting value}\,\, \theta^{(1)} = (\theta_1^{(1)},\theta_2^{(1)},\cdots,\theta_i^{(1)},\theta_{i+1}^{(1)},\cdots,\theta_K^{(1)}) </math>
:<math>\text{Iterate a Cycle:}\,</math>
:<math> \quad\quad \text{Step 1. draw}\,\, \theta_1^{(s+1)}\sim \pi(\theta_1|\theta_2^{(s)},\theta_3^{(s)},\cdots,\theta_K^{(s)},y )</math>
:<math> \quad\quad \text{Step 2. draw}\,\, \theta_2^{(s+1)}\sim \pi(\theta_2|\theta_1^{(s+1)},\theta_3^{(s)},\cdots,\theta_K^{(s)},y )</math>
:<math> \quad\quad\quad \vdots</math>
:<math> \quad\quad \text{Step i. draw}\,\, \theta_i^{(s+1)}\sim \pi(\theta_i|\theta_1^{(s+1)},\theta_2^{(s+1)},\cdots, \theta_{i-1}^{(s+1)},\theta_{i+1}^{(s)} ,\cdots, \theta_K^{(s)},y )</math>
:<math> \quad\quad \text{Step i+1. draw}\,\, \theta_{i+1}^{(s+1)}\sim \pi(\theta_{i+1}|\theta_1^{(s+1)},\theta_2^{(s+1)},\cdots, \theta_{i}^{(s+1)},\theta_{i+2}^{(s)} ,\cdots, \theta_K^{(s)},y )</math>
:<math> \quad\quad\quad \vdots</math>
:<math> \quad\quad \text{Step K. draw}\,\, \theta_K^{(s+1)}\sim \pi(\theta_K|\theta_1^{(s+1)},\theta_2^{(s+1)},\cdots,\theta_{K-1}^{(s+1)},y )</math>
:<math>\text{end Iterate}</math>

Note that the Gibbs sampler operates by an iterative Monte Carlo scheme within each cycle. The <math>S</math> samples <math>\{\theta^{(s)} \}_{s=1}^{S}</math> drawn by the above algorithm form a Markov chain whose invariant distribution is the target density <math>\pi(\theta|y)</math>.
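As a minimal illustrative sketch (not taken from the cited reference; the function name and parameter values are hypothetical), the cycle can be written out for the simple case <math>K=2</math> in which the posterior is a zero-mean bivariate normal with unit variances and correlation <math>\rho</math>, so that both full conditionals are univariate normal, <math>\pi(\theta_1|\theta_2,y)=N(\rho\theta_2,\,1-\rho^2)</math> and <math>\pi(\theta_2|\theta_1,y)=N(\rho\theta_1,\,1-\rho^2)</math>:

<syntaxhighlight lang="python">
import numpy as np

def gibbs_bivariate_normal(rho, S=5000, seed=0):
    """Gibbs cycle for a zero-mean, unit-variance bivariate normal posterior
    with correlation rho (illustrative example with K = 2)."""
    rng = np.random.default_rng(seed)
    theta1, theta2 = 0.0, 0.0            # Initialize: arbitrary starting value
    cond_sd = np.sqrt(1.0 - rho ** 2)    # standard deviation of each full conditional
    samples = np.empty((S, 2))
    for s in range(S):
        # Step 1. draw theta1 ~ pi(theta1 | theta2, y) = N(rho * theta2, 1 - rho^2)
        theta1 = rng.normal(rho * theta2, cond_sd)
        # Step 2. draw theta2 ~ pi(theta2 | theta1, y) = N(rho * theta1, 1 - rho^2)
        theta2 = rng.normal(rho * theta1, cond_sd)
        samples[s] = (theta1, theta2)
    return samples

draws = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(draws.T))              # empirical correlation approaches 0.8
</syntaxhighlight>

After discarding an initial burn-in, the empirical distribution of the draws approximates the target bivariate normal, consistent with the invariance property stated above.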
Now, for each <math>i=1,\cdots,K</math>, define the following information-theoretic quantities:

:<math>I(\theta_i ; \theta_{-i}) = \text{KL}(\pi(\theta|y)||\pi(\theta_i|y) \cdot \pi(\theta_{-i}|y) ) =\int_{\Theta} \pi(\theta|y) \log \bigg(\frac{\pi(\theta|y)}{\pi(\theta_i|y) \cdot \pi(\theta_{-i}|y)}\bigg) d\theta, </math>
:<math>H(\theta_{-i}) = -\int_{\Theta_{-i}} \pi(\theta_{-i}|y) \log \pi(\theta_{-i}|y) d\theta_{-i}, </math>
:<math>H(\theta_{-i}|\theta_{i}) = -\int_{\Theta} \pi(\theta|y) \log \pi(\theta_{-i}|\theta_{i},y) d\theta, </math>

namely, the posterior mutual information, the posterior differential entropy, and the posterior conditional differential entropy, respectively. The quantities <math>I(\theta_{-i} ; \theta_{i})</math>, <math>H(\theta_{i})</math>, and <math>H(\theta_{i}|\theta_{-i})</math> are defined analogously by interchanging <math>i</math> and <math>-i</math>. Then, the following <math>K</math> equations hold:<ref name="Lee2008" />

:<math>I(\theta_i ; \theta_{-i}) = H(\theta_{-i}) - H(\theta_{-i}|\theta_{i}) = H(\theta_{i}) - H(\theta_{i}|\theta_{-i}) = I(\theta_{-i} ; \theta_{i}), \quad (i=1,\cdots,K). </math>

The mutual information <math>I(\theta_i ; \theta_{-i}) </math> quantifies the reduction in uncertainty about the random quantity <math>\theta_{i}</math> once <math>\theta_{-i}</math> is known, a posteriori. It vanishes if and only if <math>\theta_{i}</math> and <math>\theta_{-i}</math> are marginally independent, a posteriori. The mutual information <math>I(\theta_i ; \theta_{-i})</math> can be interpreted as the quantity that is transmitted from the <math>i</math>-th step to the <math>(i+1)</math>-th step within a single cycle of the Gibbs sampler.
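As a worked example (a standard closed-form case, not drawn from the cited reference), consider again the hypothetical bivariate normal posterior with unit variances and correlation <math>\rho</math>, so that <math>K=2</math> and <math>\theta_{-1}=\theta_2</math>. The quantities above are then

:<math>H(\theta_{2}) = \tfrac{1}{2}\log(2\pi e), \qquad H(\theta_{2}|\theta_{1}) = \tfrac{1}{2}\log\big(2\pi e \,(1-\rho^2)\big),</math>
:<math>I(\theta_1 ; \theta_2) = H(\theta_{2}) - H(\theta_{2}|\theta_{1}) = -\tfrac{1}{2}\log(1-\rho^2),</math>

which verifies the information equality directly: the mutual information vanishes when <math>\rho = 0</math>, i.e., when <math>\theta_1</math> and <math>\theta_2</math> are independent a posteriori, and grows without bound as <math>|\rho| \to 1</math>.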