Stein's lemma
Stein's lemma, named in honor of Charles Stein, is a theorem of probability theory that is of interest primarily because of its applications to statistical inference — in particular, to James–Stein estimation and empirical Bayes methods — and to portfolio choice theory.<ref>Ingersoll, J., ''Theory of Financial Decision Making'', Rowman and Littlefield, 1987, pp. 13–14.</ref> The theorem gives a formula for the covariance of one random variable with the value of a function of another, when the two random variables are jointly normally distributed.
Note that the name "Stein's lemma" is also commonly used<ref>Template:Cite book</ref> to refer to a different result in the area of statistical hypothesis testing, which connects the error exponents in hypothesis testing with the Kullback–Leibler divergence. This result is also known as the Chernoff–Stein lemma<ref>Template:Cite book</ref> and is not related to the lemma discussed in this article.
Statement
Suppose X is a normally distributed random variable with expectation <math>\mu</math> and variance <math>\sigma^2</math>. Further suppose g is a differentiable function for which the two expectations <math>\operatorname{E}(g(X) (X - \mu))</math> and <math>\operatorname{E}(g'(X))</math> both exist. (The existence of the expectation of any random variable is equivalent to the finiteness of the expectation of its absolute value.) Then
- <math>\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\sigma^2 \operatorname{E}\bigl(g'(X)\bigr).</math>
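A minimal numerical sketch of the identity (the test function <math>g(x)=\sin x</math>, the parameters and the sample size are arbitrary illustrative choices, not part of the statement):

<syntaxhighlight lang="python">
import numpy as np

# Monte Carlo check of Stein's lemma: E[g(X)(X - mu)] = sigma^2 * E[g'(X)]
# for X ~ N(mu, sigma^2), with the illustrative choice g(x) = sin(x).
rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

g = np.sin        # g(x) = sin(x)
g_prime = np.cos  # g'(x) = cos(x)

lhs = np.mean(g(x) * (x - mu))
rhs = sigma**2 * np.mean(g_prime(x))
print(lhs, rhs)   # the two estimates agree up to Monte Carlo error
</syntaxhighlight>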
Multidimensional
In general, suppose X and Y are jointly normally distributed. Then
- <math>\operatorname{Cov}(g(X),Y)= \operatorname{Cov}(X,Y)\operatorname{E}(g'(X)).</math>
For a general multivariate Gaussian random vector <math>(X_1, ..., X_n) \sim \mathcal{N}(\mu, \Sigma)</math> it follows that
- <math>\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\Sigma\cdot \operatorname{E}\bigl(\nabla g(X)\bigr).</math>
Similarly, when <math>\mu = 0</math>, <math display="block">\operatorname{E}[\partial_i g(X) ] = \operatorname{E}[g(X) (\Sigma^{-1}X)_i], \quad \operatorname{E}[\partial_i\partial_j g(X) ] = \operatorname{E}[g(X) ((\Sigma^{-1}X)_i(\Sigma^{-1}X)_j - \Sigma^{-1}_{ij})] </math>
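The vector identity <math>\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\Sigma\cdot \operatorname{E}\bigl(\nabla g(X)\bigr)</math> can likewise be checked numerically; in the sketch below the function <math>g(x) = \sin(w^\top x)</math>, the parameters and the sample size are arbitrary illustrative choices:

<syntaxhighlight lang="python">
import numpy as np

# Monte Carlo check of E[g(X)(X - mu)] = Sigma * E[grad g(X)] for X ~ N(mu, Sigma),
# with the illustrative choice g(x) = sin(w . x), whose gradient is cos(w . x) * w.
rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=1_000_000)

w = np.array([1.0, 2.0])
g_vals = np.sin(X @ w)                    # g(X), shape (N,)
grad_vals = np.cos(X @ w)[:, None] * w    # grad g(X), shape (N, 2)

lhs = np.mean(g_vals[:, None] * (X - mu), axis=0)
rhs = Sigma @ np.mean(grad_vals, axis=0)
print(lhs, rhs)                           # agree up to Monte Carlo error
</syntaxhighlight>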
Gradient descent
Stein's lemma can be used to stochastically estimate gradients of expectations over Gaussian perturbations:<math display="block">\nabla \operatorname{E}_{\epsilon \sim \mathcal{N}(0, I)}\bigl(g(x + \Sigma^{1/2}\epsilon)\bigr) = \Sigma^{-1/2} \operatorname{E}_{\epsilon \sim \mathcal{N}(0, I)}\bigl(g(x + \Sigma^{1/2}\epsilon)\epsilon\bigr) \approx \Sigma^{-1/2} \frac{1}{N} \sum_{i = 1}^N g(x + \Sigma^{1/2}\epsilon_i )\epsilon_i</math>where <math>\epsilon_1, \dots, \epsilon_N</math> are IID samples from the standard normal distribution <math>\mathcal N(0, I)</math>. This form has applications in Stein variational gradient descent<ref>Template:Cite arXiv</ref> and Stein variational policy gradient.<ref>Template:Cite arXiv</ref>
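A rough sketch of this estimator in code (the test function <math>g</math>, the point <math>x</math>, the covariance <math>\Sigma</math> and the sample size are illustrative choices; the symmetric square root is used for <math>\Sigma^{1/2}</math>):

<syntaxhighlight lang="python">
import numpy as np

# Stochastic gradient estimate via Stein's lemma:
# grad_x E[g(x + Sigma^{1/2} eps)] ~= Sigma^{-1/2} * (1/N) * sum_i g(x + Sigma^{1/2} eps_i) eps_i.
# Here g(y) = ||y||^2, so the exact gradient of the smoothed objective is 2x.
rng = np.random.default_rng(0)

def g(y):
    return np.sum(y**2, axis=-1)

x = np.array([0.5, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

# symmetric square root of Sigma (so that its inverse is Sigma^{-1/2})
evals, evecs = np.linalg.eigh(Sigma)
S_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
S_half_inv = np.linalg.inv(S_half)

N = 1_000_000
eps = rng.standard_normal((N, 2))
values = g(x + eps @ S_half)   # S_half is symmetric, so each row is x + Sigma^{1/2} eps_i
grad_est = S_half_inv @ np.mean(values[:, None] * eps, axis=0)
print(grad_est, 2 * x)         # agree up to Monte Carlo error
</syntaxhighlight>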
Proof
The probability density function of the normal distribution with expectation 0 and variance 1 is
- <math>\varphi(x)={1 \over \sqrt{2\pi}}e^{-x^2/2}</math>
Since <math>\int x \exp(-x^2/2)\,dx = -\exp(-x^2/2)</math>, integration by parts gives
- <math>\operatorname{E}[g(X)X]
= \frac{1}{\sqrt{2\pi}}\int g(x) x \exp(-x^2/2)\,dx = \frac{1}{\sqrt{2\pi}}\int g'(x) \exp(-x^2/2)\,dx = \operatorname{E}[g'(X)]</math>.
The case of general expectation <math>\mu</math> and variance <math>\sigma^2</math> follows by substitution.
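Explicitly, writing <math>X = \mu + \sigma Z</math> with <math>Z \sim \mathcal{N}(0,1)</math> and applying the unit-variance case to <math>h(z) = g(\mu + \sigma z)</math>, for which <math>h'(z) = \sigma g'(\mu + \sigma z)</math>, gives
- <math>\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\sigma\operatorname{E}\bigl(h(Z)Z\bigr)=\sigma\operatorname{E}\bigl(h'(Z)\bigr)=\sigma^2\operatorname{E}\bigl(g'(\mu+\sigma Z)\bigr)=\sigma^2\operatorname{E}\bigl(g'(X)\bigr).</math>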
Generalizations
Isserlis' theorem is equivalently stated as <math display="block">\operatorname{E}(X_1 f(X_1,\ldots,X_n))=\sum_{i=1}^{n} \operatorname{Cov}(X_1,X_i)\operatorname{E}(\partial_{X_i}f(X_1,\ldots,X_n)),</math> where <math>(X_1,\dots, X_{n})</math> is a zero-mean multivariate normal random vector.
Suppose the distribution of X belongs to an exponential family, that is, X has the density
- <math>f_\eta(x)=\exp(\eta'T(x) - \Psi(\eta))h(x).</math>
Suppose this density has support <math>(a,b)</math>, where <math>a</math> and <math>b</math> may be <math>-\infty</math> and <math>\infty</math> respectively, and that <math> \exp (\eta'T(x))h(x) g(x) \rightarrow 0</math> as <math>x\rightarrow a</math> or <math>x \rightarrow b</math>, where <math>g</math> is any differentiable function such that <math>E|g'(X)|<\infty</math>; when <math>a</math> and <math>b</math> are finite, it suffices that <math> \exp (\eta'T(x))h(x) \rightarrow 0 </math>. Then
- <math>E\left[\left(\frac{h'(X)}{h(X)} + \sum \eta_i T_i'(X)\right)\cdot g(X)\right] = -E[g'(X)]. </math>
The derivation is the same as in the special case, namely, integration by parts.
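As an illustrative check, for <math>X \sim \mathcal{N}(\theta, 1)</math> one may take <math>\eta = \theta</math>, <math>T(x) = x</math>, <math>\Psi(\eta) = \eta^2/2</math> and <math>h(x) = e^{-x^2/2}/\sqrt{2\pi}</math>, so that <math>h'(x)/h(x) = -x</math> and <math>T'(x) = 1</math>; the identity then becomes
- <math>\operatorname{E}\bigl((\theta - X)g(X)\bigr) = -\operatorname{E}\bigl(g'(X)\bigr),</math>
which is the original lemma with <math>\mu = \theta</math> and <math>\sigma^2 = 1</math>.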
If we only know that <math> X </math> has support <math> \mathbb{R} </math>, then it could be the case that <math> E|g(X)| <\infty \text{ and } E|g'(X)| <\infty </math> but <math> \lim_{x\rightarrow \infty} f_\eta(x) g(x) \not= 0</math>. To see this, take <math>g(x)=1 </math> and let <math> f_\eta(x) </math> have infinitely many spikes towards infinity while remaining integrable. One such example can be adapted from <math> f(x) = \begin{cases} 1 & x \in [n, n + 2^{-n}) \text{ for some } n \in \mathbb{N} \\ 0 & \text{otherwise} \end{cases} </math> suitably smoothed so that <math> f</math> is differentiable.
Extensions to elliptically-contoured distributions also exist.<ref> Template:Cite journal</ref><ref> Template:Cite journal</ref><ref> Template:Cite journal</ref>
See also
References
<references/>