Stein's lemma
Stein's lemma, named in honor of Charles Stein, is a theorem of probability theory that is of interest primarily because of its applications to statistical inference — in particular, to James–Stein estimation and empirical Bayes methods — and to portfolio choice theory.<ref>Ingersoll, J., ''Theory of Financial Decision Making'', Rowman and Littlefield, 1987, pp. 13–14.</ref> The theorem gives a formula for the covariance of one random variable with the value of a function of another, when the two random variables are jointly normally distributed.
Note that the name "Stein's lemma" is also commonly used<ref>Template:Cite book</ref> to refer to a different result in the area of statistical hypothesis testing, which connects the error exponents in hypothesis testing with the Kullback–Leibler divergence. This result is also known as the Chernoff–Stein lemma<ref>Template:Cite book</ref> and is not related to the lemma discussed in this article.
Statement
Suppose X is a normally distributed random variable with expectation <math>\mu</math> and variance <math>\sigma^2</math>. Further suppose g is a differentiable function for which the two expectations <math>\operatorname{E}(g(X) (X - \mu))</math> and <math>\operatorname{E}(g'(X))</math> both exist. (The existence of the expectation of any random variable is equivalent to the finiteness of the expectation of its absolute value.) Then
- <math>\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\sigma^2 \operatorname{E}\bigl(g'(X)\bigr).</math>
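A minimal numerical sketch of the identity (the test function <math>g(x)=\sin x</math>, the parameters and the sample size are arbitrary illustrative choices, not part of the statement):

<syntaxhighlight lang="python">
import numpy as np

# Monte Carlo check of Stein's lemma: E[g(X)(X - mu)] = sigma^2 * E[g'(X)]
# for X ~ N(mu, sigma^2), with the illustrative choice g(x) = sin(x).
rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

g = np.sin        # g(x) = sin(x)
g_prime = np.cos  # g'(x) = cos(x)

lhs = np.mean(g(x) * (x - mu))
rhs = sigma**2 * np.mean(g_prime(x))
print(lhs, rhs)   # the two estimates agree up to Monte Carlo error
</syntaxhighlight>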
Multidimensional
In general, suppose X and Y are jointly normally distributed. Then
- <math>\operatorname{Cov}(g(X),Y)= \operatorname{Cov}(X,Y)\operatorname{E}(g'(X)).</math>
For a general multivariate Gaussian random vector <math>(X_1, ..., X_n) \sim \mathcal{N}(\mu, \Sigma)</math> it follows that
- <math>\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\Sigma\cdot \operatorname{E}\bigl(\nabla g(X)\bigr).</math>
Similarly, when <math>\mu = 0</math>, <math display="block">\operatorname{E}[\partial_i g(X) ] = \operatorname{E}[g(X) (\Sigma^{-1}X)_i], \quad \operatorname{E}[\partial_i\partial_j g(X) ] = \operatorname{E}[g(X) ((\Sigma^{-1}X)_i(\Sigma^{-1}X)_j - \Sigma^{-1}_{ij})] </math>
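The vector identity <math>\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\Sigma\cdot \operatorname{E}\bigl(\nabla g(X)\bigr)</math> can likewise be checked numerically; in the sketch below the function <math>g(x) = \sin(w^\top x)</math>, the parameters and the sample size are arbitrary illustrative choices:

<syntaxhighlight lang="python">
import numpy as np

# Monte Carlo check of E[g(X)(X - mu)] = Sigma * E[grad g(X)] for X ~ N(mu, Sigma),
# with the illustrative choice g(x) = sin(w . x), whose gradient is cos(w . x) * w.
rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=1_000_000)

w = np.array([1.0, 2.0])
g_vals = np.sin(X @ w)                    # g(X), shape (N,)
grad_vals = np.cos(X @ w)[:, None] * w    # grad g(X), shape (N, 2)

lhs = np.mean(g_vals[:, None] * (X - mu), axis=0)
rhs = Sigma @ np.mean(grad_vals, axis=0)
print(lhs, rhs)                           # agree up to Monte Carlo error
</syntaxhighlight>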
Gradient descent
Stein's lemma can be used to stochastically estimate gradients of expectations over Gaussian perturbations:<math display="block">\nabla \operatorname{E}_{\epsilon \sim \mathcal{N}(0, I)}\bigl(g(x + \Sigma^{1/2}\epsilon)\bigr) = \Sigma^{-1/2} \operatorname{E}_{\epsilon \sim \mathcal{N}(0, I)}\bigl(g(x + \Sigma^{1/2}\epsilon)\epsilon\bigr) \approx \Sigma^{-1/2} \frac{1}{N} \sum_{i = 1}^N g(x + \Sigma^{1/2}\epsilon_i )\epsilon_i</math>where <math>\epsilon_1, \dots, \epsilon_N</math> are IID samples from the standard normal distribution <math>\mathcal N(0, I)</math>. This form has applications in Stein variational gradient descent<ref>Template:Cite arXiv</ref> and Stein variational policy gradient.<ref>Template:Cite arXiv</ref>
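A rough sketch of this estimator in code (the test function <math>g</math>, the point <math>x</math>, the covariance <math>\Sigma</math> and the sample size are illustrative choices; the symmetric square root is used for <math>\Sigma^{1/2}</math>):

<syntaxhighlight lang="python">
import numpy as np

# Stochastic gradient estimate via Stein's lemma:
# grad_x E[g(x + Sigma^{1/2} eps)] ~= Sigma^{-1/2} * (1/N) * sum_i g(x + Sigma^{1/2} eps_i) eps_i.
# Here g(y) = ||y||^2, so the exact gradient of the smoothed objective is 2x.
rng = np.random.default_rng(0)

def g(y):
    return np.sum(y**2, axis=-1)

x = np.array([0.5, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

# symmetric square root of Sigma (so that its inverse is Sigma^{-1/2})
evals, evecs = np.linalg.eigh(Sigma)
S_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
S_half_inv = np.linalg.inv(S_half)

N = 1_000_000
eps = rng.standard_normal((N, 2))
values = g(x + eps @ S_half)   # S_half is symmetric, so each row is x + Sigma^{1/2} eps_i
grad_est = S_half_inv @ np.mean(values[:, None] * eps, axis=0)
print(grad_est, 2 * x)         # agree up to Monte Carlo error
</syntaxhighlight>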
Proof
The probability density function of the normal distribution with expectation 0 and variance 1 is
- <math>\varphi(x)={1 \over \sqrt{2\pi}}e^{-x^2/2}</math>
Since <math>\int x \exp(-x^2/2)\,dx = -\exp(-x^2/2)</math>, integration by parts gives
- <math>\operatorname{E}[g(X)X]
= \frac{1}{\sqrt{2\pi}}\int g(x) x \exp(-x^2/2)\,dx = \frac{1}{\sqrt{2\pi}}\int g'(x) \exp(-x^2/2)\,dx = \operatorname{E}[g'(X)]</math>.
The case of general expectation <math>\mu</math> and variance <math>\sigma^2</math> follows by substitution.
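Explicitly, writing <math>X = \mu + \sigma Z</math> with <math>Z \sim \mathcal{N}(0,1)</math> and applying the unit-variance case to <math>h(z) = g(\mu + \sigma z)</math>, for which <math>h'(z) = \sigma g'(\mu + \sigma z)</math>, gives
- <math>\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\sigma\operatorname{E}\bigl(h(Z)Z\bigr)=\sigma\operatorname{E}\bigl(h'(Z)\bigr)=\sigma^2\operatorname{E}\bigl(g'(\mu+\sigma Z)\bigr)=\sigma^2\operatorname{E}\bigl(g'(X)\bigr).</math>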
Generalizations
Isserlis' theorem is equivalently stated as <math display="block">\operatorname{E}(X_1 f(X_1,\ldots,X_n))=\sum_{i=1}^{n} \operatorname{Cov}(X_1,X_i)\operatorname{E}(\partial_{X_i}f(X_1,\ldots,X_n)),</math> where <math>(X_1,\dots, X_{n})</math> is a zero-mean multivariate normal random vector.
Suppose the distribution of X belongs to an exponential family, that is, X has the density
- <math>f_\eta(x)=\exp(\eta'T(x) - \Psi(\eta))h(x).</math>
Suppose this density has support <math>(a,b)</math>, where <math>a</math> and <math>b</math> may be <math>-\infty</math> and <math>\infty</math> respectively, and that <math> \exp (\eta'T(x))h(x) g(x) \rightarrow 0</math> as <math>x\rightarrow a</math> or <math>x \rightarrow b</math>, where <math>g</math> is any differentiable function such that <math>E|g'(X)|<\infty</math>; when <math>a</math> and <math>b</math> are finite, it suffices that <math> \exp (\eta'T(x))h(x) \rightarrow 0 </math>. Then
- <math>E\left[\left(\frac{h'(X)}{h(X)} + \sum \eta_i T_i'(X)\right)\cdot g(X)\right] = -E[g'(X)]. </math>
The derivation is the same as in the special case, namely, integration by parts.
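As an illustrative check, for <math>X \sim \mathcal{N}(\theta, 1)</math> one may take <math>\eta = \theta</math>, <math>T(x) = x</math>, <math>\Psi(\eta) = \eta^2/2</math> and <math>h(x) = e^{-x^2/2}/\sqrt{2\pi}</math>, so that <math>h'(x)/h(x) = -x</math> and <math>T'(x) = 1</math>; the identity then becomes
- <math>\operatorname{E}\bigl((\theta - X)g(X)\bigr) = -\operatorname{E}\bigl(g'(X)\bigr),</math>
which is the original lemma with <math>\mu = \theta</math> and <math>\sigma^2 = 1</math>.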
If we only know that <math> X </math> has support <math> \mathbb{R} </math>, then it could be the case that <math> E|g(X)| <\infty \text{ and } E|g'(X)| <\infty </math> but <math> \lim_{x\rightarrow \infty} f_\eta(x) g(x) \not= 0</math>. To see this, take <math>g(x)=1 </math> and let <math> f_\eta(x) </math> have infinitely many spikes towards infinity while remaining integrable. One such example can be adapted from <math> f(x) = \begin{cases} 1 & x \in [n, n + 2^{-n}) \text{ for some } n \in \mathbb{N} \\ 0 & \text{otherwise} \end{cases} </math> suitably smoothed so that <math> f</math> is differentiable.
Extensions to elliptically-contoured distributions also exist.<ref> Template:Cite journal</ref><ref> Template:Cite journal</ref><ref> Template:Cite journal</ref>
See also
References
<references/>