==Point estimation==

===Robbins' method: non-parametric empirical Bayes (NPEB)===

[[Herbert Robbins|Robbins]]<ref name=Robbins/> considered a case of sampling from a [[mixture distribution|mixed distribution]], where the probability for each <math>y_i</math> (conditional on <math>\theta_i</math>) is specified by a [[Poisson distribution]],

:<math>p(y_i\mid\theta_i)={{\theta_i}^{y_i} e^{-\theta_i} \over {y_i}!},</math>

while the prior on ''θ'' is unspecified except that it is also [[i.i.d.]] from an unknown distribution, with [[cumulative distribution function]] <math>G(\theta)</math>. Compound sampling arises in a variety of statistical estimation problems, such as accident rates and clinical trials.{{Citation needed|date=February 2012}} We simply seek a point prediction of <math>\theta_i</math> given all the observed data. Because the prior is unspecified, we seek to do this without knowledge of ''G''.<ref name=CL/>

Under [[squared error loss]] (SEL), the [[conditional expectation]] E(''θ''<sub>''i''</sub> | ''Y''<sub>''i''</sub> = ''y''<sub>''i''</sub>) is a reasonable quantity to use for prediction. For the Poisson compound sampling model, this quantity is

:<math>\operatorname{E}(\theta_i\mid y_i) = {\int (\theta^{y_i+1} e^{-\theta} / {y_i}!)\,dG(\theta) \over {\int (\theta^{y_i} e^{-\theta} / {y_i}!)\,dG(\theta)} }.</math>

This can be simplified by multiplying and dividing the numerator by <math>(y_i+1)</math>, yielding

:<math> \operatorname{E}(\theta_i\mid y_i)= {{(y_i + 1) p_G(y_i + 1) }\over {p_G(y_i)}},</math>

where ''p<sub>G</sub>'' is the marginal probability mass function obtained by integrating out ''θ'' over ''G''.

To take advantage of this, Robbins<ref name=Robbins/> suggested estimating the marginals with their empirical frequencies (<math> \#\{Y_j\}</math>), yielding the fully non-parametric estimate

:<math> \operatorname{E}(\theta_i\mid y_i) \approx (y_i + 1) { {\#\{Y_j = y_i + 1\}} \over {\#\{ Y_j = y_i\}} },</math>

where <math>\#</math> denotes "number of". (See also [[Good–Turing frequency estimation]].)

;Example – Accident rates

Suppose each customer of an insurance company has an "accident rate" Θ and is insured against accidents; the probability distribution of Θ is the underlying distribution, and is unknown. The number of accidents suffered by each customer in a specified time period has a [[Poisson distribution]] with expected value equal to the particular customer's accident rate. The actual number of accidents experienced by a customer is the observable quantity. A crude way to estimate the underlying probability distribution of the accident rate Θ is to estimate the proportion of members of the whole population suffering 0, 1, 2, 3, ... accidents during the specified time period as the corresponding proportion in the observed random sample. Having done so, it is then desired to predict the accident rate of each customer in the sample. As above, one may use the [[conditional probability|conditional]] [[expected value]] of the accident rate Θ given the observed number of accidents during the baseline period. Thus, if a customer suffers six accidents during the baseline period, that customer's estimated accident rate is 7 × [the proportion of the sample who suffered 7 accidents] / [the proportion of the sample who suffered 6 accidents].
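The accident-rate calculation can be carried out directly from the observed frequency counts. The following is a minimal sketch (not part of the article); the simulated prior and the variable names are illustrative assumptions, and the estimator itself never uses knowledge of that prior.

<syntaxhighlight lang="python">
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Simulated compound sampling (illustrative values): each customer has a hidden
# accident rate theta_i drawn from an unknown prior G (here a gamma distribution,
# but the estimator below never uses that fact), and we observe Poisson counts.
theta = rng.gamma(shape=2.0, scale=0.5, size=100_000)   # hidden accident rates
y = rng.poisson(theta)                                   # observed accident counts

counts = Counter(y.tolist())   # empirical frequencies #{Y_j = k}

def robbins_estimate(k: int) -> float:
    """Robbins' non-parametric estimate of E[theta | Y = k]."""
    if counts[k] == 0:
        raise ValueError("no observations with this count")
    return (k + 1) * counts[k + 1] / counts[k]

for k in range(5):
    print(k, round(robbins_estimate(k), 3))
# Each estimate is pulled toward the overall mean accident rate (shrinkage).
</syntaxhighlight>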
Note that if the proportion of people suffering ''k'' accidents is a decreasing function of ''k'', the customer's predicted accident rate will often be lower than their observed number of accidents. This [[Shrinkage (statistics)|shrinkage]] effect is typical of empirical Bayes analyses.

=== Gaussian ===

Suppose <math>X, Y</math> are random variables such that <math>Y</math> is observed but <math>X</math> is hidden. The problem is to find the expectation of <math>X</math> conditional on <math>Y</math>. Suppose further that <math>Y\mid X \sim \mathcal N(X, \Sigma)</math>; that is, <math>Y = X + Z</math>, where <math>Z</math> is a [[Multivariate normal distribution|multivariate Gaussian]] with covariance matrix <math>\Sigma</math>. Then, by direct calculation with the probability density function of the multivariate Gaussian, we have the formula

:<math>\Sigma \nabla_y \rho(y\mid x) = \rho(y\mid x) (x-y).</math>

Integrating over <math>\rho(x)\,dx</math>, we obtain

:<math>\Sigma \nabla_y \rho(y) = (\operatorname{E}[x\mid y] - y) \rho(y) \implies \operatorname{E}[x\mid y] = y + \Sigma \nabla_y \ln \rho(y).</math>

In particular, this means that one can perform Bayesian estimation of <math>X</math> without access to either the prior density of <math>X</math> or the posterior density of <math>X</math> given <math>Y</math>. The only requirement is access to the [[Score (statistics)|score function]] of <math>Y</math>. This has applications in [[Diffusion model#Score-based generative model|score-based generative modeling]].<ref>{{Cite journal |last1=Saremi |first1=Saeed |last2=Hyvärinen |first2=Aapo |date=2019 |title=Neural Empirical Bayes |url=https://www.jmlr.org/papers/v20/19-216.html |journal=Journal of Machine Learning Research |volume=20 |issue=181 |pages=1–23 |issn=1533-7928}}</ref>

===Parametric empirical Bayes===

If the likelihood and its prior take on simple parametric forms (such as 1- or 2-dimensional likelihood functions with simple [[conjugate prior]]s), then the empirical Bayes problem is only to estimate the marginal <math>m(y\mid\eta)</math> and the hyperparameters <math>\eta</math> using the complete set of empirical measurements. For example, one common approach, called parametric empirical Bayes point estimation, is to approximate the marginal using the [[maximum likelihood estimate]] (MLE), or a [[Moment (mathematics)|moments]] expansion, which allows one to express the hyperparameters <math>\eta</math> in terms of the empirical mean and variance. This simplified marginal allows one to plug the empirical averages into a point estimate for the prior <math>\theta</math>. The resulting equation for the prior <math>\theta</math> is greatly simplified, as shown below.

There are several common parametric empirical Bayes models, including the [[Poisson–gamma model]] (below), the [[Beta-binomial model]], the [[Gaussian–Gaussian model]], the [[Dirichlet-multinomial distribution|Dirichlet-multinomial model]], as well as specific models for [[Bayesian linear regression]] (see below) and [[Bayesian multivariate linear regression]]. More advanced approaches include [[hierarchical Bayes model]]s and [[Bayesian mixture model]]s.

====Gaussian–Gaussian model====

For an example of empirical Bayes estimation using a Gaussian–Gaussian model, see [[Bayes_estimator#Empirical_Bayes_estimators|Empirical Bayes estimators]].
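The following minimal sketch (not part of the article) ties the two preceding subsections together: in the one-dimensional Gaussian–Gaussian case the marginal of <math>Y</math> is itself Gaussian, so its score function can be estimated from the observed data alone, and the identity <math>\operatorname{E}[x\mid y] = y + \Sigma \nabla_y \ln \rho(y)</math> reduces to a shrinkage estimator. The simulated parameter values and variable names are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Simulated one-dimensional Gaussian-Gaussian compound sampling (illustrative values):
# hidden X_i ~ N(mu0, tau^2); observed Y_i = X_i + Z_i with Z_i ~ N(0, sigma^2).
mu0, tau, sigma, n = 2.0, 1.5, 1.0, 100_000
x = rng.normal(mu0, tau, size=n)
y = x + rng.normal(0.0, sigma, size=n)

# The marginal of Y is N(mu0, tau^2 + sigma^2), so its score is
#   (d/dy) ln rho(y) = -(y - mu0) / (tau^2 + sigma^2);
# both quantities are estimated from the observed Y alone.
mu_hat = y.mean()
s2_hat = y.var()                      # estimates tau^2 + sigma^2
score = -(y - mu_hat) / s2_hat        # estimated score function of Y

# Score-function identity from the Gaussian section: E[X | Y = y] = y + sigma^2 * score(y).
# Only the noise variance sigma^2 is assumed known here.
x_hat = y + sigma**2 * score

print("MSE of raw observations   :", np.mean((y - x) ** 2))
print("MSE of empirical Bayes fit:", np.mean((x_hat - x) ** 2))
</syntaxhighlight>

The prior density is never used directly; it enters only through the empirical mean and variance of the observed <math>Y</math>.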
====Poisson–gamma model====

Continuing the example above, let the likelihood be a [[Poisson distribution]], and let the prior now be specified by the [[conjugate prior]], which is a [[gamma distribution]] (<math>G(\alpha,\beta)</math>), where <math>\eta = (\alpha,\beta)</math>:

:<math> \rho(\theta\mid\alpha,\beta) \, d\theta = \frac{(\theta/\beta)^{\alpha-1} \, e^{-\theta / \beta} }{\Gamma(\alpha)} \, (d\theta/\beta) \text{ for } \theta > 0, \alpha > 0, \beta > 0 \,\! .</math>

It is straightforward to show the [[Posterior probability|posterior]] is also a gamma distribution. Write

:<math> \rho(\theta\mid y) \propto \rho(y\mid \theta) \rho(\theta\mid\alpha, \beta) ,</math>

where the marginal distribution has been omitted since it does not depend explicitly on <math>\theta</math>. Expanding terms which do depend on <math>\theta</math> gives the posterior as:

:<math> \rho(\theta\mid y) \propto (\theta^y\, e^{-\theta}) (\theta^{\alpha-1}\, e^{-\theta / \beta}) = \theta^{y+ \alpha -1}\, e^{- \theta (1+1 / \beta)} . </math>

So the posterior density is also a [[gamma distribution]] <math>G(\alpha',\beta')</math>, where <math>\alpha' = y + \alpha</math> and <math>\beta' = (1+1 / \beta)^{-1}</math>. Notice also that the marginal is obtained by integrating the joint density <math>\rho(y\mid\theta)\,\rho(\theta\mid\alpha,\beta)</math> over <math>\theta</math>, which turns out to be a [[negative binomial distribution]].

To apply empirical Bayes, we approximate the marginal using the [[maximum likelihood]] estimate (MLE) or a moments expansion. The point estimate we need is then the mean of the posterior, <math>\operatorname{E}(\theta\mid y)</math>. Recalling that the mean <math>\mu</math> of a gamma distribution <math>G(\alpha', \beta')</math> is simply <math>\alpha' \beta'</math>, we have

:<math> \operatorname{E}(\theta\mid y) = \alpha' \beta' = \frac{\bar{y}+\alpha}{1+1 / \beta} = \frac{\beta}{1+\beta}\bar{y} + \frac{1}{1+\beta} (\alpha \beta). </math>

To obtain the values of <math>\alpha</math> and <math>\beta</math>, empirical Bayes prescribes estimating the prior mean <math>\alpha\beta</math> and prior variance <math>\alpha\beta^2</math> using the complete set of empirical data.

The resulting point estimate <math> \operatorname{E}(\theta\mid y) </math> is therefore like a weighted average of the sample mean <math>\bar{y}</math> and the prior mean <math>\mu = \alpha\beta</math>. This turns out to be a general feature of empirical Bayes: the point estimates for the prior (i.e. the mean) look like weighted averages of the sample estimate and the prior estimate (likewise for estimates of the variance).
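A minimal sketch (not part of the article) of the moment-matching recipe just described, applied per unit with a single observed count; the simulated hyperparameter values and variable names are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Simulated Poisson-gamma compound sampling (illustrative values).
alpha_true, beta_true = 2.0, 1.5
theta = rng.gamma(alpha_true, beta_true, size=50_000)   # hidden rates
y = rng.poisson(theta)                                   # observed counts

# Simple moment matching as prescribed above: mean ~ alpha*beta, variance ~ alpha*beta^2.
# (A more careful fit would match the negative-binomial marginal variance alpha*beta*(1+beta).)
m, v = y.mean(), y.var()
beta_hat = v / m
alpha_hat = m / beta_hat

# Posterior-mean point estimate for each unit:
# E[theta | y] = (y + alpha) * beta / (1 + beta), a weighted average of y and alpha*beta.
theta_hat = (y + alpha_hat) * beta_hat / (1.0 + beta_hat)

print("estimated prior mean    :", alpha_hat * beta_hat)
print("MSE of raw counts       :", np.mean((y - theta) ** 2))
print("MSE of empirical Bayes  :", np.mean((theta_hat - theta) ** 2))
</syntaxhighlight>

The printed mean-squared errors illustrate the gain from shrinking each raw count toward the estimated prior mean.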