Rejection sampling
==Theory==

The rejection sampling method generates sampling values from a target distribution <math>X</math> with an arbitrary [[probability density function]] <math>f(x)</math> by using a proposal distribution <math>Y</math> with probability density <math>g(x)</math>. The idea is that one can generate a sample value from <math>X</math> by instead sampling from <math>Y</math> and accepting the sample from <math>Y</math> with probability <math>f(x)/(M g(x))</math>, repeating the draws from <math>Y</math> until a value is accepted. Here <math>M</math> is a finite constant bounding the likelihood ratio <math>f(x)/g(x)</math> over the [[Support (mathematics)|support]] of <math>X</math>; in other words, <math>M</math> must satisfy <math>f(x) \leq M g(x)</math> for all values of <math>x</math>. Note that this requires that the support of <math>Y</math> include the support of <math>X</math>; in other words, <math>g(x) > 0</math> whenever <math>f(x) > 0</math>.

The validity of this method follows from the envelope principle: when simulating the pair <math display="inline">(x, v = u\cdot Mg(x))</math>, one produces a uniform simulation over the subgraph of <math display="inline">Mg(x)</math>. Accepting only pairs such that <math display="inline">u < f(x)/(Mg(x))</math> then produces pairs <math>(x,v)</math> uniformly distributed over the subgraph of <math>f(x)</math> and thus, marginally, a simulation from <math>f(x)</math>. This means that, with enough replicates, the algorithm generates a sample from the desired distribution <math>f(x)</math>.

This method relates to the general field of [[Monte Carlo method|Monte Carlo]] techniques, including [[Markov chain Monte Carlo]] algorithms that also use a proxy distribution to achieve simulation from the target distribution <math>f(x)</math>; it forms the basis for a number of extensions, such as the [[Metropolis algorithm]].
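As an illustration (the target and proposal below are not from the text), the accept/reject loop can be sketched in Python for a hypothetical target density <math>f(x) = 6x(1-x)</math> on <math>[0,1]</math> (a Beta(2,2) density) with a uniform proposal <math>g(x) = 1</math>, for which <math>M = 1.5</math> is the tight bound since <math>f</math> peaks at <math>f(1/2) = 1.5</math>:

```python
import random

def target_pdf(x):
    # Illustrative target: f(x) = 6x(1-x), a Beta(2,2) density on [0, 1].
    return 6.0 * x * (1.0 - x)

def rejection_sample(f, M, n):
    # Proposal Y ~ Uniform(0,1), so g(x) = 1 on [0, 1].
    samples = []
    while len(samples) < n:
        x = random.random()      # draw from the proposal g
        u = random.random()      # U ~ Uniform(0,1)
        if u < f(x) / M:         # accept with probability f(x) / (M * g(x))
            samples.append(x)
    return samples

random.seed(0)
samples = rejection_sample(target_pdf, M=1.5, n=10_000)
mean = sum(samples) / len(samples)  # Beta(2,2) has mean 1/2
```

Because the proposal is uniform, the acceptance test simplifies to <math>u < f(x)/M</math>; with <math>M = 1.5</math> the loop accepts about two thirds of its draws.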
The unconditional acceptance probability is the proportion of proposed samples which are accepted, which is <math display="block"> \begin{align} \mathbb{P}\left(U\le\frac{f(Y)}{M g(Y)}\right) &= \operatorname{E}\mathbf{1}_{\left[U\le\frac{f(Y)}{M g(Y)}\right]}\\[6pt] &= \operatorname{E}\left[\operatorname{E}[\mathbf{1}_{\left[U\le\frac{f(Y)}{M g(Y)}\right]}| Y]\right] & (\text{by tower property } ) \\[6pt] &= \operatorname{E}\left[\mathbb{P}\left(U\le \frac{f(Y)}{Mg(Y)} \biggr| Y\right) \right]\\[6pt] &= \operatorname{E}\left[\frac{f(Y)}{M g(Y)}\right] & (\text{because } \Pr(U \leq u) = u, \text{when } U \text{ is uniform on } (0,1)) \\[6pt] &=\int\limits_{y: g(y) > 0} \frac{f(y)}{M g(y)} g(y) \, dy\\ [6pt] &= \frac{1}{M}\int\limits_{y: g(y) > 0} f(y) \, dy \\[6pt] &= \frac{1}{M} & (\text{since support of } Y \text{ includes support of } X) \end{align} </math>where <math>U\sim \mathrm{Unif}(0,1)</math>, and the value of <math>y</math> each time is generated under the density function <math>g(\cdot)</math> of the proposal distribution <math>Y</math>.

The number of samples required from <math>Y</math> to obtain an accepted value thus follows a [[geometric distribution]] with probability <math>1/M</math>, which has mean <math>M</math>. Intuitively, <math>M</math> is the expected number of iterations needed, and so a measure of the computational complexity of the algorithm.

Rearranging the above equation gives <math display="block">M=\frac{1}{\mathbb{P}\left(U\le\frac{f(Y)}{M g(Y)}\right)}</math> Note that <math display="inline">1 \le M<\infty</math>, by the above formula, since <math display="inline">\mathbb{P}\left(U\le\frac{f(Y)}{M g(Y)}\right)</math> is a probability and can only take values in the interval <math>[0,1]</math>. Because the acceptance probability is <math>1/M</math>, it is higher when <math>M</math> is chosen closer to one, which is possible only when the likelihood ratio <math display="inline">f(x)/g(x)</math> varies little, since <math>M</math> must be an upper bound for that ratio.
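The <math>1/M</math> acceptance rate can be checked empirically. The sketch below (reusing the illustrative Beta(2,2) target and uniform proposal, which are assumptions, not from the text) counts how many of a fixed number of proposals are accepted:

```python
import random

def target_pdf(x):
    # Illustrative target: f(x) = 6x(1-x), a Beta(2,2) density on [0, 1].
    return 6.0 * x * (1.0 - x)

def empirical_acceptance_rate(f, M, n_proposals):
    # Proposal Y ~ Uniform(0,1), so g(x) = 1 and the acceptance test is
    # U < f(x) / M; the accepted fraction should approach 1/M.
    accepted = 0
    for _ in range(n_proposals):
        x = random.random()
        if random.random() < f(x) / M:
            accepted += 1
    return accepted / n_proposals

random.seed(1)
rate = empirical_acceptance_rate(target_pdf, M=1.5, n_proposals=100_000)
# Theory predicts an acceptance rate of 1/M = 2/3 here.
```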
In practice, a value of <math>M</math> closer to 1 is preferred, as it implies fewer rejected samples on average and thus fewer iterations of the algorithm. In this sense, one prefers to have <math>M</math> as small as possible (while still satisfying <math>f(x) \leq M g(x)</math>), which suggests that <math>g(x)</math> should generally resemble <math>f(x)</math> in some way. Note, however, that <math>M</math> cannot be equal to 1: that would imply <math>f(x)=g(x)</math>, i.e. that the target and proposal distributions are actually the same distribution.

Rejection sampling is most often used in cases where the form of <math>f(x)</math> makes sampling difficult. A single iteration of the rejection algorithm requires sampling from the proposal distribution, drawing from a uniform distribution, and evaluating the <math>f(x)/(M g(x))</math> expression. Rejection sampling is thus more efficient than some other method whenever <math>M</math> times the cost of these operations, which is the expected cost of obtaining a sample with rejection sampling, is lower than the cost of obtaining a sample using the other method.
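This cost argument can be illustrated with a rough sketch (again using the hypothetical Beta(2,2) target with a uniform proposal, an assumption for illustration): the average number of proposal draws per accepted sample is geometric with mean <math>M</math>, so a tight bound costs far fewer draws than a wasteful one:

```python
import random

def target_pdf(x):
    # Illustrative target: f(x) = 6x(1-x), a Beta(2,2) density on [0, 1].
    return 6.0 * x * (1.0 - x)

def proposals_per_sample(f, M, n_samples):
    # Average number of Uniform(0,1) proposal draws per accepted sample;
    # the draw count per sample is Geometric(1/M), so the mean should be near M.
    draws = 0
    accepted = 0
    while accepted < n_samples:
        draws += 1
        x = random.random()
        if random.random() < f(x) / M:
            accepted += 1
    return draws / n_samples

random.seed(2)
tight = proposals_per_sample(target_pdf, M=1.5, n_samples=50_000)  # tight bound
loose = proposals_per_sample(target_pdf, M=6.0, n_samples=50_000)  # wasteful bound
```

Both values of <math>M</math> are valid bounds (the loop still samples the same target), but the looser one makes every accepted sample roughly four times as expensive.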