Hoeffding's inequality
In probability theory, Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. Hoeffding's inequality was proven by Wassily Hoeffding in 1963.<ref>Template:Harvtxt</ref>
Hoeffding's inequality is a special case of the Azuma–Hoeffding inequality and McDiarmid's inequality. It is similar to the Chernoff bound, but tends to be less sharp, in particular when the variance of the random variables is small.<ref>Template:Harvtxt</ref> It is similar to, but incomparable with, one of Bernstein's inequalities.
Statement
Let <math>X_1, \ldots, X_n</math> be independent random variables such that <math>a_i \leq X_i \leq b_i</math> almost surely. Consider the sum of these random variables,
- <math>S_n = X_1 + \cdots + X_n.</math>
Then Hoeffding's theorem states that, for all <math>t > 0</math>,<ref>Template:Harvtxt</ref>
- <math>\begin{align}
\operatorname{P} \left(S_n - \mathrm{E}\left [S_n \right] \geq t \right) &\leq \exp \left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right) \\ \operatorname{P} \left(\left |S_n - \mathrm{E}\left [S_n \right] \right | \geq t \right) &\leq 2\exp \left(-\frac{2t^2}{\sum_{i=1}^n(b_i - a_i)^2} \right) \end{align}</math>
Here <math>\operatorname{E}[S_n]</math> is the expected value of <math>S_n</math>.
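For example (an illustrative calculation added here): if <math>n = 100</math> and each <math>X_i</math> takes values in <math>[0, 1]</math>, then <math>\sum_{i=1}^n (b_i - a_i)^2 = 100</math>, and the probability of a deviation of <math>t = 10</math> above the mean is bounded by
- <math>\operatorname{P}\left(S_n - \operatorname{E}[S_n] \geq 10\right) \leq \exp\left(-\frac{2\cdot 10^2}{100}\right) = e^{-2} \approx 0.135.</math>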
Note that the inequalities also hold when the <math>X_i</math> have been obtained using sampling without replacement; in this case the random variables are no longer independent. A proof of this statement can be found in Hoeffding's paper. For slightly better bounds in the case of sampling without replacement, see for instance the paper by Template:Harvtxt.
Generalization
Let <math>Y_1, \dots, Y_n </math> be independent observations such that <math>\operatorname{E} (Y_i) = 0</math> and <math>a_i \le Y_i \le b_i </math>. Let <math>\epsilon > 0</math>. Then, for any <math>t > 0</math>,<ref>Template:Cite journal</ref><math display="block">P \left( \sum_{i=1}^n Y_i \ge \epsilon \right) \le \exp \left( - t \epsilon + \sum_{i=1}^n t^2 (b_i - a_i)^2 / 8 \right)</math>
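Minimizing the right-hand side over <math>t > 0</math> (a standard optimization step, written out here for completeness): the exponent <math>-t\epsilon + t^2\sum_{i=1}^n (b_i - a_i)^2/8</math> is smallest at <math>t = 4\epsilon / \sum_{i=1}^n (b_i - a_i)^2</math>, which yields
- <math>P \left( \sum_{i=1}^n Y_i \ge \epsilon \right) \le \exp \left( -\frac{2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2} \right),</math>
matching the one-sided bound in the statement above.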
Special case: Bernoulli random variables
Suppose <math>a_i = 0</math> and <math>b_i = 1</math> for all <math>i</math>. This can occur when the <math>X_i</math> are independent Bernoulli random variables, though they need not be identically distributed. Then we get the inequality<ref>Template:Harvtxt</ref>
- <math>\begin{align}
\operatorname{P}\left(S_n - \mathrm{E}\left[S_n\right] \geq t\right) &\leq \exp(-2t^2/n)\\ \operatorname{P}\left(\left|S_n - \mathrm{E}\left[S_n\right]\right| \geq t\right) &\leq 2\exp(-2t^2/n) \end{align}</math> or equivalently,
<math display="block">\begin{align} \operatorname{P}\left((S_n - \mathrm{E}\left[S_n\right])/n \geq t\right) &\leq \exp(-2 n t^2 )\\ \operatorname{P}\left(\left|(S_n - \mathrm{E}\left[S_n\right])/n\right| \geq t\right) &\leq 2\exp(-2 n t^2) \end{align}</math>
for all <math>t \geq 0</math>. This is a version of the additive Chernoff bound that is more general, since it applies to random variables taking any values in <math>[0, 1]</math>, not only to Bernoulli variables, but also weaker, since the Chernoff bound gives a better tail bound when the random variables have small variance.
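The bound can also be checked numerically. The following simulation is an illustrative sketch added here, using NumPy and arbitrary example parameters (<math>n = 1000</math>, <math>p = 0.3</math>, <math>t = 30</math>); it estimates the two-sided tail probability of a Bernoulli sum and compares it with <math>2\exp(-2t^2/n)</math>.
<syntaxhighlight lang="python">
import numpy as np

# Illustrative parameters (chosen for this sketch, not from the article).
n, p, t = 1000, 0.3, 30
trials = 100_000

rng = np.random.default_rng(0)
# Each draw is a sum S_n of n independent Bernoulli(p) variables.
S = rng.binomial(n, p, size=trials)

# Estimated two-sided tail probability P(|S_n - E[S_n]| >= t).
empirical = np.mean(np.abs(S - n * p) >= t)
# Hoeffding bound for variables in [0, 1]: 2 * exp(-2 t^2 / n).
bound = 2 * np.exp(-2 * t**2 / n)

print(f"empirical tail ~ {empirical:.4f}, Hoeffding bound = {bound:.4f}")
</syntaxhighlight>
The empirical frequency should come out well below the bound, since Hoeffding's inequality does not exploit the variance of the summands.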
General case of random variables bounded from above
Hoeffding's inequality can be extended to random variables that are bounded only from above.<ref>Template:Harvtxt</ref>
Let <math>X_1, \ldots, X_n</math> be independent random variables such that <math>\mathrm{E}X_{i}=0</math> and <math>X_i \leq b_i</math> almost surely. Define
- <math>
C_i^2 = \begin{cases} \mathrm{E}X_i^2, & \text{if } \mathrm{E}X_i^2 \geq b_i^2, \\ \dfrac{1}{4}\left( b_i + \dfrac{\mathrm{E}X_i^2}{b_i} \right)^2, & \text{otherwise}. \end{cases}
</math>
Hoeffding's inequality for random variables bounded from above states that for all <math>t \geq 0</math>,
- <math>
\mathrm{P}\left( \left| \sum_{i=1}^n X_i \right| \geq t \right) \leq 2\exp\left( -\frac{ t^2}{2\sum_{i=1}^nC_i^2 } \right). </math> In particular, if <math>\mathrm{E}X_{i}^2 \geq b^2_{i}</math> for all <math>i</math>, then for all <math>t \geq 0</math>,
- <math>
\mathrm{P}\left( \left| \sum_{i=1}^n X_i \right| \geq t \right) \leq 2\exp\left( -\frac{ t^2}{2\sum_{i=1}^n \mathrm{E}X_{i}^2 } \right). </math>
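As a consistency check (an added illustration): if each <math>X_i</math> takes the values <math>\pm 1</math> with probability <math>1/2</math> each, then <math>b_i = 1</math> and <math>\mathrm{E}X_i^2 = 1 \geq b_i^2</math>, so <math>C_i^2 = 1</math> and the bound reads
- <math>\mathrm{P}\left( \left| \sum_{i=1}^n X_i \right| \geq t \right) \leq 2\exp\left( -\frac{t^2}{2n} \right),</math>
which coincides with the basic Hoeffding bound for variables of range <math>b_i - a_i = 2</math>, namely <math>2\exp\left(-2t^2/(4n)\right)</math>.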
General case of sub-Gaussian random variables
The proof of Hoeffding's inequality can be generalized to any sub-Gaussian distribution. Recall that a random variable <math>X</math> is called sub-Gaussian<ref>Template:Harvtxt</ref> if
- <math>\mathrm{P}(|X|\geq t)\leq 2e^{-ct^2},</math>
for all <math>t > 0</math> and some <math>c>0</math>. Every bounded random variable is sub-Gaussian: if <math>|X| \leq T</math> almost surely, then <math>\mathrm{P}(|X|\geq t) = 0</math> for <math>t > T</math>, so the bound holds trivially there, while for <math>t \leq T</math> choosing <math>c = \log(2)/T^2</math> gives <math>2e^{-ct^2} \geq 2e^{-cT^2} = 1 \geq \mathrm{P}(|X|\geq t)</math>.
For a random variable <math>X</math>, the following norm is finite if and only if <math>X</math> is sub-Gaussian:
- <math>\Vert X \Vert_{\psi_2} := \inf\left\{c\geq 0: \mathrm{E} \left( e^{X^2/c^2} \right) \leq 2\right\}.</math>
Now let <math>X_1, \ldots, X_n</math> be independent sub-Gaussian random variables. The general version of Hoeffding's inequality states that:
- <math>
\mathrm{P}\left( \left| \sum_{i=1}^n (X_i - E[X_i]) \right| \geq t \right) \leq 2\exp\left( -\frac{ct^2}{\sum_{i=1}^n \Vert X_i \Vert^2_{\psi_2}} \right), </math>
where c > 0 is an absolute constant.<ref>Template:Harvtxt</ref>
Equivalently, sub-Gaussian distributions can be characterized via a variance proxy, defined as follows. If there exists some <math>s^2</math> such that <math>\operatorname{E} [e^{(X-\operatorname{E}[X])t}] \leq e^{\frac{s^2t^2}{2}}</math> for all <math>t</math>, then <math>s^2</math> is called a variance proxy; the smallest such <math>s^2</math> is called the optimal variance proxy and is denoted by <math>\Vert X\Vert_{\mathrm{vp}}^2</math>. In this form, Hoeffding's inequality states<math display="block"> \mathrm{P}\left( \left| \sum_{i=1}^n (X_i - E[X_i]) \right| \geq t \right) \leq 2\exp\left( -\frac{t^2}{2\sum_{i=1}^n \| X_i \|^2_{\mathrm{vp}}} \right). </math>
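For instance (an added remark): by Hoeffding's lemma, a variable with <math>a \le X \le b</math> almost surely satisfies <math>\operatorname{E}[e^{(X - \operatorname{E}[X])t}] \le e^{t^2(b-a)^2/8}</math> for all <math>t</math>, so <math>\Vert X \Vert_{\mathrm{vp}}^2 \le (b-a)^2/4</math>. Bounding each <math>\Vert X_i \Vert_{\mathrm{vp}}^2</math> by <math>(b_i - a_i)^2/4</math> in the display above gives
- <math>\mathrm{P}\left( \left| \sum_{i=1}^n (X_i - E[X_i]) \right| \geq t \right) \leq 2\exp\left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right),</math>
recovering the basic form of Hoeffding's inequality.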
Proof
The key ingredient is Hoeffding's lemma: if <math>X</math> is a real random variable with <math>\operatorname{E}[X] = 0</math> and <math>a \leq X \leq b</math> almost surely, then for every <math>s \in \mathbb{R}</math>,
- <math>\operatorname{E}\left[e^{sX}\right] \leq \exp\left(\tfrac{1}{8}s^2(b-a)^2\right).</math>
The proof of Hoeffding's inequality then follows similarly to concentration inequalities like Chernoff bounds:<ref>Template:Harvtxt</ref> apply Markov's inequality to the exponential moment <math>e^{s(S_n - \operatorname{E}[S_n])}</math>, use independence together with Hoeffding's lemma to bound the resulting moment generating function, and finally optimize over the free parameter <math>s > 0</math>.
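A sketch of the standard argument for the one-sided bound, written out here for concreteness (with <math>s > 0</math> a free parameter):
- <math>\begin{align}
\operatorname{P}\left(S_n - \operatorname{E}[S_n] \geq t\right) &= \operatorname{P}\left(e^{s(S_n - \operatorname{E}[S_n])} \geq e^{st}\right) \\
&\leq e^{-st}\,\operatorname{E}\left[e^{s(S_n - \operatorname{E}[S_n])}\right] \qquad \text{(Markov's inequality)} \\
&= e^{-st}\prod_{i=1}^n \operatorname{E}\left[e^{s(X_i - \operatorname{E}[X_i])}\right] \qquad \text{(independence)} \\
&\leq e^{-st}\prod_{i=1}^n \exp\left(\tfrac{1}{8}s^2(b_i-a_i)^2\right) \qquad \text{(Hoeffding's lemma)} \\
&= \exp\left(-st + \tfrac{1}{8}s^2\sum_{i=1}^n(b_i-a_i)^2\right).
\end{align}</math>
Choosing <math>s = 4t / \sum_{i=1}^n (b_i - a_i)^2</math> to minimize the exponent gives the one-sided bound <math>\exp\left(-2t^2/\sum_{i=1}^n(b_i-a_i)^2\right)</math>; applying the same bound to <math>-X_i</math> and a union bound over the two tails gives the two-sided version.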
This proof generalizes readily to sub-Gaussian distributions with a variance proxy.
Usage
Confidence intervals
Hoeffding's inequality can be used to derive confidence intervals. We consider a coin that shows heads with probability <math>p</math> and tails with probability <math>1 - p</math>. We toss the coin <math>n</math> times, generating <math>n</math> samples <math>X_1,\ldots,X_n</math> (which are i.i.d. Bernoulli random variables). The expected number of times the coin comes up heads is <math>pn</math>. Furthermore, the probability that the coin comes up heads at least <math>k</math> times can be exactly quantified by the following expression:
- <math>\operatorname{P}( H(n) \geq k)= \sum_{i=k}^n \binom{n}{i} p^i (1-p)^{n-i},</math>
where <math>H(n)</math> is the number of heads in <math>n</math> coin tosses.
When <math>k = (p + \varepsilon)n</math> for some <math>\varepsilon > 0</math>, Hoeffding's inequality bounds this probability by a term that is exponentially small in <math>\varepsilon^2 n</math>:
- <math>\operatorname{P}( H(n) - pn >\varepsilon n)\leq\exp\left(-2\varepsilon^2 n\right).</math>
Since this bound holds on both sides of the mean, Hoeffding's inequality implies that the number of heads that we see is concentrated around its mean, with an exponentially small tail.
- <math>\operatorname{P}\left(|H(n) - pn| >\varepsilon n\right)\leq 2\exp\left(-2\varepsilon^2 n\right).</math>
Thinking of <math>\overline{X} = \frac{1}{n}H(n)</math> as the "observed" mean, this probability can be interpreted as the level of significance <math>\alpha</math> (probability of making an error) for a confidence interval around <math>p</math> of size <math>2\varepsilon</math>:
- <math>\alpha=\operatorname{P}(\ \overline{X} \notin [p-\varepsilon, p+\varepsilon]) \leq 2e^{-2\varepsilon^2n}</math>
Requiring the right-hand side to be at most <math>\alpha</math> and solving for <math>n</math> gives:
- <math>n\geq \frac{\log(2/\alpha)}{2\varepsilon^2}</math>
Therefore, we require at least <math>\textstyle \frac{\log(2/\alpha)}{2\varepsilon^2}</math> samples to acquire a <math>\textstyle (1-\alpha)</math>-confidence interval <math>\textstyle p \pm \varepsilon</math>.
Hence, the number of samples needed grows only logarithmically in <math>1/\alpha</math> (the confidence level) and quadratically in <math>1/\varepsilon</math> (the precision). Note that there are more efficient methods of estimating a confidence interval.
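The sample-size requirement above translates into a one-line computation. The following sketch is added for illustration (the function name and the example values <math>\varepsilon = 0.01</math>, <math>\alpha = 0.05</math> are choices made here, not from the source):
<syntaxhighlight lang="python">
import math

def hoeffding_sample_size(epsilon, alpha):
    """Smallest n with 2 * exp(-2 * epsilon**2 * n) <= alpha,
    i.e. n >= log(2 / alpha) / (2 * epsilon**2)."""
    return math.ceil(math.log(2 / alpha) / (2 * epsilon ** 2))

# Example: half-width 0.01 at 95% confidence (alpha = 0.05).
print(hoeffding_sample_size(epsilon=0.01, alpha=0.05))  # 18445
</syntaxhighlight>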
See also
- Concentration inequality – a summary of tail bounds on random variables.
- Hoeffding's lemma
- Bernstein inequalities (probability theory)
Notes
References
- Template:Cite journal
- Template:Cite journal
- Template:Cite journal
- Template:Cite book
- Template:Cite book
- Template:Cite news [1].