Editing Chernoff bound

{{Short description|Exponentially decreasing bounds on tail distributions of random variables}}
In [[probability theory]], a '''Chernoff bound''' is an exponentially decreasing upper bound on the tail of a random variable based on its [[moment generating function]]. The minimum of all such exponential bounds forms ''the'' Chernoff or '''Chernoff-Cramér bound''', which may decay faster than exponential (e.g. [[Sub-Gaussian distribution|sub-Gaussian]]).<ref name="blm">{{Cite book|last=Boucheron|first=Stéphane|url=https://www.worldcat.org/oclc/837517674|title=Concentration Inequalities: a Nonasymptotic Theory of Independence|date=2013|publisher=Oxford University Press|others=Gábor Lugosi, Pascal Massart|isbn=978-0-19-953525-5|location=Oxford|page=21|oclc=837517674}}</ref><ref>{{Cite web|last=Wainwright|first=M.|date=January 22, 2015|title=Basic tail and concentration bounds|url=https://www.stat.berkeley.edu/~mjwain/stat210b/Chap2_TailBounds_Jan22_2015.pdf|url-status=live|archive-url=https://web.archive.org/web/20160508170739/http://www.stat.berkeley.edu:80/~mjwain/stat210b/Chap2_TailBounds_Jan22_2015.pdf |archive-date=2016-05-08 }}</ref> It is especially useful for sums of independent random variables, such as sums of [[Bernoulli random variable]]s.<ref>{{Cite book|last=Vershynin|first=Roman|url=https://www.worldcat.org/oclc/1029247498|title=High-dimensional probability : an introduction with applications in data science|date=2018|isbn=978-1-108-41519-4|location=Cambridge, United Kingdom|oclc=1029247498|page=19}}</ref><ref>{{Cite journal|last=Tropp|first=Joel A.|date=2015-05-26|title=An Introduction to Matrix Concentration Inequalities|url=https://www.nowpublishers.com/article/Details/MAL-048|journal=Foundations and Trends in Machine Learning|language=English|volume=8|issue=1–2|page=60|doi=10.1561/2200000048|arxiv=1501.01571|s2cid=5679583|issn=1935-8237}}</ref>

The bound is commonly named after [[Herman Chernoff]] who described the method in a 1952 paper,<ref>{{Cite journal|last=Chernoff|first=Herman|date=1952|title=A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of Observations|journal=The Annals of Mathematical Statistics|volume=23|issue=4|pages=493–507|doi=10.1214/aoms/1177729330|jstor=2236576|issn=0003-4851|doi-access=free}}</ref> though Chernoff himself attributed it to Herman Rubin.<ref>{{cite book | url=http://www.crcpress.com/product/isbn/9781482204964 | title=Past, Present, and Future of Statistics | chapter=A career in statistics | page=35 | publisher=CRC Press | last1=Chernoff | first1=Herman | editor-first1=Xihong | editor-last1=Lin | editor-first2=Christian | editor-last2=Genest | editor-first3=David L. | editor-last3=Banks | editor-first4=Geert | editor-last4=Molenberghs | editor-first5=David W. | editor-last5=Scott | editor-first6=Jane-Ling | editor-last6=Wang  | editor6-link = Jane-Ling Wang| year=2014 | isbn=9781482204964 | archive-url=https://web.archive.org/web/20150211232731/https://nisla05.niss.org/copss/past-present-future-copss.pdf | archive-date=2015-02-11 | chapter-url=https://nisla05.niss.org/copss/past-present-future-copss.pdf}}</ref> In 1938 [[Harald Cramér]] had published an almost identical concept now known as [[Cramér's theorem (large deviations)|Cramér's theorem]].

It is a sharper bound than the first- or second-moment-based tail bounds such as [[Markov's inequality]] or [[Chebyshev's inequality]], which only yield power-law bounds on tail decay. However, when applied to sums the Chernoff bound requires the random variables to be independent, a condition that is not required by either Markov's inequality or Chebyshev's inequality.

The Chernoff bound is related to the [[Bernstein inequalities (probability theory)|Bernstein inequalities]]. It is also used to prove [[Hoeffding's inequality]], [[Bennett's inequality]], and [[Doob martingale#McDiarmid's inequality|McDiarmid's inequality]].

== Generic Chernoff bounds ==
[[File:Chernoff-bound.svg|thumb|Two-sided Chernoff bound for a [[Chi-squared distribution|chi-square]] random variable]]
The generic Chernoff bound for a random variable <math>X</math> is attained by applying [[Markov's inequality]] to <math>e^{tX}</math> (which is why it is sometimes called the ''exponential Markov'' or ''exponential moments'' bound). For positive <math>t</math> this gives a bound on the [[Survival function|right tail]] of <math>X</math> in terms of its [[moment-generating function]] <math>M(t) = \operatorname E (e^{t X})</math>:

:<math>\operatorname P \left(X \geq a \right) = \operatorname P \left(e^{t X} \geq e^{t a}\right) \leq M(t) e^{-t a} \qquad (t > 0)</math>

Since this bound holds for every positive <math>t</math>, we may take the [[Infimum and supremum|infimum]]:

:<math>\operatorname P \left(X \geq a\right) \leq \inf_{t > 0} M(t) e^{-t a}</math>
Performing the same analysis with negative <math>t</math> we get a similar bound on the [[Cumulative distribution function|left tail]]:
:<math>\operatorname P \left(X \leq a \right) = \operatorname P \left(e^{t X} \geq e^{t a}\right) \leq M(t) e^{-t a} \qquad (t < 0)</math>

and

:<math>\operatorname P \left(X \leq a\right) \leq \inf_{t < 0} M(t) e^{-t a}</math>
The quantity <math>M(t) e^{-t a}</math> can be expressed as the expected value <math>\operatorname E (e^{t X}) e^{-t a}</math>, or equivalently <math>\operatorname E (e^{t (X-a)})</math>.

=== Properties ===

The exponential function is convex, so by [[Jensen's inequality]] <math>\operatorname E (e^{t X}) \ge e^{t \operatorname E (X)}</math>. It follows that the bound on the right tail is greater or equal to one when <math>a \le \operatorname E (X)</math>, and therefore trivial; similarly, the left bound is trivial for <math>a \ge \operatorname E (X)</math>. We may therefore combine the two infima and define the two-sided Chernoff bound:<math display="block">C(a) = \inf_{t} M(t) e^{-t a} </math>which provides an upper bound on the folded [[cumulative distribution function]] of <math>X</math> (folded at the mean, not the median).

The logarithm of the two-sided Chernoff bound is known as the [[rate function]] (or ''Cramér transform'') <math>I = -\log C</math>. It is equivalent to the [[Legendre–Fenchel transformation|Legendre–Fenchel transform]] or [[convex conjugate]] of the [[cumulant generating function]] <math>K = \log M</math>, defined as: <math display="block">I(a) = \sup_{t} at - K(t) </math>The [[Moment-generating function#Important properties|moment generating function]] is [[Logarithmically convex function|log-convex]], so by a property of the convex conjugate, the Chernoff bound must be [[Logarithmically concave function|log-concave]]. The Chernoff bound attains its maximum at the mean, <math>C(\operatorname E(X))=1</math>, and is invariant under translation: <math display="inline">C_{X+k}(a) = C_X(a - k) </math>.

The Chernoff bound is exact if and only if <math>X</math> is a single concentrated mass ([[degenerate distribution]]). The bound is tight only at or beyond the extremes of a bounded random variable, where the infima are attained for infinite <math>t</math>. For unbounded random variables the bound is nowhere tight, though it is asymptotically tight up to sub-exponential factors ("exponentially tight").{{Citation needed|date=February 2023}} Individual moments can provide tighter bounds, at the cost of greater analytical complexity.<ref>{{Cite journal |last1=Philips |first1=Thomas K. |last2=Nelson |first2=Randolph |date=1995 |title=The Moment Bound Is Tighter Than Chernoff's Bound for Positive Tail Probabilities |url=https://www.jstor.org/stable/2684633 |journal=The American Statistician |volume=49 |issue=2 |pages=175–178 |doi=10.2307/2684633 |jstor=2684633 |issn=0003-1305}}</ref>

In practice, the exact Chernoff bound may be unwieldy or difficult to evaluate analytically, in which case a suitable upper bound on the moment (or cumulant) generating function may be used instead (e.g. a sub-parabolic CGF giving a sub-Gaussian Chernoff bound).
{| class="wikitable mw-collapsible"
|+Exact rate functions and Chernoff bounds for common distributions
!Distribution
!<math>\operatorname E (X)</math>
!<math>K(t)</math>
!<math>I(a)</math>
!<math>C(a)</math>
|-
|[[Normal distribution]] 
|<math>0</math>
|<math>\frac{1}{2}\sigma^2t^2</math>
|<math>\frac{1}{2} \left( \frac{a}{\sigma} \right)^2</math>
|<math>\exp \left( {-\frac{a^2}{2\sigma^2}} \right)</math>
|-
|[[Bernoulli distribution]](detailed below)
|<math>p</math>
|<math>\ln \left( 1-p + pe^t \right)</math>
|<math>D_{KL}(a \parallel p)</math>
|<math>\left (\frac{p}{a}\right )^{a} {\left (\frac{1 - p}{1-a}\right )}^{1 - a}</math>
|-
|Standard Bernoulli 
(''H'' is the [[binary entropy function]])
|<math>\frac{1}{2}</math>
|<math>\ln \left( 1 + e^t \right) - \ln(2)</math>
|<math>\ln(2) - H(a)</math>
|<math>\frac{1}{2}a^{-a}(1-a)^{-(1-a)}</math>
|-
|[[Rademacher distribution]]
|<math>0</math>
|<math>\ln \cosh(t)</math>
|<math>\ln(2) - H\left(\frac{1+a}{2}\right)</math>
|<math>\sqrt{(1+a)^{-1-a}(1-a)^{-1+a}}</math>
|-
|[[Gamma distribution]]
|<math>\theta k</math>
|<math>-k\ln(1 - \theta t)</math>
|<math>-k\ln\frac{a}{\theta k} -k + \frac{a}{\theta} </math>
|<math>\left(\frac{a}{\theta k}\right)^k e^{k-a/\theta}</math>
|-
|[[Chi-squared distribution]]
|<math>k</math>
|<math>-\frac{k}{2}\ln (1-2t)</math>
|<math>\frac{k}{2} \left(\frac{a}{k} - 1 - \ln \frac{a}{k} \right)</math><ref>{{Cite journal |last=Ghosh |first=Malay |date=2021-03-04 |title=Exponential Tail Bounds for Chisquared Random Variables |journal=Journal of Statistical Theory and Practice |language=en |volume=15 |issue=2 |pages=35 |doi=10.1007/s42519-020-00156-x |s2cid=233546315 |issn=1559-8616|doi-access=free }}</ref>
|<math>\left( \frac{a}{k} \right)^{k/2} e^{k/2-a/2} </math>
|-
|[[Poisson distribution]]
|<math>\lambda</math>
|<math>\lambda(e^t - 1)</math>
|<math>a \ln (a/\lambda) - a + \lambda</math>
|<math>(a/\lambda)^{-a} e^{a-\lambda}</math>
|}

=== Bounds from below from the MGF ===
Using only the moment generating function, a bound from below on the tail probabilities can be obtained by applying the [[Paley–Zygmund inequality|Paley-Zygmund inequality]] to <math>e^{tX}</math>, yielding: <math display="block">\operatorname P \left(X > a\right) \geq \sup_{t > 0 \and M(t) \geq e^{ta}} \left( 1 - \frac{e^{ta}}{M(t)} \right)^2 \frac{M(t)^2}{M(2t)}</math>(a bound on the left tail is obtained for negative <math>t</math>). Unlike the Chernoff bound however, this result is not exponentially tight.

Theodosopoulos<ref>{{Cite journal |last=Theodosopoulos |first=Ted |date=2007-03-01 |title=A reversion of the Chernoff bound |url=https://www.sciencedirect.com/science/article/pii/S0167715206002884 |journal=Statistics & Probability Letters |language=en |volume=77 |issue=5 |pages=558–565 |doi=10.1016/j.spl.2006.09.003 |s2cid=16139953 |issn=0167-7152|arxiv=math/0501360 }}</ref> constructed a tight(er) MGF-based bound from below using an [[exponential tilting]] procedure.

For particular distributions (such as the [[Binomial distribution|binomial]]) bounds from below of the same exponential order as the Chernoff bound are often available.

== Sums of independent random variables ==
When {{mvar|X}} is the sum of {{mvar|n}} independent random variables {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}}, the moment generating function of {{mvar|X}} is the product of the individual moment generating functions, giving that:

{{NumBlk|:|<math>\Pr(X \geq a) \leq \inf_{t > 0} \frac{\operatorname E \left [\prod_i e^{t\cdot X_i}\right]}{e^{t\cdot a}} = \inf_{t > 0} e^{-t\cdot a}\prod_i\operatorname E\left[ e^{t\cdot X_i}\right].</math>|{{EquationRef|1}}}}

and:

: <math> \Pr (X \leq a) \leq \inf_{t < 0} e^{-ta} \prod_i \operatorname E \left[e^{t X_i} \right ]</math>

Specific Chernoff bounds are attained by calculating the moment-generating function <math>\operatorname E \left[e^{-t\cdot X_i} \right ]</math> for specific instances of the random variables <math>X_i</math>.

When the random variables are also ''identically distributed'' ([[Independent and identically distributed random variables|iid]]), the Chernoff bound for the sum reduces to a simple rescaling of the single-variable Chernoff bound. That is, the Chernoff bound for the ''average'' of ''n'' iid variables is equivalent to the ''n''th power of the Chernoff bound on a single variable (see [[Cramér's theorem (large deviations)|Cramér's theorem]]).

== Sums of independent bounded random variables ==
{{main|Hoeffding's inequality}}

Chernoff bounds may also be applied to general sums of independent, bounded random variables, regardless of their distribution; this is known as [[Hoeffding's inequality]]. The proof follows a similar approach to the other Chernoff bounds, but applying [[Hoeffding's lemma]] to bound the moment generating functions (see [[Hoeffding's inequality]]).  
:'''[[Hoeffding's inequality]].''' Suppose {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}} are [[Statistical independence|independent]] random variables taking values in {{math|[a,b].}} Let {{mvar|X}} denote their sum and let {{math|''μ'' {{=}} E[''X'']}} denote the sum's expected value. Then for any <math>t>0</math>,
::<math>\Pr (X \le \mu-t) < e^{-2t^2/(n(b-a)^2)},</math>
::<math>\Pr (X \ge \mu+t) < e^{-2t^2/(n(b-a)^2)}.</math>

== Sums of independent Bernoulli random variables ==
The bounds in the following sections for [[Bernoulli random variable]]s are derived by using that, for a Bernoulli random variable <math>X_i</math> with probability ''p'' of being equal to 1,

:<math>\operatorname E \left[e^{t\cdot X_i} \right] = (1 - p) e^0 + p e^t = 1 + p (e^t -1) \leq e^{p (e^t - 1)}.</math>

One can encounter many flavors of Chernoff bounds: the original ''additive form'' (which gives a bound on the [[Approximation error|absolute error]]) or the more practical ''multiplicative form'' (which bounds the [[Approximation error|error relative]] to the mean).

=== Multiplicative form (relative error) ===
'''Multiplicative Chernoff bound.''' Suppose {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}} are [[Statistical independence|independent]] random variables taking values in {{math|{0, 1}.}} Let {{mvar|X}} denote their sum and let {{math|''μ'' {{=}} E[''X'']}} denote the sum's expected value. Then for any {{math|''δ'' > 0}},
:<math>\Pr ( X \ge (1+\delta)\mu) \leq \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^\mu.</math>
A similar proof strategy can be used to show that for {{math|0 < ''δ'' < 1}}

:<math>\Pr(X \le (1-\delta)\mu) \leq \left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^\mu.</math>

The above formula is often unwieldy in practice, so the following looser but more convenient bounds<ref name="MitzenmacherUpfal">{{cite book | url=https://books.google.com/books?id=0bAYl6d7hvkC | title=Probability and Computing: Randomized Algorithms and Probabilistic Analysis | publisher=Cambridge University Press |author1=Mitzenmacher, Michael  |author2=Upfal, Eli | year=2005 | isbn=978-0-521-83540-4}}</ref> are often used, which follow from the inequality <math>\textstyle\frac{2\delta}{2+\delta} \le \log(1+\delta)</math> from [[List of logarithmic identities#Inequalities|the list of logarithmic inequalities]]:

:<math>\Pr( X \ge (1+\delta)\mu)\le e^{-\delta^2\mu/(2+\delta)}, \qquad 0 \le \delta,</math>
:<math>\Pr( X \le (1-\delta)\mu) \le e^{-\delta^2\mu/2}, \qquad 0 < \delta < 1,</math>
:<math>\Pr( |X - \mu| \ge \delta\mu) \le 2e^{-\delta^2\mu/3}, \qquad 0 < \delta < 1.</math>

Notice that the bounds are trivial for <math>\delta = 0</math>. 

In addition, based on the Taylor expansion for the [[Lambert W function]],<ref name="DillencourtGM"> {{cite journal 
 | last1 = Dillencourt
 | first1 = Michael
 | last2 = Goodrich
 | first2 = Michael
 | last3 = Mitzenmacher
 | first3 = Michael
 | title = Leveraging Parameterized Chernoff Bounds for Simplified Algorithm Analyses
 | journal = Information Processing Letters
 | number = 106516
 | year = 2024
 | volume = 187
 | doi = 10.1016/j.ipl.2024.106516
| doi-access = free
 }}</ref>

:<math>\Pr( X \ge R)\le 2^{-xR}, \qquad x > 0, \  R \ge (2^x e -1)\mu.</math>

=== Additive form (absolute error) ===
The following theorem is due to [[Wassily Hoeffding]]<ref>{{cite journal
 |last1=Hoeffding |first1=W.
 |year=1963
 |title=Probability Inequalities for Sums of Bounded Random Variables
 |journal=[[Journal of the American Statistical Association]]
 |volume=58 |issue=301 |pages=13–30
 |doi=10.2307/2282952
 |jstor=2282952
|url=http://repository.lib.ncsu.edu/bitstream/1840.4/2170/1/ISMS_1962_326.pdf
 }}</ref> and hence is called the Chernoff–Hoeffding theorem.

:'''Chernoff–Hoeffding theorem.''' Suppose {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}} are [[i.i.d.]] random variables, taking values in {{math|{0, 1}.}} Let {{math|''p'' {{=}} E[''X''<sub>1</sub>]}} and {{math|''ε'' > 0}}.

::<math>\begin{align}
\Pr \left (\frac{1}{n} \sum X_i \geq p + \varepsilon \right ) \leq \left (\left (\frac{p}{p + \varepsilon}\right )^{p+\varepsilon} {\left (\frac{1 - p}{1-p- \varepsilon}\right )}^{1 - p- \varepsilon}\right )^n &= e^{-D(p+\varepsilon\parallel p) n} \\
\Pr \left (\frac{1}{n} \sum X_i \leq p - \varepsilon \right ) \leq \left (\left (\frac{p}{p - \varepsilon}\right )^{p-\varepsilon} {\left (\frac{1 - p}{1-p+ \varepsilon}\right )}^{1 - p+ \varepsilon}\right )^n &= e^{-D(p-\varepsilon\parallel p) n}
\end{align}</math>
:where
::<math> D(x\parallel y) = x \ln \frac{x}{y} + (1-x) \ln \left (\frac{1-x}{1-y} \right )</math>
:is the [[Kullback–Leibler divergence]] between [[Bernoulli distribution|Bernoulli distributed]] random variables with parameters ''x'' and ''y'' respectively. If {{math|''p'' ≥ {{sfrac|1|2}},}} then <math>D(p+\varepsilon\parallel p)\ge \tfrac{\varepsilon^2}{2p(1-p)}</math> which means

::<math> \Pr\left ( \frac{1}{n}\sum X_i>p+x \right ) \leq \exp \left (-\frac{x^2n}{2p(1-p)} \right ).</math>

A simpler bound follows by relaxing the theorem using {{math|''D''(''p'' + ''ε'' {{!!}} ''p'') ≥ 2''ε''<sup>2</sup>}}, which follows from the [[Convex function|convexity]] of {{math|''D''(''p'' + ''ε'' {{!!}} ''p'')}} and the fact that

:<math>\frac{d^2}{d\varepsilon^2} D(p+\varepsilon\parallel p) = \frac{1}{(p+\varepsilon)(1-p-\varepsilon) } \geq 4 =\frac{d^2}{d\varepsilon^2}(2\varepsilon^2).</math>

This result is a special case of [[Hoeffding's inequality]]. Sometimes, the bounds

:<math>
\begin{align}
D( (1+x) p \parallel p) \geq \frac{1}{4} x^2 p, & & & {-\tfrac{1}{2}} \leq x \leq \tfrac{1}{2},\\[6pt]
D(x \parallel y) \geq \frac{3(x-y)^2}{2(2y+x)}, \\[6pt]
D(x \parallel y) \geq \frac{(x-y)^2}{2y}, & & & x \leq y,\\[6pt]
D(x \parallel y) \geq \frac{(x-y)^2}{2x}, & & & x \geq y
\end{align}
</math>

which are stronger for {{math|''p'' < {{sfrac|1|8}},}} are also used.

==Applications==
Chernoff bounds have very useful applications in [[set balancing]] and [[Packet (information technology)|packet]] [[routing]] in [[sparse graph|sparse]] networks.

The set balancing problem arises while designing statistical experiments. Typically while designing a statistical experiment, given the features of each participant in the experiment, we need to know how to divide the participants into 2 disjoint groups such that each feature is roughly as balanced as possible between the two groups.<ref name="0bAYl6d7hvkC">Refer to this [https://books.google.com/books?id=0bAYl6d7hvkC&pg=PA71 book section] for more info on the problem.</ref>

Chernoff bounds are also used to obtain tight bounds for permutation routing problems which reduce [[network congestion]] while routing packets in sparse networks.<ref name="0bAYl6d7hvkC" />

Chernoff bounds are used in [[computational learning theory]] to prove that a learning algorithm is [[Probably approximately correct learning|probably approximately correct]], i.e. with high probability the algorithm has small error on a sufficiently large training data set.<ref>{{cite book |first1=M. |last1=Kearns |first2=U. |last2=Vazirani |title=An Introduction to Computational Learning Theory |at=Chapter 9 (Appendix), pages 190–192 |publisher=MIT Press |year=1994 |isbn=0-262-11193-4 }}</ref>

Chernoff bounds can be effectively used to evaluate the "robustness level" of an application/algorithm by exploring its perturbation space with randomization.<ref name="Alippi2014">{{cite book |first=C. |last=Alippi |chapter=Randomized Algorithms |title=Intelligence for Embedded Systems |publisher=Springer |year=2014 |isbn=978-3-319-05278-6 }}</ref>
The use of the Chernoff bound permits one to abandon the strong—and mostly unrealistic—small perturbation hypothesis (the perturbation magnitude is small). The robustness level can be, in turn, used either to validate or reject a specific algorithmic choice, a hardware implementation or the appropriateness of a solution whose structural parameters are affected by uncertainties.

A simple and common use of Chernoff bounds is for "boosting" of [[randomized algorithm]]s. If one has an algorithm that outputs a guess that is the desired answer with probability ''p'' > 1/2, then one can get a higher success rate by running the algorithm <math>n = \log(1/\delta) 2p/(p - 1/2)^2</math> times and outputting a guess that is output by more than ''n''/2 runs of the algorithm. (There cannot be more than one such guess.) Assuming that these algorithm runs are independent, the probability that more than ''n''/2 of the guesses is correct is equal to the probability that the sum of independent Bernoulli random variables {{math|''X<sub>k</sub>''}} that are 1 with probability ''p'' is more than ''n''/2. This can be shown to be at least <math>1-\delta</math> via the multiplicative Chernoff bound (Corollary 13.3 in Sinclair's class notes, {{math|''μ'' {{=}} ''np''}}).:<ref>{{Cite web|url = http://www.cs.berkeley.edu/~sinclair/cs271/n13.pdf|title = Class notes for the course "Randomness and Computation"|date = Fall 2011|access-date = 30 October 2014|last = Sinclair|first = Alistair|archive-url = https://web.archive.org/web/20141031035717/http://www.cs.berkeley.edu/~sinclair/cs271/n13.pdf|archive-date = 31 October 2014|url-status = dead}}</ref>

:<math>\Pr\left[X > {n \over 2}\right] \ge 1 - e^{-n \left(p - 1/2 \right)^2/(2p)} \geq 1-\delta</math>

== Matrix Chernoff bound ==
{{main|Matrix Chernoff bound}}

[[Rudolf Ahlswede]] and [[Andreas Winter]] introduced a Chernoff bound for matrix-valued random variables.<ref>{{cite journal
 |last1=Ahlswede |first1=R.
 |last2=Winter |first2=A.
 |year=2003
 |title=Strong Converse for Identification via Quantum Channels
 |volume=48 |issue=3 |pages=569–579
 |journal=[[IEEE Transactions on Information Theory]]
 |arxiv=quant-ph/0012127
 |doi=10.1109/18.985947
|s2cid=523176
 }}</ref> The following version of the inequality can be found in the work of Tropp.<ref>{{cite journal
 |last1=Tropp |first1=J.
 |year=2010
 |title=User-friendly tail bounds for sums of random matrices
 |arxiv=1004.4389
 |doi=10.1007/s10208-011-9099-z
 |volume=12
 |issue=4
 |journal=Foundations of Computational Mathematics
 |pages=389–434
|s2cid=17735965
 }}</ref>

Let {{math|''M''<sub>1</sub>, ..., ''M<sub>t</sub>''}} be independent matrix valued random variables such that <math> M_i\in \mathbb{C}^{d_1 \times d_2} </math> and <math> \mathbb{E}[M_i]=0</math>.
Let us denote by <math> \lVert M \rVert </math> the operator norm of the matrix <math> M </math>. If <math> \lVert M_i \rVert \leq \gamma </math> holds almost surely for all <math> i\in\{1,\ldots, t\} </math>, then for every {{math|''ε'' > 0}}

:<math>\Pr\left( \left\| \frac{1}{t} \sum_{i=1}^t M_i \right\| > \varepsilon \right) \leq (d_1+d_2) \exp \left( -\frac{3\varepsilon^2 t}{8\gamma^2} \right).</math>

Notice that in order to conclude that the deviation from 0 is bounded by {{math|''ε''}} with high probability, we need to choose a number of samples <math>t </math> proportional to the logarithm of <math> d_1+d_2 </math>. In general, unfortunately, a dependence on  <math> \log(\min(d_1,d_2)) </math> is inevitable: take for example a diagonal random sign matrix of dimension <math>d\times d </math>. The operator norm of the sum of ''t'' independent samples is precisely the maximum deviation among ''d'' independent random walks of length ''t''. In order to achieve a fixed bound on the maximum deviation with constant probability, it is easy to see that ''t'' should grow logarithmically with ''d'' in this scenario.<ref>{{cite arXiv |last1=Magen |first1=A.|author1-link=Avner Magen |last2=Zouzias |first2=A. |year=2011 |title=Low Rank Matrix-Valued Chernoff Bounds and Approximate Matrix Multiplication |class=cs.DM |eprint=1005.2724 }}</ref>

The following theorem can be obtained by assuming ''M'' has low rank, in order to avoid the dependency on the dimensions.

===Theorem without the dependency on the dimensions===
Let {{math|0 < ''ε'' < 1}} and ''M'' be a random symmetric real matrix with <math>\| \operatorname E[M] \| \leq 1 </math> and <math>\| M\| \leq \gamma </math> almost surely. Assume that each element on the support of ''M'' has at most rank ''r''. Set 
:<math> t = \Omega \left( \frac{\gamma\log (\gamma/\varepsilon^2)}{\varepsilon^2} \right).</math>
If <math> r \leq t </math> holds almost surely, then

:<math>\Pr\left(\left\| \frac{1}{t} \sum_{i=1}^t M_i - \operatorname E[M] \right\| > \varepsilon \right) \leq \frac{1}{\mathbf{poly}(t)}</math>

where {{math|''M''<sub>1</sub>, ..., ''M<sub>t</sub>''}} are i.i.d. copies of ''M''.

==Sampling variant==

The following variant of Chernoff's bound can be used to bound the probability that a majority in a population will become a minority in a sample, or vice versa.<ref>{{Cite book | doi = 10.1007/3-540-44676-1_35| chapter = Competitive Auctions for Multiple Digital Goods| title = Algorithms — ESA 2001| volume = 2161| pages = 416| series = Lecture Notes in Computer Science| year = 2001| last1 = Goldberg | first1 = A. V. | last2 = Hartline | first2 = J. D. | isbn = 978-3-540-42493-2| citeseerx = 10.1.1.8.5115}}; lemma 6.1</ref>

Suppose there is a general population ''A'' and a sub-population ''B''&nbsp;⊆&nbsp;''A''. Mark the relative size of the sub-population (|''B''|/|''A''|) by&nbsp;''r''.

Suppose we pick an integer ''k'' and a random sample ''S''&nbsp;⊂&nbsp;''A'' of size ''k''. Mark the relative size of the sub-population in the sample (|''B''∩''S''|/|''S''|) by ''r<sub>S</sub>''.

Then, for every fraction ''d''&nbsp;∈&nbsp;[0,1]:

:<math>\Pr\left(r_S < (1-d)\cdot r\right) < \exp\left(-r\cdot d^2 \cdot \frac k 2\right)</math>

In particular, if ''B'' is a majority in ''A'' (i.e. ''r''&nbsp;>&nbsp;0.5) we can bound the probability that ''B'' will remain majority in ''S''(''r<sub>S</sub>''&nbsp;>&nbsp;0.5) by taking: ''d''&nbsp;=&nbsp;1&nbsp;−&nbsp;1/(2''r''):<ref>See graphs of: [https://www.desmos.com/calculator/eqvyjug0re the bound as a function of ''r'' when ''k'' changes] and [https://www.desmos.com/calculator/nxurzg7bqj the bound as a function of ''k'' when ''r'' changes].</ref>

:<math>\Pr\left(r_S > 0.5\right) > 1 - \exp\left(-r\cdot \left(1 - \frac{1}{2 r}\right)^2 \cdot \frac k 2 \right)</math>

This bound is of course not tight at all. For example, when ''r''&nbsp;=&nbsp;0.5 we get a trivial bound Prob&nbsp;>&nbsp;0.

==Proofs==

===Multiplicative form===
Following the conditions of the multiplicative Chernoff bound, let {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}} be independent [[Bernoulli random variable]]s, whose sum is {{math|''X''}}, each having probability ''p<sub>i</sub>'' of being equal to 1. For a Bernoulli variable:

:<math>\operatorname E \left[e^{t\cdot X_i} \right] = (1 - p_i) e^0 + p_i e^t = 1 + p_i (e^t -1) \leq e^{p_i (e^t - 1)}</math>

So, using ({{EquationNote|1}}) with <math>a = (1+\delta)\mu</math> for any <math>\delta>0</math> and where <math>\mu = \operatorname E[X] = \textstyle\sum_{i=1}^n p_i</math>,

:<math>\begin{align}
\Pr (X > (1 + \delta)\mu) &\le \inf_{t \geq 0} \exp(-t(1+\delta)\mu)\prod_{i=1}^n\operatorname{E}[\exp(tX_i)]\\[4pt]
& \leq \inf_{t \geq 0} \exp\Big(-t(1+\delta)\mu + \sum_{i=1}^n p_i(e^t - 1)\Big) \\[4pt]
& = \inf_{t \geq 0} \exp\Big(-t(1+\delta)\mu + (e^t - 1)\mu\Big).
\end{align}</math>

If we simply set {{math|''t'' {{=}} log(1 + ''δ'')}} so that {{math|''t'' > 0}} for {{math|''δ'' > 0}}, we can substitute and find

:<math>\exp\Big(-t(1+\delta)\mu + (e^t - 1)\mu\Big) = \frac{\exp((1+\delta - 1)\mu)}{(1+\delta)^{(1+\delta)\mu}} = \left[\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right]^\mu.</math>

This proves the result desired.

===Chernoff–Hoeffding theorem (additive form)===
Let {{math|''q'' {{=}} ''p'' + ''ε''}}. Taking {{math|''a'' {{=}} ''nq''}} in ({{EquationNote|1}}), we obtain:

:<math>\Pr\left ( \frac{1}{n} \sum X_i \ge q\right )\le \inf_{t>0} \frac{E \left[\prod e^{t X_i}\right]}{e^{tnq}} = \inf_{t>0} \left ( \frac{ E\left[e^{tX_i} \right] }{e^{tq}}\right )^n.</math>

Now, knowing that {{math|Pr(''X<sub>i</sub>'' {{=}} 1) {{=}} ''p'', Pr(''X<sub>i</sub>'' {{=}} 0) {{=}} 1 − ''p''}}, we have

:<math>\left (\frac{\operatorname E\left[e^{tX_i} \right] }{e^{tq}}\right )^n = \left (\frac{p e^t + (1-p)}{e^{tq} }\right )^n = \left ( pe^{(1-q)t} + (1-p)e^{-qt} \right )^n.</math>

Therefore, we can easily compute the infimum, using calculus:

:<math>\frac{d}{dt} \left (pe^{(1-q)t} + (1-p)e^{-qt} \right) = (1-q)pe^{(1-q)t}-q(1-p)e^{-qt}</math>

Setting the equation to zero and solving, we have

:<math>\begin{align}
(1-q)pe^{(1-q)t} &= q(1-p)e^{-qt} \\
(1-q)pe^{t} &= q(1-p)
\end{align}</math>

so that

:<math>e^t = \frac{(1-p)q}{(1-q)p}.</math>

Thus,

:<math>t = \log\left(\frac{(1-p)q}{(1-q)p}\right).</math>

As {{math|''q'' {{=}} ''p'' + ''ε'' > ''p''}}, we see that {{math|''t'' > 0}}, so our bound is satisfied on {{mvar|t}}. Having solved for {{mvar|t}}, we can plug back into the equations above to find that

:<math>\begin{align}
\log \left (pe^{(1-q)t} + (1-p)e^{-qt} \right ) &= \log \left ( e^{-qt}(1-p+pe^t) \right ) \\
&= \log\left (e^{-q \log\left(\frac{(1-p)q}{(1-q)p}\right)}\right) + \log\left(1-p+pe^{\log\left(\frac{1-p}{1-q}\right)}e^{\log\frac{q}{p}}\right ) \\
&= -q\log\frac{1-p}{1-q} -q \log\frac{q}{p} + \log\left(1-p+ p\left(\frac{1-p}{1-q}\right)\frac{q}{p}\right) \\
&= -q\log\frac{1-p}{1-q} -q \log\frac{q}{p} + \log\left(\frac{(1-p)(1-q)}{1-q}+\frac{(1-p)q}{1-q}\right) \\
&= -q \log\frac{q}{p} + \left ( -q\log\frac{1-p}{1-q} + \log\frac{1-p}{1-q} \right ) \\
&= -q\log\frac{q}{p} + (1-q)\log\frac{1-p}{1-q} \\
&= -D(q \parallel p).
\end{align}</math>

We now have our desired result, that

:<math>\Pr \left (\tfrac{1}{n}\sum X_i \ge p + \varepsilon\right ) \le e^{-D(p+\varepsilon\parallel p) n}.</math>

To complete the proof for the symmetric case, we simply define the random variable {{math|''Y<sub>i</sub>'' {{=}} 1 − ''X<sub>i</sub>''}}, apply the same proof, and plug it into our bound.

==See also==

* [[Bernstein inequalities (probability theory)|Bernstein inequalities]]
*[[Concentration inequality]] − a summary of tail-bounds on random variables.
*[[Cramér's theorem (large deviations)|Cramér's theorem]]
*[[Entropic value at risk]]
* [[Hoeffding's inequality]]
*[[Matrix Chernoff bound]]
*[[Moment generating function]]

== References ==
{{Reflist}}

==Further reading==
* {{cite journal
 |last1=Chernoff |first1=H.
 |year=1952
 |title=A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of Observations
 |journal=[[Annals of Mathematical Statistics]]
 |volume=23 |issue=4 |pages=493&ndash;507
 |doi=10.1214/aoms/1177729330
 |jstor=2236576
 |mr=57518
 |zbl=0048.11804
|doi-access=free
 }}
* {{cite journal
 |last1=Chernoff |first1=H.
 |year=1981
 |title=A Note on an Inequality Involving the Normal Distribution
 |journal=[[Annals of Probability]]
 |volume=9 |issue=3 |pages=533–535
 |doi=10.1214/aop/1176994428
 |jstor=2243541
 |mr=614640
 |zbl=0457.60014
|doi-access=free
 }}
* {{cite journal
 |last1=Hagerup |first1=T.
 |last2=Rüb |first2=C.
 |year=1990
 |title=A guided tour of Chernoff bounds
 |journal=[[Information Processing Letters]]
 |volume=33 |issue=6 |pages=305
 |doi=10.1016/0020-0190(90)90214-I
}}
* {{cite journal
 |last=Nielsen |first=F.
 |year=2011
 |title=An Information-Geometric Characterization of Chernoff Information
 |journal=IEEE Signal Processing Letters
 |volume=20
 |issue=3
 |pages=269–272
 |doi=10.1109/LSP.2013.2243726
 |arxiv=1102.2684
|s2cid=15034953
 }}

{{DEFAULTSORT:Chernoff Bound}}
[[Category:Probabilistic inequalities]]