Hypergeometric distribution
== Properties ==

=== Working example ===

The classical application of the hypergeometric distribution is '''sampling without replacement'''. Think of an [[urn problem|urn]] with two colors of [[marbles]], red and green. Define drawing a green marble as a success and drawing a red marble as a failure. Let ''N'' denote the number of '''all marbles in the urn''' (see contingency table below) and ''K'' the number of '''green marbles'''; then ''N'' − ''K'' is the number of '''red marbles'''. Now, standing next to the urn, you close your eyes and draw ''n'' marbles without replacement. Define ''X'' as a [[random variable]] whose outcome is ''k'', the number of green marbles drawn in the experiment. This situation is illustrated by the following [[contingency table]]:

{| class="wikitable" style="text-align:center"
! || drawn || not drawn || total
|-
| align="right" | '''green marbles''' || ''k'' || ''K'' − ''k'' || ''K''
|-
| align="right" | '''red marbles''' || ''n'' − ''k'' || ''N'' + ''k'' − ''n'' − ''K'' || ''N'' − ''K''
|-
| align="right" | '''total''' || ''n'' || ''N'' − ''n'' || ''N''
|-
|}

We are interested in calculating the probability of drawing ''k'' green marbles in ''n'' draws, given that there are ''K'' green marbles out of a total of ''N'' marbles. For this example, assume that there are '''5''' green and '''45''' red marbles in the urn.
Standing next to the urn, you close your eyes and draw '''10''' marbles without replacement. What is the probability that exactly '''4''' of the '''10''' are green? This problem is summarized by the following contingency table:

{| class="wikitable" style="text-align:center"
|-
! !! drawn !! not drawn !! total
|-
| align="right" | '''green marbles''' || ''k'' = '''4''' || ''K'' − ''k'' = '''1''' || ''K'' = '''5'''
|-
| align="right" | '''red marbles''' || ''n'' − ''k'' = '''6''' || ''N'' + ''k'' − ''n'' − ''K'' = '''39''' || ''N'' − ''K'' = '''45'''
|-
| align="right" | '''total''' || ''n'' = '''10''' || ''N'' − ''n'' = '''40''' || ''N'' = '''50'''
|}

To find the probability of '''drawing ''k'' green marbles in ''n'' draws without replacement from the ''N'' marbles''', we identify ''X'' as a hypergeometric random variable and use the formula

:<math> P(X=k) = f(k;N,K,n) = {{{K \choose k} {{N-K} \choose {n-k}}}\over {N \choose n}}.</math>

To explain the formula intuitively, consider the two symmetric problems represented by the identity

:<math> {{K \choose k} {N-K \choose n-k}\over {N \choose n}} = {{{n \choose k} {{N-n} \choose {K-k}}} \over {N \choose K}}</math>

# Left-hand side: drawing a total of only ''n'' marbles out of the urn. We want the probability of drawing ''k'' green marbles out of the ''K'' green marbles and ''n'' − ''k'' red marbles out of the ''N'' − ''K'' red marbles in these ''n'' draws.
# Right-hand side: alternatively, drawing all ''N'' marbles out of the urn. We want the probability of drawing ''k'' green marbles in the first ''n'' draws and the remaining ''K'' − ''k'' green marbles in the other ''N'' − ''n'' draws.

Back to the calculation, we use the formula above to compute the probability of drawing exactly ''k'' green marbles:

:<math> P(X=4) = f(4;50,5,10) = {{{5 \choose 4} {{45} \choose {6}}}\over {50 \choose 10}} = {5\cdot 8145060\over 10272278170} = 0.003964583\dots. </math>

Intuitively we would expect it to be even more unlikely that all 5 green marbles will be among the 10 drawn.
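The pmf and the identity above can be checked numerically. A minimal sketch in Python (the helper name <code>hypergeom_pmf</code> is ours, not a library function; <code>scipy.stats.hypergeom</code> provides an equivalent pmf):

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k) when n marbles are drawn without replacement
    from N marbles, K of which are green."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Worked example: N = 50, K = 5 green, n = 10 drawn, k = 4 green among them.
p4 = hypergeom_pmf(4, 50, 5, 10)
print(round(p4, 9))  # 0.003964583

# The two symmetric groupings in the identity give the same probability.
lhs = comb(5, 4) * comb(45, 6) / comb(50, 10)   # draw only n marbles
rhs = comb(10, 4) * comb(40, 1) / comb(50, 5)   # draw all N; 4 greens land in the first 10
print(abs(lhs - rhs) < 1e-12)  # True
```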
:<math> P(X=5) = f(5;50,5,10) = {{{5 \choose 5} {{45} \choose {5}}}\over {50 \choose 10}} = {1\cdot 1221759 \over 10272278170} = 0.0001189375\dots </math>

As expected, drawing 5 green marbles is roughly 33 times less likely than drawing 4.

=== Symmetries ===

Swapping the roles of green and red marbles:
: <math> f(k;N,K,n) = f(n-k;N,N-K,n)</math>

Swapping the roles of drawn and not drawn marbles:
: <math> f(k;N,K,n) = f(K-k;N,K,N-n)</math>

Swapping the roles of green and drawn marbles:
: <math> f(k;N,K,n) = f(k;N,n,K) </math>

These symmetries generate the [[dihedral group]] <math>D_4</math>.

=== Order of draws ===

The probability of drawing any particular sequence of green and red marbles (the hypergeometric distribution) depends only on the numbers of green and red marbles drawn, not on the order in which they appear; i.e., it is an [[exchangeable random variables|exchangeable]] distribution. As a result, the probability of drawing a green marble in the <math>i^{\text{th}}</math> draw is<ref>{{cite web|url=http://www.stat.yale.edu/~pollard/Courses/600.spring2010/Handouts/Symmetry%5BPolyaUrn%5D.pdf|title=Symmetry|first=David|last=Pollard|publisher=Yale University|work=Stat 330/600 course handouts|date=Spring 2010|access-date=2025-01-19}}</ref>
: <math> P(G_i) = \frac{K}{N}.</math>

This is an ''ex ante'' probability; that is, it is based on not knowing the results of the previous draws.

=== Tail bounds ===

Let <math>X \sim \operatorname{Hypergeometric}(N,K,n)</math> and <math>p=K/N</math>.
Then for <math> 0 < t < K/N</math> we can derive the following bounds:<ref name=":0">{{citation | last = Hoeffding | first = Wassily | journal = [[Journal of the American Statistical Association]] | volume= 58 | number= 301 | pages=13–30 | title = Probability inequalities for sums of bounded random variables | year = 1963 | doi=10.2307/2282952| jstor = 2282952 | url = http://repository.lib.ncsu.edu/bitstream/1840.4/2170/1/ISMS_1962_326.pdf }}.</ref>

: <math>\begin{align} \Pr[X\le (p - t)n] &\le e^{-n\text{D}(p-t\parallel p)} \le e^{-2t^2n}\\ \Pr[X\ge (p+t)n] &\le e^{-n\text{D}(p+t\parallel p)} \le e^{-2t^2n}\\ \end{align}\!</math>

where

: <math> D(a\parallel b)=a\log\frac{a}{b}+(1-a)\log\frac{1-a}{1-b}</math>

is the [[Kullback–Leibler divergence]], and we use the fact that <math>D(a\parallel b) \ge 2(a-b)^2</math>.<ref name="wordpress.com">{{cite web|url=https://ahlenotes.wordpress.com/2015/12/08/hypergeometric_tail/|title=Another Tail of the Hypergeometric Distribution|date=8 December 2015|website=wordpress.com|access-date=19 March 2018}}</ref>

'''Note''': To derive the bounds above, one starts by observing that <math>X = \sum_{i=1}^n Y_i</math>, where the <math>Y_i</math> are ''dependent'' random variables sharing a common distribution. Because most theorems bounding sums of random variables concern ''independent'' sequences, one first considers a sequence <math>Z_i</math> of ''independent'' random variables with that same distribution and applies the theorems to <math>X' = \sum_{i=1}^{n}Z_i</math>. Hoeffding<ref name=":0" /> then proved that the results and bounds obtained in this way also hold for <math>X</math>.
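These bounds can be illustrated numerically on the urn example above, with ''N'' = 50, ''K'' = 5, ''n'' = 10 (so ''p'' = 0.1) and ''t'' = 0.2, for which (''p'' + ''t'')''n'' = 3. A sketch in Python (the function names are ours):

```python
import math
from math import comb

def pmf(k, N, K, n):
    # P(X = k) for X ~ Hypergeometric(N, K, n)
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def D(a, b):
    # Kullback-Leibler divergence between Bernoulli(a) and Bernoulli(b)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

N, K, n = 50, 5, 10
p, t = K / N, 0.2

# Exact upper tail Pr[X >= (p + t) n] = Pr[X >= 3], summed over feasible k.
exact = sum(pmf(k, N, K, n) for k in range(3, min(n, K) + 1))
kl_bound = math.exp(-n * D(p + t, p))        # e^(-n D(p+t || p))
hoeffding_bound = math.exp(-2 * t**2 * n)    # e^(-2 t^2 n)

print(exact <= kl_bound <= hoeffding_bound)  # True
```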
If ''n'' is larger than ''N''/2, it can be useful to apply symmetry to "invert" the bounds, which gives the following:<ref name="wordpress.com"/><ref>{{citation | last = Serfling | first = Robert | journal = [[The Annals of Statistics]] | volume = 2 | pages = 39–48 | title = Probability inequalities for the sum in sampling without replacement | year = 1974| issue = 1 | doi = 10.1214/aos/1176342611 | doi-access = free }}.</ref>

: <math>\begin{align} \Pr[X\le (p - t)n] &\le e^{-(N-n)\text{D}(p+\tfrac{tn}{N-n}\parallel p)} \le e^{-2 t^2 n \tfrac{n}{N-n}}\\ \Pr[X\ge (p+t)n] &\le e^{-(N-n)\text{D}(p-\tfrac{tn}{N-n}\parallel p)} \le e^{-2 t^2 n \tfrac{n}{N-n}}\\ \end{align}\!</math>
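The three symmetries listed above can be verified exactly with rational arithmetic. A sketch in Python using the worked urn example (the function name <code>f</code> mirrors the article's notation):

```python
from fractions import Fraction
from math import comb

def f(k, N, K, n):
    # Exact hypergeometric pmf f(k; N, K, n) as a rational number
    return Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))

N, K, n, k = 50, 5, 10, 4
assert f(k, N, K, n) == f(n - k, N, N - K, n)  # swap green and red marbles
assert f(k, N, K, n) == f(K - k, N, K, N - n)  # swap drawn and not drawn
assert f(k, N, K, n) == f(k, N, n, K)          # swap green and drawn
print(float(f(k, N, K, n)))  # ~0.003964583
```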