==Mathematics==
Given a function <math>f</math>, the goal of the attack is to find two different inputs <math>x_{1}, x_{2}</math> such that <math>f(x_{1}) = f(x_{2})</math>. Such a pair <math>x_{1}, x_{2}</math> is called a collision. The method used to find a collision is simply to evaluate the function <math>f</math> for different input values that may be chosen randomly or pseudorandomly until the same result is found more than once. Because of the birthday problem, this method can be rather efficient. Specifically, if a [[function (mathematics)|function]] <math>f(x)</math> yields any of <math>H</math> different outputs with equal probability and <math>H</math> is sufficiently large, then we expect to obtain a pair of different arguments <math>x_{1}</math> and <math>x_{2}</math> with <math>f(x_{1}) = f(x_{2})</math> after evaluating the function for about <math>1.25\sqrt{H}</math> different arguments on average.
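The procedure described above can be sketched in Python. This is only an illustration under stated assumptions: the function <code>truncated_hash</code> (SHA-256 cut down to 24 bits so that a collision appears quickly) and the helper <code>find_collision</code> are hypothetical names introduced here, and the inputs are taken from a simple counter rather than drawn at random.

```python
import hashlib
from itertools import count


def truncated_hash(data: bytes, bits: int) -> int:
    """Return the top `bits` bits of SHA-256(data) as an integer (illustrative f)."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def find_collision(bits: int = 24):
    """Evaluate the hash on successive inputs until an output repeats.

    Returns (x1, x2, evaluations) where x1 != x2 but both hash to the same value.
    """
    seen = {}  # maps each hash output to the input that produced it
    for i in count():
        x = str(i).encode()
        h = truncated_hash(x, bits)
        if h in seen:
            return seen[h], x, i + 1
        seen[h] = x


x1, x2, evaluations = find_collision(24)
print(x1, x2, evaluations)
```

With <math>H = 2^{24}</math> the number of evaluations should typically be a few thousand, in line with the <math>1.25\sqrt{H}</math> estimate, although the exact count depends on the inputs tried. The dictionary of previously seen outputs is what makes the search fast, at the cost of memory proportional to the number of evaluations.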

We consider the following experiment. From a set of ''H'' values we choose ''n'' values uniformly at random, with repetition allowed. Let ''p''(''n'';&nbsp;''H'') be the probability that during this experiment at least one value is chosen more than once. This probability can be approximated as
: <math> p(n;H) \approx 1 - e^{-n(n-1)/(2H)} \approx 1-e^{-n^2/(2H)}</math><ref>{{Cite book |last1=Bellare |first1=Mihir |url=https://web.cs.ucdavis.edu/~rogaway/classes/227/spring05/book/main.pdf |title=Introduction to Modern Cryptography |last2=Rogaway |first2=Phillip |year=2005 |pages=273–274 |language=en |chapter=The Birthday Problem |access-date=2023-03-31}}</ref>
where <math>n</math> is the number of chosen values (inputs) and <math>H</math> is the number of possible outcomes (possible hash outputs).
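As a rough check, the approximation can be compared with the exact probability, which is one minus the product of the factors <math>(H-i)/H</math> for <math>i = 0, \dots, n-1</math>. The sketch below uses hypothetical function names and takes <math>H = 365</math>, the classical birthday problem, purely as a small test case.

```python
import math


def collision_probability_exact(n: int, H: int) -> float:
    """Exact probability that n uniform draws from H values contain a repeat."""
    if n > H:
        return 1.0  # pigeonhole principle: a repeat is certain
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (H - i) / H
    return 1.0 - p_no_collision


def collision_probability_approx(n: int, H: int) -> float:
    """Approximation 1 - exp(-n(n-1)/(2H)); expm1 keeps precision for tiny results."""
    return -math.expm1(-n * (n - 1) / (2 * H))


H = 365
for n in (10, 23, 50):
    print(n, collision_probability_exact(n, H), collision_probability_approx(n, H))
```

For these values the two functions should agree to within about one percentage point, and the agreement improves as <math>H</math> grows relative to <math>n</math>.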

Let ''n''(''p'';&nbsp;''H'') be the smallest number of values we have to choose such that the probability of finding a collision is at least&nbsp;''p''. By inverting the expression above, we find the following approximation

: <math>n(p;H)\approx \sqrt{2H\ln\frac{1}{1-p}}</math>

and assigning a 0.5 probability of collision we arrive at

: <math>n(0.5;H) \approx 1.1774 \sqrt H</math>
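
The inversion can be sketched as follows, starting from the simpler approximation <math>p \approx 1-e^{-n^2/(2H)}</math> and solving for <math>n</math>:

: <math>e^{-n^2/(2H)} \approx 1-p \quad\Longrightarrow\quad \frac{n^2}{2H} \approx \ln\frac{1}{1-p} \quad\Longrightarrow\quad n \approx \sqrt{2H\ln\frac{1}{1-p}}</math>

For <math>p = 0.5</math> the logarithm equals <math>\ln 2 \approx 0.6931</math>, so the constant is <math>\sqrt{2\ln 2} \approx 1.1774</math>.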

Let ''Q''(''H'') be the expected number of values we have to choose before finding the first collision. This number can be approximated by

: <math>Q(H)\approx \sqrt{\frac{\pi}{2}H}</math>
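
The estimate for ''Q''(''H'') can be checked empirically with a small Monte Carlo simulation. The following is a minimal sketch, not a rigorous experiment: the function names, the choice <math>H = 2^{16}</math>, the number of trials and the fixed seed are all assumptions made for illustration, and the count includes the draw that produces the collision.

```python
import math
import random


def draws_until_collision(H: int, rng: random.Random) -> int:
    """Draw uniformly from range(H) until a value repeats; return the number of draws."""
    seen = set()
    while True:
        value = rng.randrange(H)
        if value in seen:
            return len(seen) + 1  # count the colliding draw as well
        seen.add(value)


def estimate_Q(H: int, trials: int = 2000, seed: int = 0) -> float:
    """Average the number of draws to the first collision over several trials."""
    rng = random.Random(seed)
    total = sum(draws_until_collision(H, rng) for _ in range(trials))
    return total / trials


H = 2**16
print(estimate_Q(H), math.sqrt(math.pi * H / 2))
```

The two printed numbers should be close to each other, with the simulated mean fluctuating by a few percent depending on the seed and the number of trials.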

As an example, if a 64-bit hash is used, there are approximately {{val|1.8|e=19}} different outputs. If these are all equally probable (the best case for collision resistance), then it would take 'only' approximately 5 billion attempts ({{val|5.38|e=9}}) to generate a collision using brute force.<ref>{{Cite book|last1=Flajolet|first1=Philippe|last2=Odlyzko|first2=Andrew M.|title=Advances in Cryptology — EUROCRYPT '89 |chapter=Random Mapping Statistics |date=1990|editor-last=Quisquater|editor-first=Jean-Jacques|editor2-last=Vandewalle|editor2-first=Joos|chapter-url=https://link.springer.com/chapter/10.1007%2F3-540-46885-4_34|series=Lecture Notes in Computer Science|volume=434|language=en|location=Berlin, Heidelberg|publisher=Springer|pages=329–354|doi=10.1007/3-540-46885-4_34|isbn=978-3-540-46885-1}}</ref> This value is called the '''birthday bound'''<ref>See [[upper and lower bounds]].</ref> and it can be approximated as 2<sup>''l''/2</sup>, where ''l'' is the number of bits in the hash output (so that ''H'' = 2<sup>''l''</sup>).<ref>{{Cite journal
  | author = Jacques Patarin, Audrey Montreuil
  | title = Benes and Butterfly schemes revisited
  | publisher = Université de Versailles
  | year = 2005
  | url = http://eprint.iacr.org/2005/004
  | format = [[PostScript]], [[PDF]]
  | access-date = 2007-03-15 }}
<!-- Replace with a better definition of the birthday bound if you find some please. -->
</ref> Other examples are as follows:
<!-- If this table is made any bigger it will cause horizontal scroll on 1024x768 screens -->
:{| class="wikitable" style="white-space:nowrap; text-align:center;"
|-
! rowspan="2"| Bits
! rowspan="2"| Possible outputs (H)
! colspan="10"| Desired probability of random collision<br>(2 s.f.) (p)
|-
! {{val|e=-18}}
! {{val|e=−15}}
! {{val|e=−12}}
! {{val|e=−9}}
! {{val|e=−6}}
! 0.1%
! 1%
! 25%
! 50%
! 75%
|-
!scope="row"| 16
!scope="row"| 2<sup>16</sup> (~{{val|6.5|e=4}})
| <2
| <2
| <2
| <2
| <2
| 11
| 36
| 190
| 300
| 430
|-
!scope="row"| 32
!scope="row"| 2<sup>32</sup> (~{{val|4.3|e=9}})
| <2
| <2
| <2
| 3
| 93
| 2900
| 9300
| 50,000
| 77,000
| 110,000
|-
!scope="row"| 64
!scope="row"| 2<sup>64</sup> (~{{val|1.8|e=19}})
| 6
| 190
| 6100
| 190,000
| 6,100,000
| {{val|1.9|e=8}}
| {{val|6.1|e=8}}
| {{val|3.3|e=9}}
| {{val|5.1|e=9}}
| {{val|7.2|e=9}}
|-
!scope="row"| 96
!scope="row"| 2<sup>96</sup> (~{{val|7.9|e=28}})
| {{val|4.0|e=5}}
| {{val|1.3|e=7}}
| {{val|4.0|e=8}}
| {{val|1.3|e=10}}
| {{val|4.0|e=11}}
| {{val|1.3|e=13}}
| {{val|4.0|e=13}}
| {{val|2.1|e=14}}
| {{val|3.3|e=14}}
| {{val|4.7|e=14}}
|-
!scope="row"| 128
!scope="row"| 2<sup>128</sup> (~{{val|3.4|e=38}})
| {{val|2.6|e=10}}
| {{val|8.2|e=11}}
| {{val|2.6|e=13}}
| {{val|8.2|e=14}}
| {{val|2.6|e=16}}
| {{val|8.3|e=17}}
| {{val|2.6|e=18}}
| {{val|1.4|e=19}}
| {{val|2.2|e=19}}
| {{val|3.1|e=19}}
|-
!scope="row"| 192
!scope="row"| 2<sup>192</sup> (~{{val|6.3|e=57}})
| {{val|1.1|e=20}}
| {{val|3.5|e=21}}
| {{val|1.1|e=23}}
| {{val|3.5|e=24}}
| {{val|1.1|e=26}}
| {{val|3.5|e=27}}
| {{val|1.1|e=28}}
| {{val|6.0|e=28}}
| {{val|9.3|e=28}}
| {{val|1.3|e=29}}
|-
!scope="row"| 256
!scope="row"| 2<sup>256</sup> (~{{val|1.2|e=77}})
| {{val|4.8|e=29}}
| {{val|1.5|e=31}}
| {{val|4.8|e=32}}
| {{val|1.5|e=34}}
| {{val|4.8|e=35}}
| {{val|1.5|e=37}}
| {{val|4.8|e=37}}
| {{val|2.6|e=38}}
| {{val|4.0|e=38}}
| {{val|5.7|e=38}}
|-
!scope="row"| 384
!scope="row"| 2<sup>384</sup> (~{{val|3.9|e=115}})
| {{val|8.9|e=48}}
| {{val|2.8|e=50}}
| {{val|8.9|e=51}}
| {{val|2.8|e=53}}
| {{val|8.9|e=54}}
| {{val|2.8|e=56}}
| {{val|8.9|e=56}}
| {{val|4.8|e=57}}
| {{val|7.4|e=57}}
| {{val|1.0|e=58}}
|-
!scope="row"| 512
!scope="row"| 2<sup>512</sup> (~{{val|1.3|e=154}})
| {{val|1.6|e=68}}
| {{val|5.2|e=69}}
| {{val|1.6|e=71}}
| {{val|5.2|e=72}}
| {{val|1.6|e=74}}
| {{val|5.2|e=75}}
| {{val|1.6|e=76}}
| {{val|8.8|e=76}}
| {{val|1.4|e=77}}
| {{val|1.9|e=77}}
|}
:''The table shows the number of hashes n''(''p'')'' needed to achieve the given probability of a collision, assuming all hash values are equally likely. For comparison, ''{{val|e=−18}}'' to ''{{val|e=−15}}'' is the uncorrectable bit error rate of a typical hard disk.<ref>{{cite arXiv|title=Empirical Measurements of Disk Failure Rates and Error Rates|first1=Jim|last1=Gray|first2=Catharine|last2=van Ingen|date=25 January 2007|eprint=cs/0701166}}</ref> In theory, [[MD5]] hashes or [[Universally unique identifier|UUIDs]], being roughly 128 bits, should stay within that range until about 820 billion documents, even though their possible outputs are far more numerous.''

If the outputs of the function are distributed unevenly, a collision can be found even faster. The notion of 'balance' of a hash function quantifies the resistance of the function to birthday attacks (which exploit uneven key distribution). However, determining the balance of a hash function typically requires evaluating it on all possible inputs and is thus infeasible for popular hash functions such as the MD and SHA families.<ref>{{Cite web |url=http://citeseer.ist.psu.edu/bellare02hash.html |title=CiteSeerX |access-date=2006-05-02 |archive-url=https://web.archive.org/web/20080223163847/http://citeseer.ist.psu.edu/bellare02hash.html |archive-date=2008-02-23 |url-status=dead }}</ref>

The subexpression <math>\ln\frac{1}{1-p}</math> in the equation for <math>n(p;H)</math> is not computed accurately for small <math>p</math> when translated directly into common programming languages as <code>log(1/(1-p))</code>, because of [[loss of significance]]. Where <code>log1p</code> is available (as it is in [[C99]], for example), the equivalent expression <code>-log1p(-p)</code> should be used instead.<ref>{{cite web|title=Compute log(1+x) accurately for small values of x|url=http://www.mathworks.com/help/techdoc/ref/log1p.html|website=Mathworks.com|access-date=29 October 2017}}</ref> If this is not done, the first column of the above table is computed as zero, and several items in the second column do not have even one correct significant digit.
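
The effect can be reproduced in Python, whose standard <code>math</code> module provides <code>log1p</code>. The sketch below assumes IEEE 754 double-precision floats, and the function names <code>n_naive</code> and <code>n_stable</code> are introduced here only for illustration.

```python
import math


def n_naive(p: float, H: int) -> float:
    """Direct translation of sqrt(2H ln(1/(1-p))); loses precision for small p."""
    return math.sqrt(2 * H * math.log(1 / (1 - p)))


def n_stable(p: float, H: int) -> float:
    """Same quantity computed with -log1p(-p), which stays accurate for small p."""
    return math.sqrt(2 * H * (-math.log1p(-p)))


H = 2**128
for p in (1e-18, 1e-15, 1e-12, 0.5):
    print(p, n_naive(p, H), n_stable(p, H))
```

For <math>p = 10^{-18}</math> the quantity <code>1 - p</code> rounds to exactly 1 in double precision, so the naive version returns zero, while the <code>log1p</code> version should reproduce the corresponding table entry. For larger probabilities such as 0.5 the two versions agree closely.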

===Simple approximation===
A good [[rule of thumb]] that can be used for [[mental calculation]] is the relation

:<math>p(n) \approx {n^2 \over 2H}</math>

which can also be written as

:<math>H \approx {n^2 \over 2p(n)}</math>

or

:<math>n \approx \sqrt { 2H \times p(n)}</math>.

This works well for probabilities less than or equal to 0.5.

This approximation scheme is especially easy to use when working with exponents. For instance, suppose we are building 32-bit hashes (<math>H = 2^{32}</math>) and want the chance of a collision to be at most one in a million (<math> p \approx 2^{-20} </math>). How many documents could we have at most?

:<math>n \approx \sqrt { 2 \times 2^{32} \times 2^{-20}} = \sqrt { 2^{1+32-20} } = \sqrt { 2^{13} } = 2^{6.5} \approx 90.5 </math>

which is close to the correct answer of 93.
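
The worked example can be reproduced numerically, and the rule of thumb compared with the logarithmic formula for <math>n(p;H)</math> at larger probabilities. This is a small sketch with hypothetical function names; the probabilities in the loop are chosen arbitrarily for illustration.

```python
import math


def n_rule_of_thumb(p: float, H: int) -> float:
    """Simple approximation n ~ sqrt(2 H p)."""
    return math.sqrt(2 * H * p)


def n_logarithmic(p: float, H: int) -> float:
    """Approximation n ~ sqrt(2 H ln(1/(1-p))), using log1p for accuracy."""
    return math.sqrt(2 * H * (-math.log1p(-p)))


H = 2**32
print(n_rule_of_thumb(2**-20, H))  # worked example with p rounded to 2^-20
print(n_logarithmic(1e-6, H))      # p taken as exactly one in a million

# Ratio of the two estimates as the target probability grows.
for p in (0.01, 0.25, 0.5):
    print(p, n_rule_of_thumb(p, H) / n_logarithmic(p, H))
```

The ratio printed in the loop falls below 1 as <math>p</math> increases, so the rule of thumb increasingly underestimates the number of values needed as the target probability approaches 0.5.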