
==Calculation==

The index of coincidence provides a measure of how likely it is to draw two matching letters by randomly selecting two letters from a given text. The chance of drawing a given letter in the text is (number of times that letter appears) / (length of the text). The chance of drawing that same letter again (without replacement) is (appearances − 1) / (text length − 1). The product of these two values is the chance of drawing that letter twice in a row. Computing this product for each letter that appears in the text and summing the products gives the chance of drawing two of a kind. This probability can then be normalized by multiplying it by some coefficient, typically 26 for English.

:<math> \mathbf{IC} = c \times \left({\left({\frac{n_\mathrm{a}}{N} \times \frac{n_\mathrm{a} - 1}{N - 1}}\right) + \left({\frac{n_\mathrm{b}}{N} \times \frac{n_\mathrm{b} - 1}{N - 1}}\right) + \cdots + \left({\frac{n_\mathrm{z}}{N} \times \frac{n_\mathrm{z} - 1}{N - 1}}\right)}\right)</math>
where ''c'' is the normalizing coefficient (26 for English), ''n''<sub>a</sub> is the number of times the letter "a" appears in the text, and ''N'' is the length of the text.
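
As a worked illustration of the arithmetic (the four-letter text "ABBA" is an arbitrary example, not taken from any source), ''N'' = 4 and ''n''<sub>a</sub> = ''n''<sub>b</sub> = 2, with every other count equal to zero, so with ''c'' = 26:

:<math>\mathbf{IC} = 26 \times \left({\frac{2}{4} \times \frac{1}{3}} + {\frac{2}{4} \times \frac{1}{3}}\right) = \frac{26}{3} \approx 8.67</math>

A text this short gives a highly variable value; the example only shows how the terms combine.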

The index of coincidence '''IC''' for a given letter-frequency distribution can be expressed as a summation:

:<math>\mathbf{IC} = \frac{\displaystyle\sum_{i=1}^{c}n_i(n_i -1)}{N(N-1)/c}</math>

where ''N'' is the length of the text and ''n''<sub>1</sub> through ''n<sub>c</sub>'' are the [[Letter frequencies|frequencies]] (as integers) of the ''c'' letters of the alphabet (''c'' = 26 for monocase [[English language|English]]).  The sum of the ''n<sub>i</sub>'' is necessarily ''N''.
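
The summation can be sketched in code as follows. This is a minimal illustration rather than a reference implementation: the function name <code>index_of_coincidence</code>, the default 26-letter uppercase alphabet, and the choice to fold case and ignore every character outside the alphabet are all assumptions made for the example.

```python
from collections import Counter
import string


def index_of_coincidence(text, alphabet=string.ascii_uppercase):
    """Return the normalized IC of `text`, counting only letters in `alphabet`.

    Assumption for this sketch: the text is upper-cased first, and any
    character not in `alphabet` (spaces, digits, punctuation) is ignored.
    """
    letters = [ch for ch in text.upper() if ch in alphabet]
    n_total = len(letters)  # N in the formula
    if n_total < 2:
        raise ValueError("need at least two letters to form a pair")

    counts = Counter(letters)  # the n_i; letters that never occur contribute 0
    coincidences = sum(n * (n - 1) for n in counts.values())
    c = len(alphabet)  # normalizing coefficient

    # Sum of n_i(n_i - 1), divided by N(N - 1)/c
    return c * coincidences / (n_total * (n_total - 1))
```

Under these assumptions, a text with no repeated letters gives 0, and a text consisting of a single repeated letter gives ''c''.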

Each product {{math|''n<sub>i</sub>''(''n<sub>i</sub>'' − 1)}} counts the ordered pairs of occurrences of the ''i''-th letter: each of the ''n<sub>i</sub>'' occurrences matches each of the remaining {{math|''n<sub>i</sub>'' − 1}} occurrences of the same letter.  This is twice the number of [[combinations]] of ''n<sub>i</sub>'' elements taken two at a time, but the same factor of 2 appears in the denominator, which counts all {{math|''N''(''N'' − 1)}} ordered letter pairs in the text, so it cancels.  Under a uniform [[random]] distribution of the characters (the "null model"; see below), 1/''c'' is the probability of a match for each pair.  Thus, this formula gives the ratio of the total number of coincidences observed to the total number of coincidences that one would expect from the null model.<ref>{{cite journal |last=Mountjoy |first=Marjorie | title= The Bar Statistics | journal=NSA Technical Journal | year=1963 | volume=VII | issue=2,4}} Published in two parts.</ref>

The expected average value for the IC can be computed from the relative letter frequencies {{math|''f<sub>i</sub>''}} of the source language:

:<math>\mathbf{IC}_{\mathrm{expected}} = \frac{\displaystyle\sum_{i=1}^{c}{f_i}^2}{1/c}.</math>

If all {{mvar|c}} letters of an alphabet were equally probable, the expected index would be 1.0.
The actual monographic IC for [[telegraph]]ic English text is around 1.73, reflecting the unevenness of [[natural language|natural-language]] letter distributions.
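
The expected-value formula can be sketched in the same way. The function name <code>expected_ic</code> is hypothetical, and rescaling the input so that the frequencies sum to 1 is an assumption of this example, not part of the formula; the two-letter distribution in the usage note below is made up for illustration and is not measured data.

```python
def expected_ic(freqs):
    """Return the expected normalized IC for relative letter frequencies.

    `freqs` holds one non-negative weight per letter of the alphabet, so
    c = len(freqs). Assumption for this sketch: the weights are rescaled
    to sum to 1, so raw counts or percentages are also accepted.
    """
    c = len(freqs)
    total = sum(freqs)
    if c == 0 or total <= 0:
        raise ValueError("need at least one positive frequency")

    # Sum of f_i squared, divided by 1/c
    return c * sum((f / total) ** 2 for f in freqs)
```

A uniform distribution such as <code>[1] * 26</code> yields 1.0, matching the statement above, while an uneven one yields more: the made-up distribution <code>[0.75, 0.25]</code> over a two-letter alphabet gives 1.25.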

Sometimes values are reported without the normalizing denominator, for example {{math|1=0.067 = 1.73/26}} for English; such values may be called ''κ''<sub>p</sub> ("kappa-plaintext") rather than IC, with ''κ''<sub>r</sub> ("kappa-random") used to denote the denominator {{math|1/''c''}} (which is the expected coincidence rate for a uniform distribution of the same alphabet, {{math|1=0.0385=1/26}} for English). With the normalized calculation, English plaintext generally falls in the range of 1.5 to 2.0.<ref>{{Cite journal |last=Kontou |first=Eleni |date=18 May 2020 |title=Index of Coincidence |url=https://core.ac.uk/display/327259203 |journal=University of Leicester Open Journals |via=[[CORE_(research_service)|CORE]]}}</ref>
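
Converting between the two conventions is a single multiplication or division by ''c''. The helper names below are hypothetical and serve only to make the relationship explicit:

```python
def kappa_random(c=26):
    """Expected coincidence rate for a uniform distribution over c letters."""
    return 1 / c


def kappa_from_ic(ic, c=26):
    """Convert a normalized IC into an unnormalized kappa value."""
    return ic * kappa_random(c)


def ic_from_kappa(kappa, c=26):
    """Convert an unnormalized kappa value into a normalized IC."""
    return kappa / kappa_random(c)
```

With ''c'' = 26, <code>kappa_from_ic(1.73)</code> is approximately 0.067 and <code>kappa_random()</code> is approximately 0.0385, in line with the figures quoted above.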