Editing Hash table (section)

==Hash function==

A [[hash function]] <math>h : U \rightarrow \{0, ..., m-1\}</math> maps the universe <math>U</math> of keys to indices or slots within the table, that is, <math>h(x) \in \{0, ..., m-1\}</math> for <math>x \in U</math>. The conventional implementations of hash functions are based on the ''integer universe assumption'' that all elements of the table stem from the universe <math>U = \{0, ..., u - 1\}</math>, where the [[bit length]] of <math>u</math> is confined within the [[Word (computer architecture)|word size]] of a [[computer architecture]].{{r|hashhist|p=2}}

A hash function <math>h</math> is said to be [[perfect hash function|perfect]] for a given set <math>S</math> if it is [[injective function|injective]] on <math>S</math>, that is, if each element <math>x \in S</math> maps to a different value in <math>{0, ..., m-1}</math>.<ref name="Yi06">{{cite conference | last1 = Lu | first1 = Yi | last2 = Prabhakar | first2 = Balaji | last3 = Bonomi | first3 = Flavio | doi = 10.1109/ISIT.2006.261567 | conference = 2006 IEEE International Symposium on Information Theory | pages = 2774–2778 | title = Perfect Hashing for Network Applications | year = 2006| isbn = 1-4244-0505-X | s2cid = 1494710 }}</ref><ref name="CHD">{{cite conference | last1 = Belazzougui | first1 = Djamal | last2 = Botelho | first2 = Fabiano C. | last3 = Dietzfelbinger | first3 = Martin | title = Hash, displace, and compress | url = http://cmph.sourceforge.net/papers/esa09.pdf | doi = 10.1007/978-3-642-04128-0_61 | location = Berlin | mr = 2557794 | pages = 682–693 | publisher = Springer | series = [[Lecture Notes in Computer Science]] | book-title = Algorithms—ESA 2009: 17th Annual European Symposium, Copenhagen, Denmark, September 7–9, 2009, Proceedings | volume = 5757 | year = 2009| citeseerx = 10.1.1.568.130}}</ref> A perfect hash function can be created if all the keys are known ahead of time.<ref name="Yi06" />

=== Integer universe assumption ===
The schemes of hashing used in ''integer universe assumption'' include hashing by division, hashing by multiplication, [[universal hashing]], [[dynamic perfect hashing]], and [[Static Hashing|static perfect hashing]].{{r|hashhist|p=2}} However, hashing by division is the commonly used scheme.{{r|cormenalgo01|p=264}}<ref name="owo03">{{cite journal |last1=Owolabi |first1=Olumide |title=Empirical studies of some hashing functions |journal=Information and Software Technology |date=February 2003 |volume=45 |issue=2 |pages=109–112 |doi=10.1016/S0950-5849(02)00174-X }}</ref>{{rp|p=110}}

==== Hashing by division ====
The scheme in hashing by division is as follows:{{r|hashhist|p=2}}
<math display="block">h(x)\ =\ x\, \bmod\, m,</math>
where <math>h(x)</math> is the hash value of <math>x \in S</math> and <math>m</math> is the size of the table.

==== Hashing by multiplication ====
The scheme in hashing by multiplication is as follows:{{r|hashhist|pp=2-3}}
<math display="block">h(x) = \lfloor m \bigl((xA) \bmod 1\bigr) \rfloor</math>
Where <math>A</math> is a non-integer [[Real number|real-valued constant]] and <math>m</math> is the size of the table. An advantage of the hashing by multiplication is that the <math>m</math> is not critical.{{r|hashhist|pp=2-3}} Although any value <math>A</math> produces a hash function, [[Donald Knuth]] suggests using the [[golden ratio]].{{r|hashhist|p=3}}

===Choosing a hash function===

[[Uniform distribution (discrete)|Uniform distribution]] of the hash values is a fundamental requirement of a hash function. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a [[Pearson's chi-squared test#Discrete uniform distribution|Pearson's chi-squared test]] for discrete uniform distributions.<ref name="chernoff">{{Cite journal | first=Karl |last=Pearson |author1-link=Karl Pearson | year = 1900 | title = On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling | journal = Philosophical Magazine |series=Series 5 | volume = 50 | number = 302 | pages = 157–175 | doi=10.1080/14786440009463897 |url=https://zenodo.org/record/1430618 }}</ref><ref name="plackett">{{Cite journal |first=Robin |last=Plackett |author1-link=Robin Plackett | year = 1983 | title =  Karl Pearson and the Chi-Squared Test | journal = International Statistical Review | volume = 51 | number = 1 | pages = 59–72 | doi=10.2307/1402731 |jstor=1402731 }}</ref>
	
The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size, then the hash function needs to be uniform only when the size is a [[power of two]]. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have the size be a [[prime number]].<ref name=":0">{{Cite web|title = Prime Double Hash Table|url = https://www.concentric.net/~Ttwang/tech/primehash.htm|date = March 1997|access-date = 2015-05-10|last = Wang|first = Thomas|archive-url = https://web.archive.org/web/19990903133921/http://www.concentric.net/~Ttwang/tech/primehash.htm|archive-date = 1999-09-03|url-status=dead}}</ref>

For [[open addressing]] schemes, the hash function should also avoid ''[[Primary clustering|clustering]]'', the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash is claimed to have particularly poor clustering behavior.<ref name=":0" /><ref name="knuth"/>
	
[[K-independent hashing]] offers a way to prove a certain hash function does not have bad keysets for a given type of hashtable. A number of K-independence results are known for collision resolution schemes such as linear probing and cuckoo hashing. Since K-independence can prove a hash function works, one can then focus on finding the fastest possible such hash function.<ref>{{cite journal |last1=Wegman |first1=Mark N. |last2=Carter |first2=J.Lawrence |title=New hash functions and their use in authentication and set equality |journal=Journal of Computer and System Sciences |date=June 1981 |volume=22 |issue=3 |pages=265–279 |doi=10.1016/0022-0000(81)90033-7 |doi-access=free }}</ref>