Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Hash function
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
==Hashing integer data types== There are several common algorithms for hashing integers. The method giving the best distribution is data-dependent. One of the simplest and most common methods in practice is the modulo division method. === Identity hash function === If the data to be hashed is small enough, then one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of computing this ''[[identity function|identity]]'' hash function is effectively zero. This hash function is [[Perfect hash function|perfect]], as it maps each input to a distinct hash value. The meaning of "small enough" depends on the size of the type that is used as the hashed value. For example, in [[Java (programming language)|Java]], the hash code is a 32-bit integer. Thus the 32-bit integer <code>Integer</code> and 32-bit floating-point <code>Float</code> objects can simply use the value directly, whereas the 64-bit integer <code>Long</code> and 64-bit floating-point <code>Double</code> cannot. Other types of data can also use this hashing scheme. For example, when mapping [[character string]]s between [[letter case|upper and lower case]], one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character ("A" for "a", "8" for "8", etc.). If each character is stored in 8 bits (as in [[extended ASCII]]<ref group=Notes>Plain [[ASCII]] is a 7-bit character encoding, although it is often stored in 8-bit bytes with the highest-order bit always clear (zero). Therefore, for plain ASCII, the bytes have only 2<sup>7</sup> = 128 valid values, and the character translation table has only this many entries.</ref> or [[ISO Latin 1]]), the table has only 2<sup>8</sup> = 256 entries; in the case of [[Unicode]] characters, the table would have 17 Γ 2<sup>16</sup> = {{Val|1114112}} entries. The same technique can be used to map [[ISO 3166-1 alpha-2|two-letter country codes]] like "us" or "za" to country names (26<sup>2</sup> = 676 table entries), 5-digit [[ZIP Code|ZIP codes]] like 13083 to city names ({{val|100000}} entries), etc. Invalid data values (such as the country code "xx" or the ZIP code 00000) may be left undefined in the table or mapped to some appropriate "null" value. === Trivial hash function === If the keys are uniformly or sufficiently uniformly distributed over the key space, so that the key values are essentially random, then they may be considered to be already "hashed". In this case, any number of any bits in the key may be extracted and collated as an index into the hash table. For example, a simple hash function might mask off the {{Mvar|m}} least significant bits and use the result as an index into a hash table of size {{Math|2<sup>''m''</sup>}}. === Mid-squares === A mid-squares hash code is produced by squaring the input and extracting an appropriate number of middle digits or bits. For example, if the input is {{Val|123456789}} and the hash table size {{Val|10000}}, then squaring the key produces {{Val|15241578750190521}}, so the hash code is taken as the middle 4 digits of the 17-digit number (ignoring the high digit) 8750. The mid-squares method produces a reasonable hash code if there is not a lot of leading or trailing zeros in the key. This is a variant of multiplicative hashing, but not as good because an arbitrary key is not a good multiplier. === Division hashing === A standard technique is to use a modulo function on the key, by selecting a divisor {{Mvar|M}} which is a prime number close to the table size, so {{Math|''h''(''K'') ≡ ''K'' (mod ''M'')}}. The table size is usually a power of 2. This gives a distribution from {{Math|{{mset|0, ''M'' − 1}}}}. This gives good results over a large number of key sets. A significant drawback of division hashing is that division requires multiple cycles on most modern architectures (including [[x86]]) and can be 10 times slower than multiplication. A second drawback is that it will not break up clustered keys. For example, the keys 123000, 456000, 789000, etc. modulo 1000 all map to the same address. This technique works well in practice because many key sets are sufficiently random already, and the probability that a key set will be cyclical by a large prime number is small. === Algebraic coding === Algebraic coding is a variant of the division method of hashing which uses division by a polynomial modulo 2 instead of an integer to map {{Mvar|n}} bits to {{Mvar|m}} bits.<ref name="knuth-1973" />{{rp|pages=512β513}} In this approach, {{Math|1=''M'' = 2<sup>''m''</sup>}}, and we postulate an {{Mvar|m}}th-degree polynomial {{Math|1=''Z''(''x'') = ''x''<sup>''m''</sup> + ζ<sub>''m''−1</sub>''x''<sup>''m''−1</sup> + ⋯ + ζ<sub>0</sub>}}. A key {{Math|1=''K'' = (''k''<sub>''n''−1</sub>…''k''<sub>1</sub>''k''<sub>0</sub>)<sub>2</sub>}} can be regarded as the polynomial {{Math|1=''K''(''x'') = ''k''<sub>''n''−1</sub>''x''<sup>''n''−1</sup> + ⋯ + ''k''<sub>1</sub>''x'' + ''k''<sub>0</sub>}}. The remainder using polynomial arithmetic modulo 2 is {{Math|1=''K''(''x'') mod ''Z''(''x'') = ''h''<sub>''m''−1</sub>''x''<sup>''m''−1</sup> + ⋯ ''h''<sub>1</sub>''x'' + ''h''<sub>0</sub>}}. Then {{Math|1=''h''(''K'') = (''h''<sub>''m''−1</sub>…''h''<sub>1</sub>''h''<sub>0</sub>)<sub>2</sub>}}. If {{Math|''Z''(''x'')}} is constructed to have {{Mvar|t}} or fewer non-zero coefficients, then keys which share fewer than {{Mvar|t}} bits are guaranteed to not collide. {{Mvar|Z}} is a function of {{Mvar|k}}, {{Mvar|t}}, and {{Mvar|n}} (the last of which is a divisor of {{Math|2<sup>''k''</sup> − 1}}) and is constructed from the [[finite field]] {{Math|GF(2<sup>''k''</sup>)}}. [[Donald Knuth|Knuth]] gives an example: taking {{Math|1=(''n'',''m'',''t'') = (15,10,7)}} yields {{Math|1=''Z''(''x'') = ''x''<sup>10</sup> + ''x''<sup>8</sup> + ''x''<sup>5</sup> + ''x''<sup>4</sup> + ''x''<sup>2</sup> + ''x'' + 1}}. The derivation is as follows: Let {{Mvar|S}} be the smallest set of integers such that {{Math|{{mset|1,2,…,''t''}} ⊆ ''S''}} and {{Math|(2''j'' mod ''n'') ∈ ''S'' ∀''j'' ∈ ''S''}}.<ref group="Notes">For example, for n=15, k=4, t=6, <math>S=\{1,2,3,4,5,6,8,10,12,9\}</math> [Knuth]</ref> Define <math>P(x)=\prod_{j\in S}(x-\alpha^j)</math> where {{Math|α ∈<sup>''n''</sup> GF(2<sup>''k''</sup>)}} and where the coefficients of {{Math|''P''(''x'')}} are computed in this field. Then the degree of {{Math|1=''P''(''x'') = {{abs|''S''}}}}. Since {{Math|α<sup>2''j''</sup>}} is a root of {{Math|''P''(''x'')}} whenever {{Math|α<sup>''j''</sup>}} is a root, it follows that the coefficients {{Math|''p<sup>i</sup>''}} of {{Math|''P''(''x'')}} satisfy {{Mvar|1=''p''{{su|p=2|b=''i''}} = ''p''<sub>i</sub>}}, so they are all 0 or 1. If {{Math|1=''R''(''x'') = ''r''<sub>''n''−1</sub>''x''<sup>''n''−1</sup> + ⋯ + ''r''<sub>1</sub>''x'' + ''r''<sub>0</sub>}} is any nonzero polynomial modulo 2 with at most {{Mvar|t}} nonzero coefficients, then {{Math|''R''(''x'')}} is not a multiple of {{Math|''P''(''x'')}} modulo 2.<ref group=Notes>Knuth conveniently leaves the proof of this to the reader.</ref> If follows that the corresponding hash function will map keys with fewer than {{Mvar|t}} bits in common to unique indices.<ref name="knuth-1973" />{{rp|pages=542β543}} The usual outcome is that either {{Mvar|n}} will get large, or {{Mvar|t}} will get large, or both, for the scheme to be computationally feasible. Therefore, it is more suited to hardware or microcode implementation.<ref name="knuth-1973" />{{rp|pages=542β543}} === Unique permutation hashing === Unique permutation hashing has a guaranteed best worst-case insertion time.<ref>{{cite journal|doi= 10.1016/j.tcs.2012.12.047|title=Unique permutation hashing|year=2013|doi-access=free|last1=Dolev|first1=Shlomi|last2=Lahiani|first2=Limor|last3=Haviv|first3=Yinnon|journal=Theoretical Computer Science|volume=475|pages=59β65}}</ref> === Multiplicative hashing === Standard multiplicative hashing uses the formula {{Math|1=''h''<sub>''a''</sub>(''K'') = {{floor|(''aK'' mod ''W'') / (''W''/''M'')}}}}, which produces a hash value in {{Math|{{mset|0, …, ''M'' − 1}}}}. The value {{Mvar|a}} is an appropriately chosen value that should be [[Coprime integers|relatively prime]] to {{Mvar|W}}; it should be large,{{Clarify|reason=How large?|date=January 2021}} and its binary representation a random mix{{Clarify|reason="a" is a constant, so it cannot be random|date=January 2021}} of 1s and 0s. An important practical special case occurs when {{Math|1=''W'' = 2<sup>''w''</sup>}} and {{Math|1=''M'' = 2<sup>''m''</sup>}} are powers of 2 and {{Mvar|w}} is the machine [[word size]]. In this case, this formula becomes {{Math|1=''h''<sub>''a''</sub>(''K'') = {{floor|(''aK'' mod 2<sup>''w''</sup>) / 2<sup>''w''−''m''</sup>}}}}. This is special because arithmetic modulo {{Math|2<sup>''w''</sup>}} is done by default in low-level programming languages and integer division by a power of 2 is simply a right-shift, so, in [[C (programming language)|C]], for example, this function becomes <syntaxhighlight lang="c"> unsigned hash(unsigned K) { return (a * K) >> (w - m); } </syntaxhighlight> and for fixed {{Mvar|m}} and {{Mvar|w}} this translates into a single integer multiplication and right-shift, making it one of the fastest hash functions to compute. Multiplicative hashing is susceptible to a "common mistake" that leads to poor diffusion—higher-value input bits do not affect lower-value output bits.<ref> {{cite web |url=http://www.cs.cornell.edu/courses/cs3110/2008fa/lectures/lec21.html |title=CS 3110 Lecture 21: Hash functions |at=Section "Multiplicative hashing"}}</ref> A transmutation on the input which shifts the span of retained top bits down and XORs or ADDs them to the key before the multiplication step corrects for this. The resulting function looks like:<ref name="fibonacci-hashing"/> <syntaxhighlight lang="c"> unsigned hash(unsigned K) { K ^= K >> (w - m); return (a * K) >> (w - m); } </syntaxhighlight> === Fibonacci hashing === [[Fibonacci number|Fibonacci]] hashing is a form of multiplicative hashing in which the multiplier is {{Mvar|2<sup>''w''</sup> / ϕ}}, where {{Mvar|w}} is the machine word length and {{Mvar|ϕ}} (phi) is the [[golden ratio]] (approximately 1.618). A property of this multiplier is that it uniformly distributes over the table space, [[Blockchain|blocks]] of consecutive keys with respect to any block of bits in the key. Consecutive keys within the high bits or low bits of the key (or some other field) are relatively common. The multipliers for various word lengths are: *16: ''a'' = <span style="display: inline-block; width: 11em">9E37<sub>16</sub></span> = {{val|40503}}<sub>10</sub> *32: ''a'' = <span style="display: inline-block; width: 11em">{{gaps|9E37|79B9}}<sub>16</sub></span> = {{val|2654435769}}<sub>10</sub> *48: ''a'' = <span style="display: inline-block; width: 11em">{{gaps|9E37|79B9|7F4B}}<sub>16</sub></span> = {{val|173961102589771}}<sub>10</sub><ref group=Notes>Unisys large systems.</ref> *64: ''a'' = <span style="display: inline-block; width: 11em">{{gaps|9E37|79B9|7F4A|7C15}}<sub>16</sub></span> = {{val|11400714819323198485}}<sub>10</sub> {{^|256+ bits is 0x9E3779B97F4A7C15F39CC0605CEDC8341082276BF3A27251F86C6A11D0C18E95.2767F..., if anyone needs it}} The multiplier should be odd, so the least significant bit of the output is invertible modulo {{Math|2<sup>''w''</sup>}}. The last two values given above are rounded (up and down, respectively) by more than 1/2 of a least-significant bit to achieve this. === Zobrist hashing === {{main | Tabulation hashing|Zobrist hashing}} Tabulation hashing, more generally known as ''Zobrist hashing'' after [[Albert Lindsey Zobrist|Albert Zobrist]], is a method for constructing universal families of hash functions by combining table lookup with XOR operations. This algorithm has proven to be very fast and of high quality for hashing purposes (especially hashing of integer-number keys).<ref>{{citation|first=Albert L.|last=Zobrist|author-link= Albert Lindsey Zobrist|url=https://www.cs.wisc.edu/techreports/1970/TR88.pdf|title=A New Hashing Method with Application for Game Playing|series=Tech. Rep. 88|publisher=Computer Sciences Department, University of Wisconsin|location=Madison, Wisconsin|date=April 1970}}.</ref> Zobrist hashing was originally introduced as a means of compactly representing chess positions in computer game-playing programs. A unique random number was assigned to represent each type of piece (six each for black and white) on each space of the board. Thus a table of 64Γ12 such numbers is initialized at the start of the program. The random numbers could be any length, but 64 bits was natural due to the 64 squares on the board. A position was transcribed by cycling through the pieces in a position, indexing the corresponding random numbers (vacant spaces were not included in the calculation) and XORing them together (the starting value could be 0 (the identity value for XOR) or a random seed). The resulting value was reduced by modulo, folding, or some other operation to produce a hash table index. The original Zobrist hash was stored in the table as the representation of the position. Later, the method was extended to hashing integers by representing each byte in each of 4 possible positions in the word by a unique 32-bit random number. Thus, a table of 2<sup>8</sup>Γ4 random numbers is constructed. A 32-bit hashed integer is transcribed by successively indexing the table with the value of each byte of the plain text integer and XORing the loaded values together (again, the starting value can be the identity value or a random seed). The natural extension to 64-bit integers is by use of a table of 2<sup>8</sup>Γ8 64-bit random numbers. This kind of function has some nice theoretical properties, one of which is called ''3-tuple independence'', meaning that every 3-tuple of keys is equally likely to be mapped to any 3-tuple of hash values. === Customized hash function === A hash function can be designed to exploit existing entropy in the keys. If the keys have leading or trailing zeros, or particular fields that are unused, always zero or some other constant, or generally vary little, then masking out only the volatile bits and hashing on those will provide a better and possibly faster hash function. Selected divisors or multipliers in the division and multiplicative schemes may make more uniform hash functions if the keys are cyclic or have other redundancies.
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)