Perfect hash function
In computer science, a perfect hash function h for a set S is a hash function that maps distinct elements in S to a set of m integers, with no collisions. In mathematical terms, it is an injective function.
Perfect hash functions may be used to implement a lookup table with constant worst-case access time. A perfect hash function can, as any hash function, be used to implement hash tables, with the advantage that no collision resolution has to be implemented. In addition, if the keys are not part of the data and if it is known that queried keys will be valid, then the keys do not need to be stored in the lookup table, saving space.
Disadvantages of perfect hash functions are that S needs to be known for the construction of the perfect hash function. Non-dynamic perfect hash functions need to be re-constructed if S changes. For frequently changing S, dynamic perfect hash functions may be used at the cost of additional space.<ref name="DynamicPerfectHashing" /> The space requirement to store the perfect hash function is in O(n), where n is the number of keys in the structure.
The important performance parameters for perfect hash functions are the evaluation time, which should be constant, the construction time, and the representation size.
Application
A perfect hash function with values in a limited range can be used for efficient lookup operations, by placing keys from S (or other associated values) in a lookup table indexed by the output of the function. One can then test whether a key is present in S, or look up a value associated with that key, by looking for it at its cell of the table. Each such lookup takes constant time in the worst case.<ref name="inventor"/> With perfect hashing, the associated data can be read or written with a single access to the table.<ref>Template:Citation</ref>
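As an illustration of such a lookup table (not taken from the cited sources), the following Python sketch brute-forces a collision-free hash for a tiny fixed key set and then uses it to index a flat table; the key set, values, and helper names are hypothetical.

<syntaxhighlight lang="python">
# Illustrative sketch (not from the cited sources): a lookup table indexed by a
# perfect hash function for a fixed key set.
KEYS = ["apple", "banana", "cherry"]
VALUES = {"apple": 1, "banana": 2, "cherry": 3}

def build_toy_phf(keys):
    """Brute-force a seed that makes Python's built-in hash collision-free on the
    keys -- a toy stand-in for a real perfect hash construction such as FKS or CHD."""
    m = len(keys)
    while True:
        for seed in range(1000):
            if len({hash((seed, k)) % m for k in keys}) == len(keys):
                return (lambda k, s=seed, m=m: hash((s, k)) % m), m
        m += 1  # no collision-free seed found for this range size; enlarge and retry

phf, m = build_toy_phf(KEYS)

# Place each key (and its associated value) at the cell chosen by the function.
table = [None] * m
for k in KEYS:
    table[phf(k)] = (k, VALUES[k])

def lookup(key):
    """Worst-case constant time: one hash evaluation and one table access."""
    entry = table[phf(key)]
    return entry[1] if entry is not None and entry[0] == key else None

print(lookup("banana"), lookup("durian"))  # 2 None
</syntaxhighlight>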
Performance of perfect hash functions
The important performance parameters for perfect hashing are the representation size, the evaluation time, the construction time, and additionally the range requirement <math>\frac{m}{n}</math> (average number of buckets per key in the hash table).<ref name="CHD"/> The evaluation time can be as fast as O(1), which is optimal.<ref name="inventor"/><ref name="CHD"/> The construction time needs to be at least O(n), because each element in S needs to be considered, and S contains n elements. This lower bound can be achieved in practice.<ref name="CHD"/>
The lower bound for the representation size depends on m and n. Let m = (1+ε)n and let h be a perfect hash function. A good approximation for the lower bound is <math>\log_2 e - \varepsilon \log_2 \frac{1+\varepsilon}{\varepsilon}</math> bits per element. For minimal perfect hashing, ε = 0, the lower bound is log<sub>2</sub> e ≈ 1.44 bits per element.<ref name="CHD"/>
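The bound is easy to evaluate numerically. The following illustrative Python lines reproduce the 1.44 bits/key figure for minimal perfect hashing and about 0.886 bits/key for a range of m = 1.23n, consistent with the ≈0.88 bits/key quoted for that range further below.

<syntaxhighlight lang="python">
import math

def space_lower_bound(eps):
    """Approximate lower bound in bits per key for a perfect hash function with
    range m = (1 + eps) * n, following the formula above."""
    if eps == 0:  # minimal perfect hashing
        return math.log2(math.e)
    return math.log2(math.e) - eps * math.log2((1 + eps) / eps)

print(f"{space_lower_bound(0.0):.3f}")   # 1.443 bits/key (minimal perfect hashing)
print(f"{space_lower_bound(0.23):.3f}")  # 0.886 bits/key (range m = 1.23n, cf. the example below)
</syntaxhighlight>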
Construction
A perfect hash function for a specific set S that can be evaluated in constant time, and with values in a small range, can be found by a randomized algorithm in a number of operations that is proportional to the size of S. The original construction of Fredman, Komlós & Szemerédi (1984) uses a two-level scheme to map a set S of n elements to a range of O(n) indices, and then map each index to a range of hash values. The first level of their construction chooses a large prime p (larger than the size of the universe from which S is drawn), and a parameter k, and maps each element x of S to the index
- <math>g(x)=(kx\bmod p)\bmod n.</math>
If k is chosen randomly, this step is likely to have collisions, but the number of elements <math>n_i</math> that are simultaneously mapped to the same index i is likely to be small. The second level of their construction assigns disjoint ranges of <math>O(n_i^2)</math> integers to each index i. It uses a second set of linear modular functions, one for each index i, to map each member x of S into the range associated with g(x).<ref name="inventor">Fredman, Michael L.; Komlós, János; Szemerédi, Endre (1984), "Storing a sparse table with O(1) worst case access time", Journal of the ACM, 31 (3): 538–544.</ref>
As Fredman, Komlós & Szemerédi (1984) show, there exists a choice of the parameter k such that the sum of the lengths of the ranges for the n different values of g(x) is O(n). Additionally, for each value of g(x), there exists a linear modular function that maps the corresponding subset of S into the range associated with that value. Both k, and the second-level functions for each value of g(x), can be found in polynomial time by choosing values randomly until finding one that works.<ref name="inventor"/>
The hash function itself requires storage space O(n) to store k, p, and all of the second-level linear modular functions. Computing the hash value of a given key x may be performed in constant time by computing g(x), looking up the second-level function associated with g(x), and applying this function to x. A modified version of this two-level scheme with a larger number of values at the top level can be used to construct a perfect hash function that maps S into a smaller range of length n + o(n).<ref name="inventor"/>
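The following Python sketch illustrates the two-level idea just described. The threshold 4n on the sum of squared bucket sizes, the retry strategy, and the helper names are illustrative choices, not the exact parameters of Fredman, Komlós & Szemerédi (1984).

<syntaxhighlight lang="python">
import random

def build_fks(S, p):
    """Simplified sketch of a two-level FKS-style scheme for a fixed set S of
    distinct integers smaller than the prime p."""
    n = len(S)
    # First level: pick k so that the squared bucket sizes sum to O(n).
    while True:
        k = random.randrange(1, p)
        buckets = [[] for _ in range(n)]
        for x in S:
            buckets[(k * x % p) % n].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    # Second level: bucket i gets a private range of |B_i|^2 cells and a
    # collision-free linear modular function into that range.
    second, offsets, total = [], [], 0
    for b in buckets:
        size = max(len(b) ** 2, 1)
        while True:
            k2 = random.randrange(1, p)
            if len({(k2 * x % p) % size for x in b}) == len(b):
                break
        second.append((k2, size))
        offsets.append(total)
        total += size
    return k, second, offsets

def fks_index(x, k, second, offsets, p, n):
    """Constant-time evaluation: first-level bucket, then the position inside
    that bucket's private range."""
    i = (k * x % p) % n
    k2, size = second[i]
    return offsets[i] + (k2 * x % p) % size

# Example with hypothetical keys and a prime larger than the largest key.
S = [10, 23, 47, 59, 81]
k, second, offsets = build_fks(S, p=101)
assert len({fks_index(x, k, second, offsets, 101, len(S)) for x in S}) == len(S)
</syntaxhighlight>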
A more recent method for constructing a perfect hash function is described by Belazzougui, Botelho & Dietzfelbinger (2009) as "hash, displace, and compress". Here a first-level hash function g is also used to map elements onto a range of r integers. An element <math>x \in S</math> is stored in the bucket <math>B_{g(x)}</math>.<ref name="CHD" />
Then, in descending order of size, each bucket's elements are hashed by a hash function of a sequence of independent fully random hash functions <math>(\Phi_1, \Phi_2, \Phi_3, \ldots)</math>, starting with <math>\Phi_1</math>. If the hash function does not produce any collisions for the bucket, and the resulting values are not yet occupied by other elements from other buckets, the function is chosen for that bucket. If not, the next hash function in the sequence is tested.<ref name="CHD" />
To evaluate the perfect hash function <math>h(x)</math> one only has to save the mapping σ of the bucket index <math>g(x)</math> onto the correct hash function in the sequence, resulting in <math>h(x) = \Phi_{\sigma(g(x))}(x)</math>.<ref name="CHD" />
Finally, to reduce the representation size, the values <math>(\sigma(i))_{0 \le i < r}</math> are compressed into a form that still allows evaluation in <math>O(1)</math>.<ref name="CHD" />
This approach needs linear time in n for construction, and constant evaluation time. The representation size is in O(n), and depends on the achieved range. For example, with m = 1.23n Belazzougui, Botelho & Dietzfelbinger (2009) achieved a representation size between 3.03 bits/key and 1.40 bits/key for their given example set of 10 million entries, with lower values needing a higher computation time. The space lower bound in this scenario is 0.88 bits/key.<ref name="CHD" />
Pseudocode
algorithm hash, displace, and compress is
(1) Split S into buckets Bi := g⁻¹({i}) ∩ S, 0 ≤ i < r
(2) Sort buckets Bi in falling order according to size |Bi|
(3) Initialize array T[0...m-1] with 0's
(4) for all i ∈ [r], in the order from (2), do
(5)     for l ← 1,2,...
(6)         repeat forming Ki ← {Φl(x) | x ∈ Bi}
(6)     until |Ki| = |Bi| and Ki ∩ {j | T[j] = 1} = ∅
(7)     let σ(i) := the successful l
(8)     for all j ∈ Ki let T[j] := 1
(9) Transform (σ(i))0≤i<r into compressed form, retaining O(1) access.
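For concreteness, here is a runnable Python sketch of the hash-and-displace steps above, omitting the final compression of σ in step (9). The seeded blake2b-based family is only an illustrative stand-in for the fully random hash functions <math>(\Phi_1, \Phi_2, \ldots)</math> assumed by the algorithm, and the key set is hypothetical.

<syntaxhighlight lang="python">
import hashlib

def _h(seed, x, mod):
    """Seeded stand-in for the fully random hash functions: seed 0 plays the role
    of g, seeds 1, 2, 3, ... play the role of Phi_1, Phi_2, Phi_3, ..."""
    d = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % mod

def build_hdc(S, m, r):
    """Hash-and-displace sketch: returns the displacement values sigma
    (the compression of sigma, step (9), is omitted)."""
    buckets = [[] for _ in range(r)]          # (1) split S into buckets B_i
    for x in S:
        buckets[_h(0, x, r)].append(x)
    order = sorted(range(r), key=lambda i: -len(buckets[i]))   # (2) falling size
    T = [0] * m                               # (3) occupancy of the output range
    sigma = [0] * r
    for i in order:                           # (4)
        l = 1
        while True:                           # (5)-(6) try Phi_1, Phi_2, ...
            K = {_h(l, x, m) for x in buckets[i]}
            if len(K) == len(buckets[i]) and all(T[j] == 0 for j in K):
                break
            l += 1
        sigma[i] = l                          # (7)
        for j in K:                           # (8)
            T[j] = 1
    return sigma

def evaluate(x, sigma, m, r):
    """h(x) = Phi_{sigma(g(x))}(x)."""
    return _h(sigma[_h(0, x, r)], x, m)

# Example: a minimal perfect hash (m = |S|) for a hypothetical key set.
S = ["red", "green", "blue", "cyan", "magenta"]
sigma = build_hdc(S, m=len(S), r=3)
assert len({evaluate(x, sigma, len(S), 3) for x in S}) == len(S)
</syntaxhighlight>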
Space lower bounds
The use of O(n) words of information to store the function of Fredman, Komlós & Szemerédi (1984) is near-optimal: any perfect hash function that can be calculated in constant time requires at least a number of bits that is proportional to the size of S.<ref>Template:Citation.</ref>
For minimal perfect hash functions, the information-theoretic space lower bound is
- <math>\log_2e\approx1.44</math>
bits/key.<ref name="CHD" />
For perfect hash functions, it is first assumed that the range of h is bounded by n as m = (1+ε)n. With the formula given by Belazzougui, Botelho & Dietzfelbinger (2009) and for a universe <math>U\supseteq S</math> whose size |U| = u tends towards infinity, the space lower bound is
- <math>\log_2 e-\varepsilon \log_2\frac{1+\varepsilon}{\varepsilon}</math>
bits/key, minus log(n) bits overall.<ref name="CHD" />
Extensions
Dynamic perfect hashing
Main article: Dynamic perfect hashing

Using a perfect hash function is best in situations where there is a frequently queried large set, S, which is seldom updated. This is because any modification of the set S may cause the hash function to no longer be perfect for the modified set. Solutions which update the hash function any time the set is modified are known as dynamic perfect hashing,<ref name="DynamicPerfectHashing">Dietzfelbinger, Martin; Karlin, Anna; Mehlhorn, Kurt; Meyer auf der Heide, Friedhelm; Rohnert, Hans; Tarjan, Robert E. (1994), "Dynamic perfect hashing: upper and lower bounds", SIAM Journal on Computing, 23 (4): 738–761.</ref> but these methods are relatively complicated to implement.
Minimal perfect hash function
A minimal perfect hash function is a perfect hash function that maps n keys to n consecutive integers – usually the numbers from 0 to n − 1 or from 1 to n. A more formal way of expressing this is: Let j and k be elements of some finite set S. Then h is a minimal perfect hash function if and only if h(j) = h(k) implies j = k (injectivity) and there exists an integer a such that the range of h is a..a + |S| − 1. It has been proven that a general purpose minimal perfect hash scheme requires at least <math>\log_2 e \approx 1.44</math> bits/key.<ref name="CHD">Belazzougui, Djamal; Botelho, Fabiano C.; Dietzfelbinger, Martin (2009), "Hash, displace, and compress", Algorithms—ESA 2009, Lecture Notes in Computer Science, vol. 5757, Springer, pp. 682–693.</ref> Assuming that <math>S</math> is a set of size <math>n</math> containing integers in the range <math>[1, 2^{o(n)}]</math>, it is known how to efficiently construct an explicit minimal perfect hash function from <math>S</math> to <math>\{1, 2, \ldots, n\}</math> that uses space <math>n \log_2 e + o(n)</math> bits and that supports constant evaluation time.<ref>Template:Citation</ref> In practice, there are minimal perfect hashing schemes that use roughly 1.56 bits/key if given enough time.<ref name="RecSplit">Template:Citation.</ref>
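The definition can be checked mechanically. The following illustrative Python snippet, using a hypothetical table-based function, tests injectivity and the consecutive-range condition:

<syntaxhighlight lang="python">
def is_minimal_perfect(h, keys):
    """Check the definition above: h is injective on the keys and its range is a
    block of |keys| consecutive integers."""
    values = [h(k) for k in keys]
    injective = len(set(values)) == len(keys)
    consecutive = set(values) == set(range(min(values), min(values) + len(keys)))
    return injective and consecutive

# Hypothetical example: a table-based function mapping four keys onto 0..3.
positions = {"ant": 2, "bee": 0, "cat": 3, "dog": 1}
print(is_minimal_perfect(positions.get, ["ant", "bee", "cat", "dog"]))  # True
</syntaxhighlight>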
k-perfect hashing
A hash function is k-perfect if at most k elements from S are mapped onto the same value in the range. The "hash, displace, and compress" algorithm can be used to construct k-perfect hash functions by allowing up to k collisions. The changes necessary to accomplish this are minimal, and appear in steps (6) and (8) of the adapted pseudocode below:
(4) for all i ∈ [r], in the order from (2), do
(5)     for l ← 1,2,...
(6)         repeat forming Ki ← {Φl(x) | x ∈ Bi}
(6)     until |Ki| = |Bi| and Ki ∩ {j | T[j] = k} = ∅
(7)     let σ(i) := the successful l
(8)     for all j ∈ Ki set T[j] ← T[j] + 1
Order preservation
A minimal perfect hash function F is order preserving if keys are given in some order <math>a_1, a_2, \ldots, a_n</math> and for any keys <math>a_j</math> and <math>a_k</math>, <math>j < k</math> implies <math>F(a_j) < F(a_k)</math>.<ref>Template:Citation</ref> In this case, the function value is just the position of each key in the sorted ordering of all of the keys. A simple implementation of order-preserving minimal perfect hash functions with constant access time is to use an (ordinary) perfect hash function to store a lookup table of the positions of each key. This solution uses <math>O(n \log n)</math> bits, which is optimal in the setting where the comparison function for the keys may be arbitrary.<ref>Template:Citation.</ref> However, if the keys <math>a_1, a_2, \ldots, a_n</math> are integers drawn from a universe <math>\{1, 2, \ldots, U\}</math>, then it is possible to construct an order-preserving hash function using only <math>O(n \log \log \log U)</math> bits of space.<ref>Template:Citation.</ref> Moreover, this bound is known to be optimal.<ref>Template:Citation</ref>
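A sketch of the simple table-based construction just described, assuming an ordinary perfect hash function is already available; the trivial dictionary-based one below is only a placeholder, and the keys are hypothetical.

<syntaxhighlight lang="python">
def build_order_preserving(keys, phf, table_size):
    """Store each key's rank (its position in sorted order) in a table indexed by
    an ordinary perfect hash function phf with range [0, table_size)."""
    ranks = [None] * table_size
    for rank, key in enumerate(sorted(keys)):
        ranks[phf(key)] = rank
    return lambda key: ranks[phf(key)]   # O(1) evaluation, O(n log n) bits of table

# Usage with a placeholder perfect hash function given as an explicit table.
keys = ["pear", "apple", "mango"]
slot = {"pear": 0, "apple": 1, "mango": 2}
F = build_order_preserving(keys, slot.__getitem__, table_size=3)
print([F(k) for k in sorted(keys)])   # [0, 1, 2]
</syntaxhighlight>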
Related constructions
While well-dimensioned hash tables have amortized average O(1) time (amortized average constant time) for lookups, insertions, and deletions, most hash table algorithms suffer from possible worst-case times that take much longer. A worst-case O(1) time (constant time even in the worst case) would be better for many applications (including network routers and memory caches).<ref name="davis">Timothy A. Davis. "Chapter 5 Hashing": subsection "Hash Tables with Worst-Case O(1) Access"</ref>
Few hash table algorithms support worst-case O(1) lookup time (constant lookup time even in the worst case). The few that do include perfect hashing, dynamic perfect hashing, cuckoo hashing, hopscotch hashing, and extendible hashing.<ref name="davis" />
A simple alternative to perfect hashing, which also allows dynamic updates, is cuckoo hashing. This scheme maps keys to two or more locations within a range (unlike perfect hashing which maps each key to a single location) but does so in such a way that the keys can be assigned one-to-one to locations to which they have been mapped. Lookups with this scheme are slower, because multiple locations must be checked, but nevertheless take constant worst-case time.<ref>Template:Citation.</ref>
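The following minimal Python sketch shows the two-candidate-location idea behind cuckoo hashing; the hash family, table size, and eviction limit are illustrative choices rather than any particular published scheme.

<syntaxhighlight lang="python">
import hashlib

def _loc(seed, key, size):
    """Illustrative hash family: two seeds give each key its two candidate cells."""
    d = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % size

class CuckooTable:
    """Each key lives in one of its two candidate cells, so a lookup inspects at
    most two locations -- constant worst-case time."""
    def __init__(self, size):
        self.size = size
        self.cells = [None] * size

    def lookup(self, key):
        return any(self.cells[_loc(s, key, self.size)] == key for s in (0, 1))

    def insert(self, key, max_kicks=32):
        i = _loc(0, key, self.size)
        for _ in range(max_kicks):
            if self.cells[i] is None:
                self.cells[i] = key
                return True
            self.cells[i], key = key, self.cells[i]     # evict the occupant
            i0, i1 = _loc(0, key, self.size), _loc(1, key, self.size)
            i = i1 if i == i0 else i0                   # move it to its other cell
        return False   # a real implementation would rehash or resize here

t = CuckooTable(size=8)
for k in ["a", "b", "c", "d"]:
    t.insert(k)
print(t.lookup("c"), t.lookup("z"))   # expected: True False
</syntaxhighlight>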
References
Further reading
- Richard J. Cichelli. Minimal Perfect Hash Functions Made Simple, Communications of the ACM, Vol. 23, Number 1, January 1980.
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. MIT Press, 2009. Section 11.5: Perfect hashing, pp. 267, 277–282.
- Fabiano C. Botelho, Rasmus Pagh and Nivio Ziviani. "Perfect Hashing for Data Management Applications".
- Fabiano C. Botelho and Nivio Ziviani. "External perfect hashing for very large key sets". 16th ACM Conference on Information and Knowledge Management (CIKM07), Lisbon, Portugal, November 2007.
- Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. "Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses". In Proceedings of the 20th Annual ACM-SIAM Symposium On Discrete Mathematics (SODA), New York, 2009. ACM Press.
- Marshall D. Brain and Alan L. Tharp. "Near-perfect Hashing of Large Word Sets". Software—Practice and Experience, vol. 19(10), 967–978, October 1989. John Wiley & Sons.
- Douglas C. Schmidt, GPERF: A Perfect Hash Function Generator, C++ Report, SIGS, Vol. 10, No. 10, November/December, 1998.
External links
- gperf is an open source C and C++ perfect hash generator (very fast, but only works for small sets)
- Minimal Perfect Hashing (bob algorithm) by Bob Jenkins
- cmph: C Minimal Perfect Hashing Library, open source implementations for many (minimal) perfect hashes (works for big sets)
- Sux4J: open source monotone minimal perfect hashing in Java
- MPHSharp: perfect hashing methods in C#
- BBHash: minimal perfect hash function in header-only C++
- Perfect::Hash, perfect hash generator in Perl that makes C code. Has a "prior art" section worth looking at.