{{Short description|Computer programming technique}}
'''Double hashing''' is a [[computer programming]] technique used in conjunction with [[open addressing]] in [[hash table]]s to resolve [[hash collision]]s, by using a secondary hash of the key as an offset when a collision occurs. Double hashing with open addressing is a classical data structure on a table <math>T</math>.

The double hashing technique uses one hash value as an index into the table and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; but this interval is set by a second, independent [[hash function]]. Unlike the alternative collision-resolution methods of [[linear probing]] and [[quadratic probing]], the interval depends on the data, so that values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of [[Primary_clustering|clustering]].

Given two random, uniform, and independent hash functions <math>h_1</math> and <math>h_2</math>, the <math>i</math>th location in the bucket sequence for value <math>k</math> in a hash table of <math>|T|</math> buckets is: <math>h(i,k)=(h_1(k) + i \cdot h_2(k))\bmod|T|.</math>
Generally, <math>h_1</math> and <math>h_2</math> are selected from a set of [[universal hash]] functions; <math>h_1</math> is selected to have a range of <math>\{0,\ldots,|T|-1\}</math> and <math>h_2</math> to have a range of <math>\{1,\ldots,|T|-1\}</math>. Double hashing approximates a random distribution; more precisely, pair-wise independent hash functions yield a probability of <math>(n/|T|)^2</math> that any pair of keys will follow the same bucket sequence.
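
The probing rule above can be sketched as a small open-addressing table. This is a minimal illustration rather than a reference implementation: the table size, the reserved empty marker, and the two division-based functions <code>h1</code> and <code>h2</code> are hypothetical placeholders chosen for readability, not universal hash functions.

```c
#include <stdbool.h>

#define TABLE_SIZE 11u  /* |T|: a prime, so every step in 1..10 is coprime to it */
#define EMPTY 0u        /* assumption: key 0 is reserved to mark an empty slot */

static unsigned int table[TABLE_SIZE];  /* zero-initialized, so every slot starts EMPTY */

/* Hypothetical hash functions, for illustration only. */
static unsigned int h1(unsigned int k) { return k % TABLE_SIZE; }              /* 0..|T|-1 */
static unsigned int h2(unsigned int k) { return 1u + k % (TABLE_SIZE - 1u); }  /* 1..|T|-1 */

/* i-th location in the bucket sequence of k: (h1(k) + i*h2(k)) mod |T|. */
static unsigned int probe(unsigned int k, unsigned int i)
{
    return (h1(k) + i * h2(k)) % TABLE_SIZE;
}

/* Insert a non-zero key; returns false if no free slot was found. */
static bool table_insert(unsigned int k)
{
    for (unsigned int i = 0; i < TABLE_SIZE; i++) {
        unsigned int slot = probe(k, i);
        if (table[slot] == EMPTY || table[slot] == k) {
            table[slot] = k;
            return true;
        }
    }
    return false;
}

/* Search stops at the key, at an empty slot, or after |T| probes. */
static bool table_contains(unsigned int k)
{
    for (unsigned int i = 0; i < TABLE_SIZE; i++) {
        unsigned int slot = probe(k, i);
        if (table[slot] == k)
            return true;
        if (table[slot] == EMPTY)
            return false;
    }
    return false;
}
```

In this sketch the keys 5, 16 and 27 all have <code>h1</code> equal to 5, but their steps differ (6, 7 and 8), so after the first collision each key follows its own bucket sequence.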

== Selection of h<sub>2</sub>(k)==
The secondary hash function <math>h_2(k)</math> should have several characteristics:
* It should never yield a value of zero, since a zero step would probe the same location repeatedly.
* It should cycle through the whole table.
* It should be very fast to compute.
* It should be pair-wise independent of <math>h_1(k)</math>.
* The distribution characteristics of <math>h_2</math> are irrelevant. It is analogous to a random-number generator.
* All <math>h_2(k)</math> should be ''relatively prime'' to |''T''|.

In practice:
* If division hashing is used for both functions, the divisors are chosen as primes.
* If |''T''| is a power of 2, the first and last requirements are usually satisfied by making <math>h_2(k)</math> always return an odd number. This has the side effect of doubling the chance of collision due to one wasted bit.<ref name=Dillinger04/>
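
The power-of-two case can be illustrated with a short sketch. The helper names below are hypothetical. The sketch relies on the fact that a step <math>s</math> visits exactly <math>|T|/\gcd(s,|T|)</math> distinct locations, so an odd step covers a power-of-two table completely while an even step does not.

```c
/* Turn a raw hash value into an odd step smaller than table_size.
 * Assumes table_size is a power of two and at least 2. */
static unsigned int odd_step(unsigned int raw_hash, unsigned int table_size)
{
    return (raw_hash | 1u) & (table_size - 1u);
}

/* Greatest common divisor by Euclid's algorithm. */
static unsigned int gcd(unsigned int a, unsigned int b)
{
    while (b != 0) {
        unsigned int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct locations reached by repeatedly adding `step`
 * modulo table_size: table_size / gcd(step, table_size). */
static unsigned int slots_visited(unsigned int step, unsigned int table_size)
{
    return table_size / gcd(step, table_size);
}
```

For a table of 16 buckets, a raw value of 6 becomes the step 7, which reaches all 16 locations, whereas the unmodified step 6 would reach only 8 of them. Setting the low bit is also why one bit of the hash is wasted.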

== Analysis ==

Let <math>n</math> be the number of elements stored in <math>T</math>; then <math>T</math>'s load factor is <math>\alpha = n/|T|</math>. Assume that two [[universal hash]] functions <math>h_1</math> and <math>h_2</math> are selected randomly, uniformly and independently at the start and are used to build a double hashing table <math>T</math>, so that all elements are put in <math>T</math> by double hashing using <math>h_1</math> and <math>h_2</math>.
Given a key <math>k</math>, the <math>(i+1)</math>-st hash location is computed by:

<math display=block> h(i,k) = ( h_1(k) + i \cdot h_2(k) ) \bmod |T|.</math>

Let <math>T</math> have fixed load factor <math>\alpha: 1 > \alpha > 0</math>.
Bradford and [[Michael N. Katehakis|Katehakis]]<ref>{{citation
 | last1 = Bradford | first1 = Phillip G.
 | last2 = Katehakis | first2 = Michael N. | author2-link = Michael N. Katehakis
 | doi = 10.1137/S009753970444630X
 | journal = SIAM Journal on Computing
 | volume = 37
 | issue = 1
 | pages = 83–111
 | date = April 2007
 | mr = 2306284
 | title = A Probabilistic Study on Combinatorial Expanders and Hashing
 | url = http://phillipbradford.com/papers/AProbStudyExpandersAndHashing.pdf
 | archive-url = https://web.archive.org/web/20160125172602/http://phillipbradford.com/papers/AProbStudyExpandersAndHashing.pdf
 | archive-date = 2016-01-25
}}.</ref>
showed the expected number of probes for an unsuccessful search in <math>T</math>, still using these initially chosen hash functions, is <math>\tfrac{1}{1-\alpha}</math> regardless of the distribution of the inputs. Pair-wise independence of the hash functions suffices.

Like all other forms of open addressing, double hashing degrades toward linear search time as the hash table approaches maximum capacity. The usual heuristic is to limit the table loading to 75% of capacity. Eventually, rehashing to a larger size will be necessary, as with all other open addressing schemes.
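
To make the bound concrete, the expected probe count <math>\tfrac{1}{1-\alpha}</math> can be evaluated for a given occupancy. The function name below is hypothetical, and the sketch only evaluates the formula quoted above; it does not simulate a table.

```c
/* Expected number of probes for an unsuccessful search at load factor
 * alpha = n / table_size, using the bound 1 / (1 - alpha).
 * Assumes 0 <= n < table_size. */
static double expected_unsuccessful_probes(unsigned int n, unsigned int table_size)
{
    double alpha = (double)n / (double)table_size;
    return 1.0 / (1.0 - alpha);
}
```

At half capacity the formula gives 2 probes, at the 75% heuristic limit it gives 4, and at 87.5% it gives 8, which shows how quickly the cost grows as the table fills.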

== Variants ==

Peter Dillinger's PhD thesis<ref name=Dillinger10>{{cite thesis
 |title=Adaptive Approximate State Storage
 |first=Peter C. |last=Dillinger
 |date=December 2010
 |publisher=Northeastern University
 |type=PhD thesis
 |pages=93–112
 |url=http://peterd.org/pcd-diss.pdf#page=93
}}</ref> points out that double hashing produces unwanted equivalent hash functions when the hash functions are treated as a set, as in [[Bloom filter]]s: If <math>h_2(y) = -h_2(x)</math> and <math>h_1(y) = h_1(x) + k\cdot h_2(x)</math>, then <math>h(i, y) = h(k - i, x)</math> and the sets of hashes <math>\left\{h(0, x), ..., h(k, x)\right\} = \left\{h(0, y), ..., h(k, y)\right\}</math>  are identical.  This makes a collision twice as likely as the hoped-for <math>1/|T|^2</math>.

There are additionally a significant number of mostly-overlapping hash sets; if <math>h_2(y) = h_2(x)</math> and <math>h_1(y) = h_1(x) \pm h_2(x)</math>, then <math>h(i, y) = h(i\pm 1, x)</math>, and comparing additional hash values (expanding the range of <math>i</math>) is of no help.
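
Both identities can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it works modulo <math>2^{32}</math> (assuming a 32-bit <code>unsigned int</code>) through unsigned wrap-around instead of modulo <math>|T|</math>. The identities hold for any modulus.

```c
/* i-th double-hashing value h1 + i*h2; unsigned arithmetic wraps,
 * which acts as the modular reduction here. */
static unsigned int dh(unsigned int h1, unsigned int h2, unsigned int i)
{
    return h1 + i * h2;
}

/* Build y from x with h2(y) = -h2(x) and h1(y) = h1(x) + k*h2(x), then
 * check h(i, y) == h(k - i, x) for every 0 <= i <= k (k assumed small). */
static int mirrored_sequences_match(unsigned int h1x, unsigned int h2x, unsigned int k)
{
    unsigned int h1y = h1x + k * h2x;
    unsigned int h2y = 0u - h2x;  /* -h2(x) in wrapping arithmetic */

    for (unsigned int i = 0; i <= k; i++) {
        if (dh(h1y, h2y, i) != dh(h1x, h2x, k - i))
            return 0;
    }
    return 1;
}

/* Build y with h2(y) = h2(x) and h1(y) = h1(x) + h2(x), then
 * check h(i, y) == h(i + 1, x) for every 0 <= i < k. */
static int shifted_sequences_match(unsigned int h1x, unsigned int h2x, unsigned int k)
{
    unsigned int h1y = h1x + h2x;

    for (unsigned int i = 0; i < k; i++) {
        if (dh(h1y, h2x, i) != dh(h1x, h2x, i + 1u))
            return 0;
    }
    return 1;
}
```

Both checks succeed for any starting pair of hash values, which reflects that the overlap is a structural property of the probing formula and not of a particular key.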

=== Triple hashing ===
Adding a quadratic term <math>i^2,</math><ref name=Kirsch08>{{cite journal
 |title=Less Hashing, Same Performance: Building a Better Bloom Filter
 |first1=Adam |last1=Kirsch  |first2=Michael |last2=Mitzenmacher |authorlink2=Michael Mitzenmacher
 |journal=Random Structures and Algorithms |volume=33 |issue=2 |pages=187–218
 |date=September 2008 |doi=10.1002/rsa.20208 |citeseerx=10.1.1.152.579
 |url=https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
}}</ref> <math>i(i+1)/2</math> (a [[triangular number]]) or even <math>i^2 \cdot h_3(x)</math> ('''triple hashing''')<ref>Alternatively defined with the triangular number, as in Dillinger 2004.</ref> to the hash function improves the hash function somewhat<ref name=Kirsch08/> but does not fix this problem; if:
: <math>h_1(y) = h_1(x) + k \cdot h_2(x) + k^2 \cdot h_3(x),</math>
: <math>h_2(y) = -h_2(x) - 2k \cdot h_3(x),</math> and
: <math>h_3(y) = h_3(x).</math>
then
: <math>\begin{align}
h(k-i, y) &= h_1(y) + (k - i) \cdot h_2(y) + (k-i)^2 \cdot h_3(y) \\
          &= h_1(y) + (k - i) (-h_2(x) - 2k h_3(x)) + (k-i)^2 h_3(x) \\
          &= \ldots \\
          &= h_1(x) + k h_2(x) + k^2 h_3(x) + (i - k) h_2(x) + (i^2 - k^2) h_3(x) \\
          &= h_1(x) + i h_2(x) + i^2 h_3(x) \\
          &= h(i, x). \\
\end{align}</math>
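
The derivation can also be confirmed numerically. As a sketch under stated assumptions, the code below uses hypothetical function names and wrapping <code>unsigned int</code> arithmetic (modulo <math>2^{32}</math> on typical platforms) in place of reduction modulo <math>|T|</math>.

```c
/* i-th triple-hashing value h1 + i*h2 + i^2*h3, with unsigned wrap-around. */
static unsigned int th(unsigned int h1, unsigned int h2, unsigned int h3,
                       unsigned int i)
{
    return h1 + i * h2 + i * i * h3;
}

/* Construct y from x as in the three conditions above and check
 * h(k - i, y) == h(i, x) for every 0 <= i <= k (k assumed small). */
static int triple_mirror_holds(unsigned int h1x, unsigned int h2x,
                               unsigned int h3x, unsigned int k)
{
    unsigned int h1y = h1x + k * h2x + k * k * h3x;
    unsigned int h2y = 0u - h2x - 2u * k * h3x;
    unsigned int h3y = h3x;

    for (unsigned int i = 0; i <= k; i++) {
        if (th(h1y, h2y, h3y, k - i) != th(h1x, h2x, h3x, i))
            return 0;
    }
    return 1;
}
```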

=== Enhanced double hashing ===

Adding a [[cubic function|cubic term]] <math>i^3</math><ref name=Kirsch08/> or <math>(i^3-i)/6</math> (a [[tetrahedral number]]),<ref name=Dillinger04>{{cite conference
 |title=Bloom Filters in Probabilistic Verification
 |first1=Peter C. |last1=Dillinger  |first2=Panagiotis |last2=Manolios
 |conference=5th International Conference on Formal Methods in Computer Aided Design (FMCAD 2004)
 |location=Austin, Texas |date=November 15–17, 2004
 |doi=10.1007/978-3-540-30494-4_26 |citeseerx=10.1.1.119.628
 |url=https://www.khoury.northeastern.edu/~pete/pub/bloom-filters-verification.pdf
}}</ref> does solve the problem, a technique known as '''enhanced double hashing'''.  This can be computed efficiently by [[Forward difference|forward differencing]]:
<syntaxhighlight lang="c">
struct key;	/// Opaque
/// Use other data types when needed. (Must be unsigned for guaranteed wrapping.)
extern unsigned int h1(struct key const *), h2(struct key const *);

/// Calculate n hash values from two underlying hash functions
/// h1() and h2() using enhanced double hashing.  On return,
///     hashes[i] = h1(x) + i*h2(x) + (i*i*i - i)/6.
/// Takes advantage of automatic wrapping (modular reduction)
/// of unsigned types in C.
void ext_dbl_hash(struct key const *x, unsigned int hashes[], unsigned int n)
{
	unsigned int a = h1(x), b = h2(x), i;

	if (n == 0)
		return;	// Nothing to write; avoids touching hashes[0]
	hashes[0] = a;
	for (i = 1; i < n; i++) {
		a += b;	// Add quadratic difference to get cubic
		b += i;	// Add linear difference to get quadratic
		       	// i++ adds constant difference to get linear
		hashes[i] = a;
	}
}
</syntaxhighlight>

In addition to rectifying the collision problem, enhanced double hashing also removes double-hashing's numerical restrictions on <math>h_2(x)</math>'s properties, allowing a hash function similar in property to (but still independent of) <math>h_1</math> to be used.<ref name=Dillinger04/>
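
As a check on the forward-differencing loop, its result can be compared with the closed form <math>h_1(x) + i \cdot h_2(x) + (i^3-i)/6</math>. The two helper functions below are hypothetical and take the two base hash values directly instead of a key. The closed form computes the cubic term in a wider type, which this sketch assumes is valid only for modest <math>i</math> (below about two million).

```c
/* Closed form: a + i*b + (i^3 - i)/6, reduced by unsigned wrap-around.
 * The cubic term is computed exactly in 64 bits before truncation. */
static unsigned int edh_closed_form(unsigned int a, unsigned int b, unsigned int i)
{
    unsigned long long cubic = ((unsigned long long)i * i * i - i) / 6u;
    return a + i * b + (unsigned int)cubic;
}

/* Same value by forward differencing, following the loop shown above:
 * after step j, b has grown by 1 + 2 + ... + j. */
static unsigned int edh_forward_diff(unsigned int a, unsigned int b, unsigned int i)
{
    for (unsigned int j = 1; j <= i; j++) {
        a += b;  /* add the current quadratic difference */
        b += j;  /* add the linear difference */
    }
    return a;
}
```

With base values 100 and 7, both functions give 125 for <math>i = 3</math> (100 + 21 + 4) and 335 for <math>i = 10</math> (100 + 70 + 165).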

==See also==
* [[Cuckoo hashing]]
* [[2-choice hashing]]

==References==
{{reflist}}

==External links==
*[http://www.siam.org/meetings/alenex05/papers/13gheileman.pdf How Caching Affects Hashing] by Gregory L. Heileman and Wenbin Luo 2005.
*[http://www.cs.pitt.edu/~kirk/cs1501/animations/Hashing.html Hash Table Animation]
*[https://github.com/attractivechaos/klib klib], a C library that includes double hashing functionality.

[[Category:Search algorithms]]
[[Category:Hashing]]