==Collision resolution==
{{see also|2-choice hashing}}
A search algorithm that uses hashing consists of two parts. The first part is computing a [[hash function]] which transforms the search key into an [[array index]]. The ideal case is such that no two search keys hash to the same array index. However, this is not always the case and is impossible to guarantee for data that is not known in advance.<ref name="donald3">{{cite book|title=The Art of Computer Programming: Volume 3: Sorting and Searching|publisher=Addison-Wesley Professional|author=[[Donald E. Knuth]]|date=24 April 1998|url=https://dl.acm.org/doi/10.5555/280635|isbn=978-0-201-89685-5}}</ref>{{rp|p=515}} Hence the second part of the algorithm is collision resolution. The two common methods for collision resolution are separate chaining and open addressing.<ref name="algo1rob">{{cite book|first1=Robert|last1=Sedgewick|first2=Kevin|last2=Wayne|url=https://algs4.cs.princeton.edu/|via=[[Princeton University]], Department of Computer Science|title=Algorithms|edition=4|volume=1|publisher=Addison-Wesley Professional|year=2011|author-link1=Robert Sedgewick (computer scientist)}}</ref>{{rp|p=458}}

===Separate chaining===
[[File:Hash table 5 0 1 1 1 1 1 LL.svg|thumb|450px|right|Hash collision resolved by separate chaining]]
[[File:Hash table 5 0 1 1 1 1 0 LL.svg|thumb|right|500px|Hash collision by separate chaining with head records in the bucket array.]]
In separate chaining, a [[linked list]] of [[key–value pair]]s is built for each array index. The collided items are chained together through a single linked list, which can be traversed to access the item with a unique search key.{{r|algo1rob|p=464}} Collision resolution through chaining with a linked list is a common method of implementing hash tables. Let <math>T</math> and <math>x</math> be the hash table and the node respectively; the operations are as follows:<ref name="cormenalgo01">{{cite book|last1=Cormen|first1=Thomas H.|author1-link=Thomas H. Cormen|last2=Leiserson|first2=Charles E.|author2-link=Charles E. Leiserson|last3=Rivest|first3=Ronald L.|author3-link=Ronald L. Rivest|last4=Stein|first4=Clifford|author4-link=Clifford Stein|title=Introduction to Algorithms|publisher=[[Massachusetts Institute of Technology]]|year=2001|isbn=978-0-262-53196-2|edition=2nd|chapter=Chapter 11: Hash Tables|title-link=Introduction to Algorithms}}</ref>{{rp|p=258}}
<!-- the lines with only a space are significant. don't remove the lines or the spaces -->
 Chained-Hash-Insert(''T'', ''x'')
 ''insert'' ''x'' ''at the head of linked list'' ''T''[''h''(''x''.key)]
 
 Chained-Hash-Search(''T'', ''k'')
 ''search for an element with key'' ''k'' ''in linked list'' ''T''[''h''(''k'')]
 
 Chained-Hash-Delete(''T'', ''x'')
 ''delete'' ''x'' ''from the linked list'' ''T''[''h''(''x''.key)]
If the elements are comparable either [[Sequence#Analysis|numerically]] or [[Lexicographic order|lexically]] and are inserted into the list while maintaining the [[total order]], unsuccessful searches terminate faster.{{r|donald3|pp=520-521}}
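The following is a minimal Python sketch of the chained-hash operations above, using singly linked nodes. The names <code>Node</code>, <code>ChainedHashTable</code> and the use of Python's built-in <code>hash()</code> are illustrative assumptions rather than part of the cited pseudocode.

<syntaxhighlight lang="python">
class Node:
    """A singly linked list node holding one key-value pair."""
    def __init__(self, key, value, next=None):
        self.key = key
        self.value = value
        self.next = next


class ChainedHashTable:
    """Separate chaining: one linked list per bucket (illustrative sketch)."""

    def __init__(self, size=8):
        self.buckets = [None] * size          # T: array of list heads

    def _index(self, key):
        return hash(key) % len(self.buckets)  # h(k)

    def insert(self, key, value):
        """Chained-Hash-Insert: prepend a node at the head of T[h(k)]."""
        i = self._index(key)
        self.buckets[i] = Node(key, value, self.buckets[i])

    def search(self, key):
        """Chained-Hash-Search: walk the list T[h(k)] looking for key k."""
        node = self.buckets[self._index(key)]
        while node is not None:
            if node.key == key:
                return node.value
            node = node.next
        return None                            # unsuccessful search

    def delete(self, key):
        """Chained-Hash-Delete: unlink the node with key k from T[h(k)]."""
        i = self._index(key)
        node, prev = self.buckets[i], None
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.buckets[i] = node.next
                else:
                    prev.next = node.next
                return
            prev, node = node, node.next
</syntaxhighlight>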
====Other data structures for separate chaining====
If the keys are [[total order|ordered]], it could be efficient to use "[[Optimal binary search tree|self-organizing]]" concepts such as using a [[self-balancing binary search tree]], through which the [[Worst-case complexity|theoretical worst case]] could be brought down to <math>O(\log{n})</math>, although it introduces additional complexities.{{r|donald3|p=521}}

In [[dynamic perfect hashing]], two-level hash tables are used to reduce the look-up complexity to a guaranteed <math>O(1)</math> in the worst case. In this technique, the buckets of <math>k</math> entries are organized as [[Perfect hash function|perfect hash tables]] with <math>k^2</math> slots, providing constant worst-case lookup time and low amortized time for insertion.<ref>{{cite web|first1=Erik|last1=Demaine|first2=Jeff|last2=Lind|work=6.897: Advanced Data Structures. MIT Computer Science and Artificial Intelligence Laboratory|date=Spring 2003|url=http://courses.csail.mit.edu/6.897/spring03/scribe_notes/L2/lecture2.pdf|title=Lecture 2|access-date=2008-06-30|url-status=live|archive-url=https://web.archive.org/web/20100615203901/http://courses.csail.mit.edu/6.897/spring03/scribe_notes/L2/lecture2.pdf|archive-date=June 15, 2010|df=mdy-all}}</ref> A study shows array-based separate chaining to be 97% more performant when compared to the standard linked list method under heavy load.{{r|nick05|p=99}}

Techniques such as using a [[fusion tree]] for each bucket also result in constant time for all operations with high probability.<ref>{{cite journal|last=Willard|first=Dan E.|author-link=Dan Willard|doi=10.1137/S0097539797322425|issue=3|journal=[[SIAM Journal on Computing]]|mr=1740562|pages=1030–1049|title=Examining computational geometry, van Emde Boas trees, and hashing from the perspective of the fusion tree|volume=29|year=2000}}</ref>

==== Caching and locality of reference ====
The linked lists of a separate chaining implementation may not be [[Cache-oblivious algorithm|cache-conscious]] due to poor [[spatial locality]]—[[locality of reference]]—since the nodes of a linked list may be scattered across memory, so list traversal during insert and search may entail [[CPU cache]] inefficiencies.<ref name="nick05">{{cite book|doi=10.1007/11575832_1|chapter=Enhanced Byte Codes with Restricted Prefix Properties|title=String Processing and Information Retrieval|series=Lecture Notes in Computer Science|date=2005|last1=Culpepper|first1=J. Shane|last2=Moffat|first2=Alistair|volume=3772|pages=1–12|isbn=978-3-540-29740-6}}</ref>{{rp|p=91}}

In [[cache-oblivious algorithm|cache-conscious variants]] of collision resolution through separate chaining, a [[dynamic array]], which is found to be more [[CPU cache|cache-friendly]], is used in the place where a linked list or a self-balancing binary search tree would usually be deployed, since the [[Memory management (operating systems)#Single contiguous allocation|contiguous allocation]] pattern of the array could be exploited by [[Cache prefetching|hardware-cache prefetchers]]—such as the [[translation lookaside buffer]]—resulting in reduced access time and memory consumption.<ref>{{cite journal|last1=Askitis|first1=Nikolas|last2=Sinha|first2=Ranjan|title=Engineering scalable, cache and space efficient tries for strings|journal=The VLDB Journal|date=October 2010|volume=19|issue=5|pages=633–660|doi=10.1007/s00778-010-0183-9}}</ref><ref>{{Cite conference|title=Cache-conscious Collision Resolution in String Hash Tables|first1=Nikolas|last1=Askitis|first2=Justin|last2=Zobel|date=October 2005|isbn=978-3-540-29740-6|pages=91–102|book-title=Proceedings of the 12th International Conference, String Processing and Information Retrieval (SPIRE 2005)|doi=10.1007/11575832_11|volume=3772/2005}}</ref><ref>{{Cite conference|title=Fast and Compact Hash Tables for Integer Keys|first1=Nikolas|last1=Askitis|year=2009|isbn=978-1-920682-72-9|url=http://crpit.com/confpapers/CRPITV91Askitis.pdf|pages=113–122|book-title=Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009)|volume=91|url-status=dead|archive-url=https://web.archive.org/web/20110216180225/http://crpit.com/confpapers/CRPITV91Askitis.pdf|archive-date=February 16, 2011|df=mdy-all|access-date=June 13, 2010}}</ref>
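As a rough illustration of the cache-conscious variant described above, the sketch below stores each bucket as a plain Python list (a dynamic array of key-value pairs) instead of a linked list, so that the entries of a bucket sit in contiguous memory. The class name <code>ArrayChainedHashTable</code> is hypothetical and the sketch is not tied to any of the cited implementations.

<syntaxhighlight lang="python">
class ArrayChainedHashTable:
    """Separate chaining with dynamic-array buckets instead of linked lists."""

    def __init__(self, size=8):
        # Each bucket is a contiguous dynamic array of (key, value) pairs.
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # overwrite an existing key in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # appending keeps the bucket contiguous

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None
</syntaxhighlight>

Scanning a short contiguous bucket tends to touch fewer cache lines than chasing pointers through nodes scattered across memory.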
===Open addressing===
{{Main|Open addressing}}
[[File:Hash table 5 0 1 1 1 1 0 SP.svg|thumb|380px|right|Hash collision resolved by open addressing with linear probing (interval=1). Note that "Ted Baker" has a unique hash, but nevertheless collided with "Sandra Dee", that had previously collided with "John Smith".]]
[[File:Hash table average insertion time.png|thumb|right|362px|This graph compares the average number of CPU cache misses required to look up elements in large hash tables (far exceeding the size of the cache) with chaining and linear probing. Linear probing performs better due to better [[locality of reference]], though as the table gets full, its performance degrades drastically.]]
[[Open addressing]] is another collision resolution technique in which every entry record is stored in the bucket array itself, and the hash resolution is performed through '''probing'''. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some ''probe sequence'', until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates an unsuccessful search.<ref name="tenenbaum90">{{Cite book|title=Data Structures Using C|first1=Aaron M.|last1=Tenenbaum|first2=Yedidyah|last2=Langsam|first3=Moshe J.|last3=Augenstein|publisher=Prentice Hall|year=1990|isbn=978-0-13-199746-2|pages=456–461, 472}}</ref>

Well-known probe sequences include:
* [[Linear probing]], in which the interval between probes is fixed (usually 1); see the sketch after this list.<ref name="Cuckoo">{{Cite book|last1=Pagh|first1=Rasmus|author1-link=Rasmus Pagh|last2=Rodler|first2=Flemming Friche|chapter=Cuckoo Hashing|doi=10.1007/3-540-44676-1_10|title=Algorithms — ESA 2001|series=Lecture Notes in Computer Science|volume=2161|pages=121–133|year=2001|isbn=978-3-540-42493-2|citeseerx=10.1.1.25.4189}}</ref>
* [[Quadratic probing]], in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the value given by the original hash computation.{{r|clrs|p=272}}
* [[Double hashing]], in which the interval between probes is computed by a secondary hash function.{{r|clrs|pp=272-273}}

The performance of open addressing may be slower compared to separate chaining, since the probe sequence lengthens as the load factor <math>\alpha</math> approaches 1.<ref name="cornell08" />{{r|nick05|p=93}} The probing results in an [[infinite loop]] if the load factor reaches 1, in the case of a completely filled table.{{r|algo1rob|p=471}} The [[Average-case complexity|average cost]] of linear probing depends on the hash function's ability to [[Probability distribution|distribute]] the elements [[Continuous uniform distribution|uniformly]] throughout the table to avoid [[Cluster analysis|clustering]], since formation of clusters would result in increased search time.{{r|algo1rob|p=472}}
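A minimal sketch of open addressing with linear probing follows. It omits deletion (which would require tombstones or re-insertion of the trailing cluster) and resizing; the class name <code>LinearProbingHashTable</code> and the fixed table size are illustrative assumptions.

<syntaxhighlight lang="python">
class LinearProbingHashTable:
    """Open addressing with linear probing (probe interval of 1)."""

    _EMPTY = object()                     # sentinel marking an unused slot

    def __init__(self, size=8):
        self.keys = [self._EMPTY] * size
        self.values = [None] * size

    def _index(self, key):
        return hash(key) % len(self.keys)

    def insert(self, key, value):
        i = self._index(key)
        for _ in range(len(self.keys)):   # probe each slot at most once
            if self.keys[i] is self._EMPTY or self.keys[i] == key:
                self.keys[i], self.values[i] = key, value
                return
            i = (i + 1) % len(self.keys)  # fixed interval of 1
        raise RuntimeError("table is full; a real table would resize here")

    def search(self, key):
        i = self._index(key)
        for _ in range(len(self.keys)):
            if self.keys[i] is self._EMPTY:
                return None               # unused slot reached: unsuccessful search
            if self.keys[i] == key:
                return self.values[i]
            i = (i + 1) % len(self.keys)
        return None
</syntaxhighlight>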
==== Caching and locality of reference ====
Since the slots are located in successive locations, linear probing could lead to better utilization of the [[CPU cache]] due to [[locality of reference]], resulting in reduced [[memory latency]].<ref name="Cuckoo" />

====Other collision resolution techniques based on open addressing====

=====Coalesced hashing=====
{{main|Coalesced hashing}}
[[Coalesced hashing]] is a hybrid of both separate chaining and open addressing in which the buckets or nodes link within the table.<ref name="chen87">{{cite book|publisher=[[Oxford University Press]]|year=1987|first1=Jeffery S.|last1=Vitter|first2=Wen-Chin|last2=Chen|isbn=978-0-19-504182-8|location=New York, United States|url=https://archive.org/details/designanalysisof0000vitt/|url-access=registration|via=[[Archive.org]]|title=The design and analysis of coalesced hashing}}</ref>{{rp|pp=6–8}} The algorithm is ideally suited for [[Memory pool|fixed memory allocation]].{{r|chen87|p=4}} A collision in coalesced hashing is resolved by identifying the largest-indexed empty slot in the hash table and inserting the colliding value into that slot; the bucket is then linked to the slot of the inserted node, which contains its colliding hash address.{{r|chen87|p=8}}
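The sketch below illustrates this idea with an explicit link array inside the table. For clarity the largest-indexed empty slot is found by a linear scan; a practical implementation, as in Vitter and Chen's treatment, would typically maintain a cursor instead of rescanning. The class name <code>CoalescedHashTable</code> is hypothetical.

<syntaxhighlight lang="python">
class CoalescedHashTable:
    """Coalesced hashing: collision chains are linked through slots of the table itself."""

    _EMPTY = object()

    def __init__(self, size=16):
        self.keys = [self._EMPTY] * size
        self.values = [None] * size
        self.links = [None] * size        # index of the next node in the chain, or None

    def _index(self, key):
        return hash(key) % len(self.keys)

    def insert(self, key, value):
        i = self._index(key)
        if self.keys[i] is self._EMPTY:   # home slot is free
            self.keys[i], self.values[i] = key, value
            return
        # Walk the existing chain, stopping at its last node (or updating on a match).
        while True:
            if self.keys[i] == key:
                self.values[i] = value
                return
            if self.links[i] is None:
                break
            i = self.links[i]
        # Find the largest-indexed empty slot and place the colliding item there.
        j = len(self.keys) - 1
        while j >= 0 and self.keys[j] is not self._EMPTY:
            j -= 1
        if j < 0:
            raise RuntimeError("table is full; a real table would resize here")
        self.keys[j], self.values[j] = key, value
        self.links[i] = j                 # link the new node onto the chain

    def search(self, key):
        i = self._index(key)
        while i is not None and self.keys[i] is not self._EMPTY:
            if self.keys[i] == key:
                return self.values[i]
            i = self.links[i]
        return None
</syntaxhighlight>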
=====Cuckoo hashing=====
{{main|Cuckoo hashing}}
[[Cuckoo hashing]] is a form of open addressing collision resolution technique which guarantees <math>O(1)</math> worst-case lookup complexity and constant amortized time for insertions. Collisions are resolved by maintaining two hash tables, each with its own hash function: a collided slot is overwritten with the given item, and the element that previously occupied the slot is displaced into the other hash table. The process continues until every key has its own spot in the buckets of the tables; if the procedure enters an [[infinite loop]]—which is detected by maintaining a threshold loop counter—both hash tables are rehashed with newer hash functions and the procedure continues.{{r|Cuckoo|pp=124–125}}
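The sketch below illustrates cuckoo insertion with two tables and two hash functions. The displacement threshold <code>MAX_KICKS</code>, the derivation of the two hash functions by salting Python's built-in <code>hash()</code>, and the class name are illustrative assumptions; rehashing is left as a stub.

<syntaxhighlight lang="python">
class CuckooHashTable:
    """Cuckoo hashing: each key lives in exactly one of its two candidate slots."""

    MAX_KICKS = 32                        # displacement threshold before rehashing

    def __init__(self, size=16):
        self.size = size
        self.tables = [[None] * size, [None] * size]   # each entry: (key, value) or None

    def _index(self, which, key):
        # Two different hash functions, here obtained by salting the built-in hash().
        return hash((which, key)) % self.size

    def search(self, key):
        for which in (0, 1):
            entry = self.tables[which][self._index(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        # Overwrite if the key is already present in one of its two candidate slots.
        for which in (0, 1):
            i = self._index(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        item, which = (key, value), 0
        for _ in range(self.MAX_KICKS):
            i = self._index(which, item[0])
            # Place the carried item, evicting whatever occupied the slot.
            item, self.tables[which][i] = self.tables[which][i], item
            if item is None:              # the slot was free: done
                return
            which = 1 - which             # the displaced entry moves to the other table
        # Loop detected (threshold exceeded): rebuild both tables with new hash functions.
        raise NotImplementedError("rehash with new hash functions, then retry the insertion")
</syntaxhighlight>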
=====Hopscotch hashing=====
{{main|Hopscotch hashing}}
[[Hopscotch hashing]] is an open addressing based algorithm which combines the elements of [[cuckoo hashing]], [[linear probing]] and chaining through the notion of a ''neighbourhood'' of buckets—the subsequent buckets around any given occupied bucket, also called a "virtual" bucket.<ref name="nir08">{{cite book|doi=10.1007/978-3-540-87779-0_24|chapter=Hopscotch Hashing|title=Distributed Computing|series=Lecture Notes in Computer Science|date=2008|last1=Herlihy|first1=Maurice|last2=Shavit|first2=Nir|last3=Tzafrir|first3=Moran|volume=5218|pages=350–364|isbn=978-3-540-87778-3}}</ref>{{rp|pp=351–352}} The algorithm is designed to deliver better performance when the load factor of the hash table grows beyond 90%; it also provides high throughput in [[Concurrent computing|concurrent settings]], thus it is well suited for implementing a resizable [[concurrent hash table]].{{r|nir08|p=350}} The neighbourhood characteristic of hopscotch hashing guarantees the property that the cost of finding the desired item in any bucket within the neighbourhood is very close to the cost of finding it in the bucket itself; the algorithm attempts to place an item into its neighbourhood—with a possible cost involved in displacing other items.{{r|nir08|p=352}}

Each bucket within the hash table includes an additional "hop-information"—an ''H''-bit [[bit array]] indicating the [[Euclidean distance#One dimension|relative distance]] of the item which was originally hashed into the current virtual bucket within ''H'' − 1 entries.{{r|nir08|p=352}} Let <math>k</math> and <math>B_k</math> be the key to be inserted and the bucket to which the key hashes, respectively; several cases are involved in the insertion procedure such that the neighbourhood property of the algorithm is maintained:{{r|nir08|pp=352–353}} if <math>B_k</math> is empty, the element is inserted and the leftmost bit of the bitmap is [[Bitwise operation|set]] to 1; if not empty, linear probing is used to find an empty slot in the table and the bitmap of the bucket is updated followed by the insertion; if the empty slot is not within the range of the ''neighbourhood'', i.e. ''H'' − 1, subsequent swap and hop-info bit array manipulation of each bucket is performed in accordance with its neighbourhood [[Invariant (mathematics)|invariant properties]].{{r|nir08|p=353}}

=====Robin Hood hashing=====
Robin Hood hashing is an open addressing based collision resolution algorithm; the collisions are resolved through favouring the displacement of the element that is farthest—or has the longest ''probe sequence length'' (PSL)—from its "home location", i.e. the bucket to which the item was hashed.<ref name="waterloo86">{{cite book|title=Robin Hood Hashing|first=Pedro|last=Celis|publisher=[[University of Waterloo]], Dept. of Computer Science|year=1986|url=https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf|location=Ontario, Canada|isbn=978-0-315-29700-5|oclc=14083698|archive-url=https://web.archive.org/web/20211101071032/https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf|archive-date=1 November 2021|access-date=2 November 2021|url-status=live}}</ref>{{rp|p=12}} Although Robin Hood hashing does not change the [[Computational complexity theory|theoretical search cost]], it significantly affects the [[variance]] of the [[Probability distribution|distribution]] of the items on the buckets,<ref>{{cite journal|last1=Poblete|first1=P. V.|last2=Viola|first2=A.|title=Analysis of Robin Hood and Other Hashing Algorithms Under the Random Probing Model, With and Without Deletions|journal=Combinatorics, Probability and Computing|date=July 2019|volume=28|issue=4|pages=600–617|doi=10.1017/S0963548318000408|s2cid=125374363}}</ref>{{rp|p=2}} i.e. it limits [[Cluster analysis|cluster]] formation in the hash table.<ref name="cornell14">{{cite web|url=https://www.cs.cornell.edu/courses/cs3110/2014fa/lectures/13/lec13.html|title=Lecture 13: Hash tables|publisher=[[Cornell University]], Department of Computer Science|first=Michael|last=Clarkson|access-date=1 November 2021|year=2014|archive-url=https://web.archive.org/web/20211007011300/https://www.cs.cornell.edu/courses/cs3110/2014fa/lectures/13/lec13.html|archive-date=7 October 2021|url-status=live|via=cs.cornell.edu}}</ref> Each node within the hash table that uses Robin Hood hashing should be augmented to store an extra PSL value.<ref>{{cite web|publisher=[[Cornell University]], Department of Computer Science|url=https://www.cs.cornell.edu/courses/JavaAndDS/files/hashing_RobinHood.pdf|title=JavaHyperText and Data Structure: Robin Hood Hashing|access-date=2 November 2021|first=David|last=Gries|year=2017|archive-url=https://web.archive.org/web/20210426051503/http://www.cs.cornell.edu/courses/JavaAndDS/files/hashing_RobinHood.pdf|archive-date=26 April 2021|url-status=live|via=cs.cornell.edu}}</ref> Let <math>x</math> be the key to be inserted, <math>x{.}\text{psl}</math> be the (incremental) PSL of <math>x</math>, <math>T</math> be the hash table and <math>j</math> be the index; the insertion procedure is as follows (a sketch follows the list):{{r|waterloo86|pp=12-13}}<ref name="indiana88">{{cite tech report|first=Pedro|last=Celis|date=28 March 1988|number=246|institution=[[Indiana University]], Department of Computer Science|location=Bloomington, Indiana|url=https://legacy.cs.indiana.edu/ftp/techreports/TR246.pdf|archive-url=https://web.archive.org/web/20211103013505/https://legacy.cs.indiana.edu/ftp/techreports/TR246.pdf|archive-date=3 November 2021|access-date=2 November 2021|url-status=live|title=External Robin Hood Hashing}}</ref>{{rp|p=5}}
* If <math>x{.}\text{psl}\ \le\ T[j]{.}\text{psl}</math>: the iteration goes into the next bucket without attempting an external probe.
* If <math>x{.}\text{psl}\ >\ T[j]{.}\text{psl}</math>: insert the item <math>x</math> into the bucket <math>j</math>; swap <math>x</math> with <math>T[j]</math>—let it be <math>x'</math>; continue the probe from the <math>(j+1)</math>th bucket to insert <math>x'</math>; repeat the procedure until every element is inserted.
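The following is a minimal Python sketch of the insertion procedure in the list above, storing each entry's PSL alongside its key and value. The class name <code>RobinHoodHashTable</code> is illustrative, and deletion and resizing are omitted.

<syntaxhighlight lang="python">
class RobinHoodHashTable:
    """Open addressing where a probing entry steals the slot of any entry with a smaller PSL."""

    def __init__(self, size=16):
        self.slots = [None] * size                  # each slot holds (key, value, psl) or None

    def _home(self, key):
        return hash(key) % len(self.slots)

    def insert(self, key, value):
        j = self._home(key)
        entry = (key, value, 0)                     # x with x.psl = 0 at its home location
        for _ in range(len(self.slots)):
            slot = self.slots[j]
            if slot is None:
                self.slots[j] = entry
                return
            if slot[0] == entry[0]:                 # key already present: overwrite the value
                self.slots[j] = (entry[0], entry[1], slot[2])
                return
            if entry[2] > slot[2]:                  # x.psl > T[j].psl: take bucket j,
                self.slots[j], entry = entry, slot  # and continue probing with the evicted x'
            j = (j + 1) % len(self.slots)           # otherwise move on to the next bucket
            entry = (entry[0], entry[1], entry[2] + 1)
        raise RuntimeError("table is full; a real table would resize here")

    def search(self, key):
        j, psl = self._home(key), 0
        # Stop early once the probed entry is closer to its home than the query would be.
        while self.slots[j] is not None and psl <= self.slots[j][2]:
            if self.slots[j][0] == key:
                return self.slots[j][1]
            j = (j + 1) % len(self.slots)
            psl += 1
        return None
</syntaxhighlight>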