Editing Hash function (section)

== Hashing variable-length data ==
When the data values are long (or variable-length) [[character string]]s—such as personal names, [[URL|web page addresses]], or mail messages—their distribution is usually very uneven, with complicated dependencies. For example, text in any [[natural language]] has highly non-uniform distributions of [[character (computing)|characters]], and [[digraph (computing)|character pairs]], characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string—and depends on each character in a different way.{{clarify|date=September 2019}}

=== Middle and ends ===
Simplistic hash functions may add the first and last {{math|''n''}} characters of a string along with the length, or form a word-size hash from the middle 4 characters of a string.  This saves iterating over the (potentially long) string, but hash functions that do not hash on all characters of a string can readily become linear due to redundancies, clustering, or other pathologies in the key set.  Such strategies may be effective as a custom hash function if the structure of the keys is such that either the middle, ends, or other fields are zero or some other invariant constant that does not differentiate the keys; then the invariant parts of the keys can be ignored.

=== Character folding ===
The paradigmatic example of folding by characters is to add up the integer values of all the characters in the string.   A better idea is to multiply the hash total by a constant, typically a sizable prime number, before adding in the next character, ignoring overflow.  Using exclusive-or instead of addition is also a plausible alternative. The final operation would be a modulo, mask, or other function to reduce the word value to an index the size of the table. The weakness of this procedure is that information may cluster in the upper or lower bits of the bytes; this clustering will remain in the hashed result and cause more collisions than a proper randomizing hash. ASCII byte codes, for example, have an upper bit of 0, and printable strings do not use the last byte code or most of the first 32 byte codes, so the information, which uses the remaining byte codes, is clustered in the remaining bits in an unobvious manner.

The classic approach, dubbed the [[PJW hash function|PJW hash]] based on the work of [[Peter J. Weinberger]] at [[Bell Labs]] in the 1970s, was originally designed for hashing identifiers into compiler symbol tables as given in the [[Compilers: Principles, Techniques, and Tools|"Dragon Book"]].<ref>{{cite book |first1=A. |last1=Aho |author-link1=Alfred Aho |first2=R. |last2=Sethi |author-link2=Ravi Sethi |first3=J. D. |last3=Ullman |author-link3=Jeffrey Ullman |date=1986 |title=Compilers: Principles, Techniques and Tools |page=435 |publisher=[[Addison-Wesley]] |location=Reading, MA |isbn=0-201-10088-6}}</ref> This hash function offsets the bytes 4 bits before adding them together.  When the quantity wraps, the high 4 bits are shifted out and if non-zero, [[Exclusive or|xored]] back into the low byte of the cumulative quantity.  The result is a word-size hash code to which a modulo or other reducing operation can be applied to produce the final hash index.

Today, especially with the advent of 64-bit word sizes, much more efficient variable-length string hashing by word chunks is available.

=== Word length folding ===
{{See also|Universal hashing#Hashing strings}}
Modern microprocessors will allow for much faster processing if 8-bit character strings are not hashed by processing one character at a time, but by interpreting the string as an array of 32-bit or 64-bit integers and hashing/accumulating these "wide word" integer values by means of arithmetic operations (e.g. multiplication by constant and bit-shifting). The final word, which may have unoccupied byte positions, is filled with zeros or a specified randomizing value before being folded into the hash.  The accumulated hash code is reduced by a final modulo or other operation to yield an index into the table.

=== Radix conversion hashing ===
Analogous to the way an ASCII or [[EBCDIC]] character string representing a decimal number is converted to a numeric quantity for computing, a variable-length string can be converted as {{math|''x''<sub>''k''&minus;1</sub>a<sup>''k''−1</sup> + ''x''<sub>''k''&minus;2</sub>a<sup>''k''−2</sup> + &ctdot; + ''x''<sub>1</sub>''a'' + ''x''<sub>0</sub>}}. This is simply a polynomial in a [[radix]] {{math|1=''a''&nbsp;&gt;&nbsp;1}} that takes the components {{math|(''x''<sub>0</sub>,''x''<sub>1</sub>,...,''x''<sub>''k''−1</sub>)}} as the characters of the input string of length {{math|''k''}}. It can be used directly as the hash code, or a hash function applied to it to map the potentially large value to the hash table size. The value of {{math|''a''}} is usually a prime number large enough to hold the number of different characters in the character set of potential keys. Radix conversion hashing of strings minimizes the number of collisions.<ref>{{cite conference |last1=Ramakrishna |first1=M. V. |last2=Zobel |first2=Justin |date=1997 |url=https://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.18.7520&rep=rep1&type=pdf |title=Performance in Practice of String Hashing Functions |book-title=Database Systems for Advanced Applications '97 |conference=DASFAA 1997 |pages=215–224 |citeseerx=10.1.1.18.7520 |doi=10.1142/9789812819536_0023 |isbn=981-02-3107-5 |s2cid=8250194 |access-date=2021-12-06}}</ref> Available data sizes may restrict the maximum length of string that can be hashed with this method.  For example, a 128-bit word will hash only a 26-character alphabetic string (ignoring case) with a radix of 29; a printable ASCII string is limited to 9 characters using radix 97 and a 64-bit word.  However, alphabetic keys are usually of modest length, because keys must be stored in the hash table. Numeric character strings are usually not a problem; 64 bits can count up to {{math|10<sup>19</sup>}}, or 19 decimal digits with radix 10.

=== Rolling hash ===
{{Main|Rolling hash}}
{{See also|Linear congruential generator}}
In some applications, such as [[string searching algorithm|substring search]], one can compute a hash function {{math|''h''}} for every {{math|''k''}}-character [[substring]] of a given {{math|''n''}}-character string by advancing a window of width {{math|''k''}} characters along the string, where  {{math|''k''}} is a fixed integer, and {{Math|''n'' > ''k''}}. The straightforward solution, which is to extract such a substring at every character position in the text and compute {{math|''h''}} separately, requires a number of operations proportional to {{math|''k''·''n''}}. However, with the proper choice of {{math|''h''}}, one can use the technique of rolling hash to compute all those hashes with an effort proportional to {{math|''mk''&nbsp;+&nbsp;''n''}} where {{math|''m''}} is the number of occurrences of the substring.<ref>{{Cite book |last=Singh |first=N. B. |url=https://books.google.com/books?id=ALIMEQAAQBAJ&dq=rolling+hash&pg=PT102 |title=A Handbook of Algorithms |publisher=N.B. Singh |language=en}}</ref>{{fix|text=what is the choice of h?}}

The most familiar algorithm of this type is [[Rabin-Karp]] with best and average case performance {{math|''O''(''n''+''mk'')}} and worst case {{math|''O''(''n''·''k'')}} (in all fairness, the worst case here is gravely pathological: both the text string and substring are composed of a repeated single character, such as {{math|''t''}}="AAAAAAAAAAA", and {{math|''s''}}="AAA").  The hash function used for the algorithm is usually the [[Rabin fingerprint]], designed to avoid collisions in 8-bit character strings, but other suitable hash functions are also used.

=== Fuzzy hash ===
{{excerpt|Fuzzy hashing}}

=== Perceptual hash ===
{{excerpt|Perceptual hashing}}