== Properties ==
{{More citations needed|section|date=October 2017}}

=== Uniformity ===
A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same [[probability]]. The reason for this requirement is that the cost of hashing-based methods goes up sharply as the number of ''collisions''—pairs of inputs that are mapped to the same hash value—increases. If some hash values are more likely to occur than others, then a larger fraction of the lookup operations will have to search through a larger set of colliding table entries. This criterion only requires the value to be ''uniformly distributed'', not ''random'' in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries. In other words, if a typical set of {{math|''m''}} records is hashed to {{math|''n''}} table slots, then the probability of a bucket receiving many more than {{math|''m''/''n''}} records should be vanishingly small. In particular, if {{math|''m'' < ''n''}}, then very few buckets should have more than one or two records. A small number of collisions is virtually inevitable, even if {{math|''n''}} is much larger than {{math|''m''}}—see the [[birthday problem]].

In special cases when the keys are known in advance and the key set is static, a hash function can be found that achieves absolute (or collisionless) uniformity. Such a hash function is said to be ''[[Perfect hash function|perfect]]''.
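A short simulation illustrates the birthday effect mentioned above: even with a hundred times more buckets than keys, a few collisions are virtually certain. Python's built-in <code>hash</code> and random 12-letter keys are illustrative stand-ins, not part of the original text.

```python
import random
import string

def count_collisions(m, n, seed=1):
    """Hash m random string keys into n buckets and count keys that
    land in an already-occupied bucket (Python's built-in hash is an
    illustrative stand-in for a good hash function)."""
    rng = random.Random(seed)
    occupied = set()
    collisions = 0
    for _ in range(m):
        key = "".join(rng.choices(string.ascii_letters, k=12))
        bucket = hash(key) % n
        if bucket in occupied:
            collisions += 1
        occupied.add(bucket)
    return collisions

# With n = 100 * m, the expected number of colliding keys is still
# about m**2 / (2 * n) = 5, not zero.
print(count_collisions(1000, 100_000))
```

With {{math|''m'' {{=}} 1000}} keys and {{math|''n'' {{=}} 100{{hsp}}000}} buckets the count is small but almost never zero, exactly as the birthday problem predicts.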
There is no algorithmic way of constructing such a perfect function—searching for one is a [[factorial]] function of the number of keys to be mapped versus the number of table slots that they are mapped into. Finding a perfect hash function over more than a very small set of keys is usually computationally infeasible; the resulting function is likely to be more computationally complex than a standard hash function and provides only a marginal advantage over a function with good statistical properties that yields a minimum number of collisions. See [[Universal hashing|universal hash function]].

=== Testing and measurement ===
When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the [[chi-squared test]]. This test is a goodness-of-fit measure: it compares the actual distribution of items in buckets against the expected (uniform) distribution. The formula is <math>\frac{\sum_{j=0}^{m-1}(b_j)(b_j+1)/2}{(n/2m)(n+2m-1)},</math> where {{Mvar|n}} is the number of keys, {{Mvar|m}} is the number of buckets, and {{Math|''b''<sub>''j''</sub>}} is the number of items in bucket {{Mvar|j}}. A ratio within one confidence interval (such as 0.95 to 1.05) indicates that the hash function evaluated has an expected uniform distribution.

Hash functions can have some technical properties that make it more likely that they will have a uniform distribution when applied. One is the [[strict avalanche criterion]]: whenever a single input bit is complemented, each of the output bits changes with 50% probability. The reason for this property is that selected subsets of the keyspace may have low variability. For the output to be uniformly distributed, a small amount of input variability, even a single bit, should translate into a large amount of variability (i.e. distribution over the tablespace) in the output. Each bit should change with a probability of 50% because, if some bits are reluctant to change, then the keys become clustered around those values.
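The avalanche behaviour just described can be estimated empirically. The sketch below measures, for each input bit, how often each output bit flips; the 64-bit multiplicative hash with the golden-ratio constant is an illustrative test subject, and the sampling method is an assumption rather than a standard test.

```python
import random

MASK64 = 0xFFFFFFFFFFFFFFFF

def mult_hash(x):
    # 64-bit multiplicative hash using the Fibonacci (golden-ratio) constant.
    return (x * 0x9E3779B97F4A7C15) & MASK64

def avalanche_matrix(h, bits=64, trials=200, seed=42):
    """Estimate, for each input bit i and output bit j, the probability
    that output bit j flips when input bit i is complemented.
    Under the strict avalanche criterion every entry should be near 0.5."""
    rng = random.Random(seed)
    counts = [[0] * bits for _ in range(bits)]
    for _ in range(trials):
        x = rng.getrandbits(bits)
        hx = h(x)
        for i in range(bits):
            diff = hx ^ h(x ^ (1 << i))
            for j in range(bits):
                if (diff >> j) & 1:
                    counts[i][j] += 1
    return [[c / trials for c in row] for row in counts]

m = avalanche_matrix(mult_hash)
# Flipping the top input bit can only ever flip the top output bit,
# so a plain multiplicative hash fails the criterion for high input bits.
print(m[63][63], m[63][0])  # → 1.0 0.0
```

Low-order input bits do diffuse upward through the carries of the multiplication, which is why Fibonacci hashing keeps the high bits of the product, as the cited analysis discusses.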
If the bits change too readily, then the mapping approaches a fixed XOR function of a single bit. Standard tests for this property have been described in the literature.<ref>{{cite journal |first1=Julio Cesar Hernandez |last1=Castro |first2=José María |last2=Sierra |first3=Andre |last3=Seznec |first4=Antonio |last4=Izquierdo |first5=Arturo |last5=Ribagorda |display-authors=1 |date=3 February 2005 |title=The strict avalanche criterion randomness test |journal=Mathematics and Computers in Simulation |volume=68 |issue=1 |pages=1–7 |publisher=[[Elsevier]] |doi=10.1016/j.matcom.2004.09.001|s2cid=18086276 }}</ref> The relevance of the criterion to a multiplicative hash function is assessed here.<ref name="fibonacci-hashing">{{cite web |first=Malte |last=Sharupke |title=Fibonacci Hashing: The Optimization that the World Forgot |url=https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/ |website=Probably Dance |date=16 June 2018}}</ref>

=== Efficiency ===
In data storage and retrieval applications, the use of a hash function is a trade-off between search time and data storage space. If search time were unbounded, then a very compact unordered linear list would be the best medium; if storage space were unbounded, then a randomly accessible structure indexable by the key value would be very large and very sparse, but very fast. A hash function takes a finite amount of time to map a potentially large keyspace to a feasible amount of storage space searchable in a bounded amount of time regardless of the number of keys. In most applications, the hash function should be computable with minimum latency and, secondarily, in a minimum number of instructions.
Computational complexity varies with the number of instructions required and the latency of individual instructions: the simplest are the bitwise methods (folding), followed by the multiplicative methods, while the most complex (and slowest) are the division-based methods. Because collisions should be infrequent and cause only a marginal delay, but are otherwise harmless, it is usually preferable to choose a faster hash function over one that needs more computation but saves a few collisions.

Division-based implementations can be of particular concern because division requires multiple cycles on nearly all processor [[microarchitecture]]s. Division ([[Modulo operation|modulo]]) by a constant can be inverted to become multiplication by the word-size multiplicative inverse of that constant. This can be done by the programmer or by the compiler. Division can also be reduced directly into a series of shift-subtracts and shift-adds, though minimizing the number of such operations required is a daunting problem; the number of resulting machine-language instructions may be more than a dozen and swamp the pipeline. If the microarchitecture has [[hardware multiply]] [[functional unit]]s, then the multiply-by-inverse is likely a better approach.

We can allow the table size {{math|''n''}} to not be a power of 2 and still avoid any remainder or division operation, as these computations are sometimes costly. For example, let {{math|''n''}} be significantly less than {{math|2<sup>''b''</sup>}}. Consider a [[pseudorandom number generator]] function {{math|''P''(key)}} that is uniform on the interval {{math|[0, 2<sup>''b''</sup> − 1]}}. A hash function uniform on the interval {{math|[0, ''n'' − 1]}} is {{math|''n'' ''P''(key) / 2<sup>''b''</sup>}}. We can replace the division by a (possibly faster) right [[bit shifting|bit shift]]: {{math|''n P''(key) >> ''b''}}.
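The shift-based range reduction above can be sketched as follows, with {{math|''b'' {{=}} 32}}. The xorshift-multiply mixer standing in for {{math|''P''(key)}} is an illustrative choice, not part of the original text.

```python
MASK32 = 0xFFFFFFFF

def P(key):
    """Stand-in for the uniform generator P(key) on [0, 2**32 - 1]:
    an xorshift-multiply integer mixer (illustrative choice)."""
    x = key & MASK32
    x ^= x >> 16
    x = (x * 0x45D9F3B) & MASK32
    x ^= x >> 16
    x = (x * 0x45D9F3B) & MASK32
    x ^= x >> 16
    return x

def hash_to_range(key, n, b=32):
    # n * P(key) >> b maps a value uniform on [0, 2**b - 1]
    # to [0, n - 1] with no division or remainder operation.
    return (n * P(key)) >> b

# Every result lies in [0, n - 1], even though n = 10 is not a power of 2.
buckets = [hash_to_range(k, 10) for k in range(1000)]
assert all(0 <= v < 10 for v in buckets)
```

On hardware, the multiply and shift compile to two cheap instructions, whereas a general modulo would need a multi-cycle divide.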
If keys are being hashed repeatedly, and the hash function is costly, then computing time can be saved by precomputing the hash codes and storing them with the keys. Matching hash codes almost certainly means that the keys are identical. This technique is used for the transposition table in game-playing programs, which stores a 64-bit hashed representation of the board position.

=== Universality ===
{{main|Universal hashing}}
A ''universal hashing'' scheme is a [[randomized algorithm]] that selects a hash function {{math|''h''}} among a family of such functions, in such a way that the probability of a collision of any two distinct keys is {{math|1/''m''}}, where {{math|''m''}} is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data. It will, however, have more collisions than perfect hashing and may require more operations than a special-purpose hash function.

=== Applicability ===
A hash function that allows only certain table sizes, accepts strings only up to a certain length, or cannot accept a seed (i.e. allow double hashing) is less useful than one that does.{{citation needed|date=February 2022}} A hash function is applicable in a variety of situations.
Particularly within cryptography, notable applications include:<ref>{{Citation |last1=Wagner |first1=Urs |title=Hash Functions |date=2023 |work=Trends in Data Protection and Encryption Technologies |pages=21–24 |editor-last=Mulder |editor-first=Valentin |place=Cham |publisher=Springer Nature Switzerland |language=en |doi=10.1007/978-3-031-33386-6_5 |isbn=978-3-031-33386-6 |last2=Lugrin |first2=Thomas |editor2-last=Mermoud |editor2-first=Alain |editor3-last=Lenders |editor3-first=Vincent |editor4-last=Tellenbach |editor4-first=Bernhard|doi-access=free }}</ref>
* [[File verification|Integrity checking]]: Matching hash values for two files make it overwhelmingly likely that the files are identical, providing a reliable means to detect file modifications.
* [[Key derivation function|Key derivation]]: Minor input changes result in a random-looking output alteration, known as the diffusion property. Thus, hash functions are valuable for key derivation functions.
* [[Message authentication code]]s (MACs): Through the integration of a confidential key with the input data, hash functions can generate MACs ensuring the genuineness of the data, such as in [[HMAC]]s.
* Password storage: The password's hash value does not expose any password details, emphasizing the importance of securely storing hashed passwords on the server.
* [[Digital signature|Signatures]]: Message hashes are signed rather than the whole message.

=== Deterministic ===
A hash procedure must be [[deterministic algorithm|deterministic]]—for a given input value, it must always generate the same hash value. In other words, it must be a [[function (mathematics)|function]] of the data to be hashed, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as [[pseudo-random number generator]]s or the time of day.
It also excludes functions that depend on the memory address of the object being hashed, because the address may change during execution (as may happen on systems that use certain methods of [[garbage collection (computer science)|garbage collection]]), although sometimes rehashing of the item is possible.

The determinism is in the context of the reuse of the function. For example, [[Python (programming language)|Python]] adds the feature that hash functions make use of a randomized seed that is generated once when the Python process starts, in addition to the input to be hashed.<ref>{{Cite web|url=https://docs.python.org/3/reference/datamodel.html#object.__hash__|title=3. Data model — Python 3.6.1 documentation|website=docs.python.org|access-date=2017-03-24}}</ref> The Python hash ([[SipHash]]) is still a valid hash function when used within a single run, but if the values are persisted (for example, written to disk), they can no longer be treated as valid hash values, since in the next run the random value might differ.

=== Defined range ===
It is often desirable that the output of a hash function have fixed size (but see below). If, for example, the output is constrained to 32-bit integer values, then the hash values can be used to index into an array. Such hashing is commonly used to accelerate data searches.<ref name="algorithms_in_java">{{cite book|title=Algorithms in Java|first=Robert|last=Sedgewick|year=2002|publisher=Addison Wesley|edition=3|isbn=978-0201361209|chapter=14. Hashing}}</ref> Producing fixed-length output from variable-length input can be accomplished by breaking the input data into chunks of specific size.
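Folding variable-length input into a fixed-size value, chunk by chunk, can be sketched with a simple polynomial string hash. The multiplier 31 is a conventional illustrative choice (the same recurrence as Java's String.hashCode), not a recommendation from the text.

```python
def string_hash(s, bits=32):
    """Fold a variable-length string into a fixed-size value by
    iteratively mixing in one character (chunk) at a time."""
    mask = (1 << bits) - 1
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & mask  # combine the next chunk
    return h

# The result is a bounded integer usable directly as an array index.
print(string_hash("hash"))  # → 3195150
```

However long the input string, the running value is repeatedly reduced modulo {{math|2<sup>32</sup>}}, so the output always fits the defined range.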
Hash functions used for data searches use some arithmetic expression that iteratively processes chunks of the input (such as the characters in a string) to produce the hash value.<ref name="algorithms_in_java"/>

=== Variable range ===
In many applications, the range of hash values may be different for each run of the program or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function that takes two parameters—the input data {{math|''z''}} and the number {{math|''n''}} of allowed hash values.

A common solution is to compute a fixed hash function with a very large range (say, {{math|0}} to {{math|2<sup>32</sup> − 1}}), divide the result by {{math|''n''}}, and use the division's [[modulo operation|remainder]]. If {{math|''n''}} is itself a power of {{math|2}}, this can be done by [[Mask (computing)|bit masking]] and [[bit shifting]]. When this approach is used, the hash function must be chosen so that the result has a fairly uniform distribution between {{math|0}} and {{math|''n'' − 1}}, for any value of {{math|''n''}} that may occur in the application. Depending on the function, the remainder may be uniform only for certain values of {{math|''n''}}, e.g. [[odd number|odd]] or [[prime number]]s.

=== Variable range with minimal movement (dynamic hash function) ===
When the hash function is used to store values in a hash table that outlives the run of the program, and the hash table needs to be expanded or shrunk, the hash table is referred to as a dynamic hash table. A hash function that relocates the minimum number of records when the table is resized is desirable. What is needed is a hash function {{math|''H''(''z'',''n'')}} (where {{math|''z''}} is the key being hashed and {{math|''n''}} is the number of allowed hash values) such that {{math|1=''H''(''z'',''n'' + 1) = ''H''(''z'',''n'')}} with probability close to {{math|''n''/(''n'' + 1)}}.
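The cost of ignoring this property is easy to measure: with plain modulo reduction, growing a table from {{math|''n''}} to {{math|''n'' + 1}} slots relocates nearly every key, far from the ideal of only about {{math|1/(''n'' + 1)}} of them. The sketch below uses the integer keys themselves as hash values, an illustrative simplification.

```python
def moved_fraction(keys, n):
    """Fraction of keys whose bucket changes when a table using
    plain modulo reduction grows from n to n + 1 slots."""
    moved = sum(1 for z in keys if z % n != z % (n + 1))
    return moved / len(keys)

keys = range(1, 100_000)
# The ideal dynamic hash function would move only ~1/(n + 1) of the
# keys (about 1% for n = 100); plain modulo moves nearly all of them.
print(moved_fraction(keys, 100))
```

This near-total reshuffle on every resize is exactly what the dynamic hash functions discussed next are designed to avoid.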
[[Linear hashing]] and [[spiral hashing]] are examples of dynamic hash functions that execute in constant time but relax the property of uniformity to achieve the minimal movement property. [[Extendible hashing]] uses a dynamic hash function that requires space proportional to {{math|''n''}} to compute the hash function, and it becomes a function of the previous keys that have been inserted. Several algorithms that preserve the uniformity property but require time proportional to {{math|''n''}} to compute the value of {{math|''H''(''z'',''n'')}} have been invented.{{clarify|date=September 2019}} A hash function with minimal movement is especially useful in [[distributed hash table]]s.

=== Data normalization ===
In some applications, the input data may contain features that are irrelevant for comparison purposes. For example, when looking up a personal name, it may be desirable to ignore the distinction between upper- and lower-case letters. For such data, one must use a hash function that is compatible with the data [[equivalence relation|equivalence]] criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value. This can be accomplished by normalizing the input before hashing it, for example by upper-casing all letters.
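Normalize-then-hash can be sketched in a few lines. Here <code>casefold()</code> is Python's Unicode-aware case normalization, an illustrative substitute for the upper-casing mentioned above, and Python's built-in <code>hash</code> stands in for any hash function.

```python
def normalized_hash(name):
    """Hash a personal name case-insensitively: normalize the input
    first, so equivalent names yield the same hash value.
    casefold() is Python's Unicode case normalization; upper-casing
    would serve the same purpose for ASCII names."""
    return hash(name.casefold())

# Names that differ only in letter case deliberately collide.
assert normalized_hash("Alice") == normalized_hash("ALICE") == normalized_hash("alice")
```

The same pattern applies to any equivalence criterion, such as stripping accents or collapsing whitespace, provided the normalization maps every equivalence class to one canonical form.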