Editing Hamming weight (section)

== Efficient implementation ==
The population count of a [[bit array|bitstring]] is often needed in cryptography and other applications. The [[Hamming distance]] of two words ''A'' and ''B'' can be calculated as the Hamming weight of ''A'' [[xor]] ''B''.<ref name="Warren_2013"/>

The problem of how to implement it efficiently has been widely studied. A single operation for the calculation, or parallel operations on bit vectors are [[#Processor support|available on some processors]]. For processors lacking those features, the best solutions known are based on adding counts in a tree pattern. For example, to count the number of 1 bits in the 16-bit binary number a&nbsp;=&nbsp;0110&nbsp;1100&nbsp;1011&nbsp;1010, these operations can be done:

{| class="wikitable" style="text-align:center;"
! Expression
! colspan=8 | Binary
! Decimal
! Comment
|-
| style="text-align:left;" | <code>a</code>
| 01
| 10
| 11
| 00
| 10
| 11
| 10
| 10
| style="text-align:left;"   {{n/a|27834}}
| style="text-align:left;" | The original number
|-
| style="text-align:left;" | <code>b0 = (a >> 0) & 01 01 01 01 01 01 01 01</code>
| 01
| 00
| 01
| 00
| 00
| 01
| 00
| 00
| style="text-align:left;" | 1, 0, 1, 0, 0, 1, 0, 0
| style="text-align:left;" | Every other bit from a
|-
| style="text-align:left;" | <code>b1 = (a >> 1) & 01 01 01 01 01 01 01 01</code>
| 00
| 01
| 01
| 00
| 01
| 01
| 01
| 01
| style="text-align:left;" | 0, 1, 1, 0, 1, 1, 1, 1
| style="text-align:left;" | The remaining bits from a
|-
| style="text-align:left;" | <code>c = b0 + b1</code>
| 01
| 01
| 10
| 00
| 01
| 10
| 01
| 01
| style="text-align:left;" | 1, 1, 2, 0, 1, 2, 1, 1
| style="text-align:left;" | Count of 1s in each 2-bit slice of a
|-
| style="text-align:left;" | <code>d0 = (c >> 0) & 0011 0011 0011 0011</code>
| colspan=2 | 0001
| colspan=2 | 0000
| colspan=2 | 0010
| colspan=2 | 0001
| style="text-align:left;" | 1, 0, 2, 1
| style="text-align:left;" | Every other count from c
|-
| style="text-align:left;" | <code>d2 = (c >> 2) & 0011 0011 0011 0011</code>
| colspan=2 | 0001
| colspan=2 | 0010
| colspan=2 | 0001
| colspan=2 | 0001
| style="text-align:left;" | 1, 2, 1, 1
| style="text-align:left;" | The remaining counts from c
|-
| style="text-align:left;" | <code>e = d0 + d2</code>
| colspan=2 | 0010
| colspan=2 | 0010
| colspan=2 | 0011
| colspan=2 | 0010
| style="text-align:left;" | 2, 2, 3, 2
| style="text-align:left;" | Count of 1s in each 4-bit slice of a
|-
| style="text-align:left;" | <code>f0 = (e >> 0) & 00001111  00001111</code>
| colspan=4 | 00000010
| colspan=4 | 00000010
| style="text-align:left;" | 2, 2
| style="text-align:left;" | Every other count from e
|-
| style="text-align:left;" | <code>f4 = (e >> 4) & 00001111 00001111</code>
| colspan=4 | 00000010
| colspan=4 | 00000011
| style="text-align:left;" | 2, 3
| style="text-align:left;" | The remaining counts from e
|-
| style="text-align:left;" | <code>g = f0 + f4</code>
| colspan=4 | 00000100
| colspan=4 | 00000101
| style="text-align:left;" | 4, 5
| style="text-align:left;" | Count of 1s in each 8-bit slice of a
|-
| style="text-align:left;" | <code>h0 = (g >> 0) & 0000000011111111</code>
| colspan=8 | 0000000000000101
| style="text-align:left;" | 5
| style="text-align:left;" | Every other count from g
|-
| style="text-align:left;" | <code>h8 = (g >> 8) & 0000000011111111</code>
| colspan=8 | 0000000000000100
| style="text-align:left;" | 4
| style="text-align:left;" | The remaining counts from g
|-
| style="text-align:left;" | <code>i = h0 + h8</code>
| colspan=8 | 0000000000001001
| style="text-align:left;" | 9
| style="text-align:left;" | Count of 1s in entire 16-bit word
|}

Here, the operations are as in [[C (programming language)|C programming language]], so {{code|X >> Y}} means to shift X right by Y bits, X & Y means the bitwise AND of X and Y, and + is ordinary addition.  The best algorithms known for this problem are based on the concept illustrated above and are given here:<ref name="Warren_2013"/>

<syntaxhighlight lang="c">
//types and constants used in the functions below
//uint64_t is an unsigned 64-bit integer variable type (defined in C99 version of C language)
const uint64_t m1  = 0x5555555555555555; //binary: 0101...
const uint64_t m2  = 0x3333333333333333; //binary: 00110011..
const uint64_t m4  = 0x0f0f0f0f0f0f0f0f; //binary:  4 zeros,  4 ones ...
const uint64_t m8  = 0x00ff00ff00ff00ff; //binary:  8 zeros,  8 ones ...
const uint64_t m16 = 0x0000ffff0000ffff; //binary: 16 zeros, 16 ones ...
const uint64_t m32 = 0x00000000ffffffff; //binary: 32 zeros, 32 ones
const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of 0,1,2,3...

//This is a naive implementation, shown for comparison,
//and to help in understanding the better functions.
//This algorithm uses 24 arithmetic operations (shift, add, and).
int popcount64a(uint64_t x)
{
    x = (x & m1 ) + ((x >>  1) & m1 ); //put count of each  2 bits into those  2 bits 
    x = (x & m2 ) + ((x >>  2) & m2 ); //put count of each  4 bits into those  4 bits 
    x = (x & m4 ) + ((x >>  4) & m4 ); //put count of each  8 bits into those  8 bits 
    x = (x & m8 ) + ((x >>  8) & m8 ); //put count of each 16 bits into those 16 bits 
    x = (x & m16) + ((x >> 16) & m16); //put count of each 32 bits into those 32 bits 
    x = (x & m32) + ((x >> 32) & m32); //put count of each 64 bits into those 64 bits 
    return x;
}

//This uses fewer arithmetic operations than any other known  
//implementation on machines with slow multiplication.
//This algorithm uses 17 arithmetic operations.
int popcount64b(uint64_t x)
{
    x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
    x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits 
    x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits 
    x += x >>  8;  //put count of each 16 bits into their lowest 8 bits
    x += x >> 16;  //put count of each 32 bits into their lowest 8 bits
    x += x >> 32;  //put count of each 64 bits into their lowest 8 bits
    return x & 0x7f;
}

//This uses fewer arithmetic operations than any other known  
//implementation on machines with fast multiplication.
//This algorithm uses 12 arithmetic operations, one of which is a multiply.
int popcount64c(uint64_t x)
{
    x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
    x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits 
    x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits 
    return (x * h01) >> 56;  //returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ... 
}
</syntaxhighlight>

The above implementations have the best worst-case behavior of any known algorithm.  However, when a value is expected to have few nonzero bits, it may instead be more efficient to use algorithms that count these bits one at a time. As Wegner described in 1960,<ref name="Wegner_1960"/> the [[bitwise AND]] of ''x'' with ''x''&nbsp;&minus;&nbsp;1 differs from ''x'' only in zeroing out the least significant nonzero bit: subtracting 1 changes the rightmost string of 0s to 1s, and changes the rightmost 1 to a 0. If ''x'' originally had ''n'' bits that were 1, then after only ''n'' iterations of this operation, ''x'' will be reduced to zero. The following implementation is based on this principle.

<syntaxhighlight lang="c">
//This is better when most bits in x are 0
//This algorithm works the same for all data sizes.
//This algorithm uses 3 arithmetic operations and 1 comparison/branch per "1" bit in x.
int popcount64d(uint64_t x)
{
    int count;
    for (count=0; x; count++)
        x &= x - 1;
    return count;
}
</syntaxhighlight>

If greater memory usage is allowed, we can calculate the Hamming weight faster than the above methods.  With unlimited memory, we could simply create a large lookup table of the Hamming weight of every 64 bit integer.  If we can store a lookup table of the hamming function of every 16 bit integer, we can do the following to compute the Hamming weight of every 32 bit integer.
<syntaxhighlight lang="c">
static uint8_t wordbits[65536] = { /* bitcounts of integers 0 through 65535, inclusive */ };
//This algorithm uses 3 arithmetic operations and 2 memory reads.
int popcount32e(uint32_t x)
{
    return wordbits[x & 0xFFFF] + wordbits[x >> 16];
}
</syntaxhighlight>

<syntaxhighlight lang="c">
//Optionally, the wordbits[] table could be filled using this function
int popcount32e_init(void)
{
    uint32_t i;
    uint16_t x;
    int count;
    for (i=0; i <= 0xFFFF; i++)
    {
        x = i;
        for (count=0; x; count++) // borrowed from popcount64d() above
            x &= x - 1;
        wordbits[i] = count;
    }
}
</syntaxhighlight>

A recursive algorithm is given in Donovan & Kernighan<ref name="Donovan_2016" />
<syntaxhighlight lang="c">
/* The weight of i can differ from the weight of i / 2
only in the least significant bit of i */
int popcount32e_init (void)
{
    int i;
    for (i = 1; sizeof wordbits / sizeof *wordbits > i; ++i)
        wordbits [i] = wordbits [i >> 1] + (1 & i);
}
</syntaxhighlight>

Muła et al.<ref name="Mula_2018"/> have shown that a vectorized version of popcount64b can run faster than dedicated instructions (e.g., popcnt on x64 processors).