c++ - when - Bit counting in a contiguous memory chunk

types of pointers in c++ (6)

An excellent recent study comparing several of the most modern techniques for counting the number of 'set' (1-valued) bits in a range of memory (aka Hamming Weight, bitset cardinality, sideways sum, population count or popcnt, etc.) can be found in Wojciech, Kurz, and Lemire (2017), Faster population counts using AVX2 instructions 1

The following is a complete, tested, and fully-working C# adaptation of the "Harley-Seal" algorithm from that paper, which the authors found to be the fastest method that uses general-purpose bitwise operations (that is, that doesn't require special hardware).

1. Managed array entry points
(optional) Provides access to the block-optimized bit-counting for managed array ulong[].

/// <summary> Returns the total number of 1-valued bits in the array </summary>
public static int OnesCount(ulong[] rg) => OnesCount(rg, 0, rg.Length);

/// <summary> Finds the total number of '1' bits in an array or its subset </summary>
/// <param name="rg"> Array of ulong values to scan </param>
/// <param name="index"> Starting index in the array </param>
/// <param name="count"> Number of ulong values to examine, starting at 'i' </param>
public static int OnesCount(ulong[] rg, int index, int count)
    if ((index | count) < 0 || index > rg.Length - count)
        throw new ArgumentException();

    fixed (ulong* p = &rg[index])
        return OnesCount(p, count);

2. Scalar API
Used by the block-optimized counter to aggregate results from the carry-save adder, and also to finish up any remainder for block sizes not divisible by the optimized chunk size of 16 x 8 bytes/ulong = 128 bytes. Suitable for general-purpose use also.

/// <summary> Finds the Hamming Weight or ones-count of a ulong value </summary>
/// <returns> The number of 1-bits that are set in 'x' </returns>
public static int OnesCount(ulong x)
    x -= (x >> 1) & 0x5555555555555555;
    x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
    return (int)((((x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F) * 0x0101010101010101) >> 56);

3. "Harley-Seal" block-optimized 1s-bit counter
Processes blocks of 128 bytes at a time, i.e., 16 ulong values per block. Uses the carry-save adder (shown below) to gang-add single bits across adjacent ulongs, and aggregates totals upwards as powers of two.

/// <summary> Count the number of 'set' (1-valued) bits in a range of memory. </summary>
/// <param name="p"> Pointer to an array of 64-bit ulong values to scan </param>
/// <param name="c"> Size of the memory block as a count of 64-bit ulongs </param>
/// <returns> The total number of 1-bits </returns>
public static int OnesCount(ulong* p, int c)
    ulong z, y, x, w;
    int c = 0;

    for (w = x = y = z = 0UL; cq >= 16; cq -= 16)
        c += OnesCount(CSA(ref w,
                            CSA(ref x,
                                CSA(ref y,
                                    CSA(ref z, *p++, *p++),
                                    CSA(ref z, *p++, *p++)),
                                CSA(ref y,
                                    CSA(ref z, *p++, *p++),
                                    CSA(ref z, *p++, *p++))),
                            CSA(ref x,
                                CSA(ref y,
                                    CSA(ref z, *p++, *p++),
                                    CSA(ref z, *p++, *p++)),
                                CSA(ref y,
                                    CSA(ref z, *p++, *p++),
                                    CSA(ref z, *p++, *p++)))));

    c <<= 4;
    c += (OnesCount(w) << 3) + (OnesCount(x) << 2) + (OnesCount(y) << 1) + OnesCount(z);

    while (--cq >= 0)
        c += OnesCount(*p++);

    return c;

4. Carry-save adder (CSA)

/// <summary> carry-save adder </summary>
static ulong CSA(ref ulong a, ulong b, ulong c)
    ulong v = a & b | (a ^ b) & c;
    a ^= b ^ c;
    return v;


Because the approach shown here counts the total number of 1-bits by proceeding 128-byte chunks at a time, it only becomes optimal with larger memory block sizes. For example, likely at least some (small) multiple of that sixteen-qword (16-ulong) chunk size. For counting 1-bits in smaller memory ranges, this code will work correctly, but drastically underperform more naïve methods. See the paper for details.

From the paper, this diagram summarizes how the Carry-Save Adder works:


[1.] Muła, Wojciech, Nathan Kurz, and Daniel Lemire. "Faster population counts using AVX2 instructions." The Computer Journal 61, no. 1 (2017): 111-120.

I was asked in an interview the following question.

int countSetBits(void *ptr, int start, int end); 

Synopsis: Assume that ptr points to a big chunk of memory. Viewing this memory as contiguous sequence of bits, start and end are bit positions. Assume start and end have proper values and ptr is pointing to an initialized chunck of memory.

Question: Write a C code to count number of bits set from start to end [inclusive] and return the count.

Just to make it more clear

         | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         | 8 | 9 |                   |15 |
         |                               |
         |               | S |           |
         |    | E |                      |

My solution:

int countSetBits(void *ptr, int start, int end )
    int count = 0, idx; 

    char *ch; 

    for (idx = start; idx <= end; idx++) 
    {     ch = ptr + (idx/8); 

          if((128 >> (idx%8)) & (*ch)) 

    return count; 

I gave a very lengthy and somewhat inefficient code during the interview. I worked on it later and came up with above solution.

I am very sure SO community can provide more elegant solution. I am just curious to see their response.

PS: Above code is not compiled. It is more like a pseudo code and may contain errors.

Boundary conditions, they get no respect...

Everyone here seems to be concentrating on the lookup table to count the bits. And that's OK, but I think that even more important when answering an interview question is to make sure you handle the boundary conditions.

The look up table is just an optimization. It's much more important to get the answer right than to get it fast. If this were my interview, going straight for the lookup table without even mentioning that there are some tricky details about handling the first few and last few bits that aren't on full-byte boundaries would be worse than coming up with a solution that counted each bit ploddingly, but got the boundary conditions right.

So I think Bhaskar's solution in his question is probably superior to the most of the answers mentioned here - it seems to handle the boundary conditions.

Here's a solution that uses a lookup table and tries to still handle the boundaries (it's only lightly tested, so I won't claim that it's 100% correct). It's also uglier than I'd like, but it's late:

typedef unsigned char uint8_t;

size_t bits_in_byte( uint8_t val)
    static int const half_byte[] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 };

    int result1 = half_byte[val & 0x0f];
    int result2 = half_byte[(val >> 4) & 0x0f];

    return result1 + result2;

int countSetBits( void* ptr, int start, int end) 
    uint8_t*    first;
    uint8_t*    last;
    int         bits_first;
    int         bits_last;
    uint8_t     mask_first;
    uint8_t     mask_last;

    size_t count = 0;

    // get bits from the first byte
    first = ((uint8_t*) ptr) + (start / 8);
    bits_first = 8 - start % 8;
    mask_first = (1 << bits_first) - 1;
    mask_first = mask_first << (8 - bits_first);

    // get bits from last byte
    last = ((uint8_t*) ptr) + (end / 8);
    bits_last = 1 + (end % 8);
    mask_last = (1 << bits_last) - 1;

    if (first == last) {
        // we only have a range of bits in  the first byte
        count = bits_in_byte( (*first) & mask_first & mask_last);        
    else {
        // handle the bits from the first and last bytes specially
        count += bits_in_byte((*first) & mask_first);
        count += bits_in_byte((*last) & mask_last);

        // now we've collected the odds and ends from the start and end of the bit range
        // handle the full bytes in the interior of the range

        for (first = first+1; first != last; ++first) {
            count += bits_in_byte(*first);

    return count;

Note that a detail that would have to be worked out as part of the interview is whether the bits within a byte are indexed starting at the least-significant-bit (lsb) or most-significant-bit (msb). In other words, if the start index were specified as 0, would a byte with the value 0x01 or a byte with the value 0x80 have the bit set in that index? Sort of like deciding whether the indexes consider the bit order within a byte as big-endian or little-endian.

There's no 'right' answer for this - the interviewer would have to specify what the behavior should be. I'll also note that my example solution handles this in the opposite way to the OP's example code (I was going by how I interpreted the diagram, with the indexes reading as 'bit numbers' as well). The OPs' solution considers the bit order as big-endian, my function treats them as little-endian. So even though both handle partial bytes at the star & end of the range, they'll give different answers. Which is the right answer depends on what the actual spec for the problem is.

Disclaimer: No attempt to compile the following code has been made.

 * Table counting the number of set bits in a byte.
 * The byte is the index to the table.
uint8_t  table[256] = {...};

 * countBits - count the number of set bits in a range
 * The most significant bit in the byte is considered to be bit 0.
 * RETURNS: 0 on success, -1 on failure
int countBits (
    uint8_t *  buffer,
    int        startBit,  /* starting bit */
    int        endBit,    /* End-bit (inlcusive) */
    unsigned * pTotal     /* Output: number of consecutively set bits */
    ) {
    int      numBits;     /* number of bits left to check */
    int      mask;        /* mask to apply to byte from <buffer> */
    int      bits;        /* # of bits to end of byte */
    unsigned count = 0;   /* total number of bits set */
    uint8_t  value;       /* value read from the buffer */

    /* Return -1 if parameters fail sanity check (skipped) */

    numBits   = (endBit - startBit) + 1;

    index  = startBit >> 3;
    bits   = 8 - (startBit & 7);
    mask   = (1 << bits) - 1;

    value = buffer[index] & mask;  /* mask-out any bits preceding <startBit> */
    numBits -= bits;

    while (numBits > 0) {          /* Note: if <startBit> and <endBit> are in */
        count += table[value];     /* same byte, this loop gets skipped. */
        value = buffer[index];
        numBits -= 8;

    if (numBits < 0) {             /* mask-out any bits following <endBit> */
        bits   = 8 - (endBit & 7);
        mask   = 0xff << bits;
        value &= mask;

    count += table[value];

    *pTotal = count;
    return 0;

Edit: Function header updated.

The most quick and efficient way to my opinion is to use a table of 256 entries, where every element represents number of bits in the index. Index is a next byte from the memory location.

something like this:

int bit_table[256] = {0, 1, 1, 2, 1, ...};
char* p = ptr + start;
int count = 0;
for (p; p != ptr + end; p++)
    count += bit_table[*(unsigned char*)p];

There are many ways to solve the problem. This is a good post that compares the performance of the most common options.

You might find this page interesting, it contains several alternative solutions for your problem.