Count the number of set bits in a 32-bit integer

Question

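This is known as the 'Hamming weight', 'popcount', or 'sideways addition'.

Some CPUs have a single built-in instruction to do it, and others have parallel instructions which act on bit vectors. Instructions like x86's popcnt (on CPUs where it's supported) will almost certainly be fastest for a single integer. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed: hardware popcount is normally fast if it exists at all).

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.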

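Your compiler may know how to do something that's good for the specific CPU you're compiling for, e.g. C++20 std::popcount(), or C++ std::bitset<32>::count(), as a portable way to access builtin / intrinsic functions (see another answer on this question). But your compiler's choice of fallback for target CPUs that don't have hardware popcnt might not be optimal for your use case. Or your language (e.g. C) might not expose any portable function that could use a CPU-specific popcount when there is one.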
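As a rough illustration of letting the compiler pick the instruction from C, the sketch below uses the GCC/clang builtin __builtin_popcount when it is available and a plain bit-at-a-time loop otherwise. The function name popcount32_builtin_or_loop is made up for this example, and the sketch assumes unsigned int is at least 32 bits wide.

```c
#include <stdint.h>

/* Count set bits, preferring a compiler builtin when one is known to exist.
   The function name is hypothetical; only __builtin_popcount is a real API. */
static inline int popcount32_builtin_or_loop(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    /* GCC/clang builtin taking unsigned int (assumed >= 32 bits here).
       It compiles to a hardware instruction only if the target allows it. */
    return __builtin_popcount(x);
#else
    /* Portable fallback: test one bit per iteration. */
    int count = 0;
    while (x) {
        count += (int)(x & 1u);
        x >>= 1;
    }
    return count;
#endif
}
```

On x86, GCC and clang only emit the popcnt instruction for the builtin when the target options permit it (for example -mpopcnt or a suitable -march); otherwise they fall back to a software sequence.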

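Portable algorithms that don't need (or benefit from) any hardware support

A pre-populated table lookup method can be very fast if your CPU has a large cache and you are doing lots of these operations in a tight loop. However, it can suffer from the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory. (Look up each byte separately to keep the table small.) If you want the popcount for a contiguous range of numbers, only the low byte changes within each group of 256 numbers, which makes the table method a very good fit.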
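A minimal sketch of the per-byte table lookup described above, assuming a 256-entry table filled once at startup. The names byte_popcount_table, init_byte_popcount_table and popcount32_table are illustrative only.

```c
#include <stdint.h>

/* 256-entry table: byte_popcount_table[n] is the number of set bits in n. */
static uint8_t byte_popcount_table[256];

/* Fill the table once. Entry n reuses the count for n >> 1 and adds the
   low bit of n, so no separate bit-counting routine is needed. */
static void init_byte_popcount_table(void)
{
    byte_popcount_table[0] = 0;
    for (int n = 1; n < 256; n++)
        byte_popcount_table[n] =
            (uint8_t)(byte_popcount_table[n >> 1] + (n & 1));
}

/* Sum the table entries for the four bytes of a 32-bit value. */
static int popcount32_table(uint32_t x)
{
    return byte_popcount_table[x & 0xFF]
         + byte_popcount_table[(x >> 8) & 0xFF]
         + byte_popcount_table[(x >> 16) & 0xFF]
         + byte_popcount_table[x >> 24];
}
```

The table occupies 256 bytes, so it fits in a handful of cache lines; init_byte_popcount_table() must be called before the first lookup.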

If you know that your bytes will be mostly 0s or mostly 1s, then there are efficient algorithms for these scenarios, e.g. clearing the lowest set bit with a bithack in a loop until the value becomes zero.
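The loop mentioned above can be sketched as follows. It uses the x & (x - 1) bithack, which clears the lowest set bit, so the loop runs once per set bit. The function names are made up for this example; the "mostly ones" variant simply counts the zero bits of the input instead.

```c
#include <stdint.h>

/* Fast when few bits are set: one iteration per set bit. */
static int popcount32_sparse(uint32_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Fast when most bits are set: count the zero bits and subtract from 32. */
static int popcount32_dense(uint32_t x)
{
    return 32 - popcount32_sparse(~x);
}
```

Unlike the SWAR version, the running time here depends on the data, which is the trade-off the paragraph describes.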

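I believe a very good general-purpose algorithm is the following, known as the 'parallel' or 'variable-precision SWAR' algorithm. I have expressed it in a C-like pseudo-language; you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java).

GCC 10 and clang 10.0 can recognize this pattern / idiom and compile it to a hardware popcnt or equivalent instruction when available, giving you the best of both worlds. (Godbolt)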

int numberOfSetBits(uint32_t i)
{
     // Java: use int, and use >>> instead of >>. Or use Integer.bitCount()
     // C or C++: use uint32_t (from <stdint.h> or <cstdint>)
     i = i - ((i >> 1) & 0x55555555);        // add pairs of bits
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // quads
     i = (i + (i >> 4)) & 0x0F0F0F0F;        // groups of 8
     i *= 0x01010101;                        // horizontal sum of bytes
     return  i >> 24;               // return just that top byte (after truncating to 32-bit even when int is wider than uint32_t)
}

For JavaScript: coerce to integer with |0 for performance. Change the first line to: i = (i|0) - ((i >> 1) & 0x55555555);

This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it. (Its performance is not data-dependent on normal CPUs where all integer operations including multiply are constant-time. It doesn't get any faster with "simple" inputs, but it's still pretty decent.)

References:

Bit Twiddling Hacks
Introduction to Low Level Bit Hacks for bithack basics, like how subtracting 1 flips contiguous zeros.
Hamming weight (Wikipedia)
Fast Bit Counting
The Aggregate Magic Algorithms

How this SWAR bithack works:

i = i - ((i >> 1) & 0x55555555);

The first step is an optimized version of masking to isolate the odd / even bits, shifting to line them up, and adding. This effectively does 16 separate additions in 2-bit accumulators (SWAR = SIMD Within A Register). The unoptimized equivalent would be (i & 0x55555555) + ((i>>1) & 0x55555555).
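To see why the subtraction works, consider one 2-bit field with bits (hi, lo): its value is 2*hi + lo, and subtracting hi leaves hi + lo, the number of set bits in that field. The sketch below puts the two forms side by side so they can be compared on any input; both function names are invented for this illustration.

```c
#include <stdint.h>

/* Optimized form used in the answer: one mask, one subtract. */
static uint32_t pair_sums_subtract(uint32_t i)
{
    return i - ((i >> 1) & 0x55555555u);
}

/* Straightforward form: isolate even bits and odd bits, then add them. */
static uint32_t pair_sums_mask_add(uint32_t i)
{
    return (i & 0x55555555u) + ((i >> 1) & 0x55555555u);
}
```

For example, an input of all ones yields 0xAAAAAAAA from either function: every 2-bit field holds binary 10, i.e. a count of 2.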

The next step takes the odd/even eight of those 16x 2-bit accumulators and adds again, producing 8x 4-bit sums. The i - ... optimization isn't possible this time so it does just mask before / after shifting. Using the same 0x33... constant both times instead of 0xccc... before shifting is a good thing when compiling for ISAs that need to construct 32-bit constants in registers separately.

The final shift-and-add step of (i + (i >> 4)) & 0x0F0F0F0F widens to 4x 8-bit accumulators. It masks after adding instead of before, because the maximum value in any 4-bit accumulator is 4, if all 4 bits of the corresponding input bits were set. 4+4 = 8 which still fits in 4 bits, so carry between nibble elements is impossible in i + (i >> 4).

So far this is just fairly normal SIMD using SWAR techniques with a few clever optimizations. Continuing on with the same pattern for 2 more steps can widen to 2x 16-bit then 1x 32-bit counts. But there is a more efficient way on machines with fast hardware multiply:

Once we have few enough "elements", a multiply with a magic constant can sum all the elements into the top element, in this case byte elements. Multiplying by a constant is equivalent to left-shifting and adding, so x * 0x01010101 results in x + (x<<8) + (x<<16) + (x<<24). Our 8-bit elements are wide enough (and hold small enough counts) that no carry out of the lower elements spills into the top 8 bits.
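The following sketch compares three ways of summing the four byte counts: the multiply, the explicit shift-and-add expansion of that multiply, and the two extra widening steps mentioned earlier for machines without a fast multiplier. All three function names are hypothetical, and each function assumes its input already holds four per-byte counts of at most 8, as produced by the first three SWAR steps.

```c
#include <stdint.h>

/* Horizontal sum via multiply: the top byte of the product is b0+b1+b2+b3.
   The cast truncates to 32 bits in case unsigned int is wider. */
static uint32_t byte_sum_by_multiply(uint32_t x)
{
    return (uint32_t)(x * 0x01010101u) >> 24;
}

/* The same product written out as shifts and adds. */
static uint32_t byte_sum_by_shifts(uint32_t x)
{
    return (uint32_t)(x + (x << 8) + (x << 16) + (x << 24)) >> 24;
}

/* Multiply-free alternative: two more widening steps (16-bit, then 32-bit).
   The total ends up in the low byte; since it is at most 32, masking with
   0x3F discards the partial sums left in the upper bytes. */
static uint32_t byte_sum_by_folding(uint32_t x)
{
    x += x >> 8;
    x += x >> 16;
    return x & 0x3Fu;
}
```

For instance, with per-byte counts 1, 2, 3 and 4 packed as 0x01020304, all three return 10.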

A 64-bit version of this can do 8x 8-bit elements in a 64-bit integer with a 0x0101010101010101 multiplier, and extract the high byte with >>56. So it doesn't take any extra steps, just wider constants. This is what GCC uses for __builtin_popcountll on x86 systems when the hardware popcnt instruction isn't enabled. If you can use builtins or intrinsics for this, do so to give the compiler a chance to do target-specific optimizations.
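A sketch of the 64-bit variant described above: the same four steps with the constants widened to 64 bits and the final shift changed to 56. The function name popcount64_swar is chosen for this example and is not taken from GCC.

```c
#include <stdint.h>

/* 64-bit SWAR popcount: same structure as the 32-bit version. */
static int popcount64_swar(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);              /* 32x 2-bit sums */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);                  /* 16x 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;              /* 8x 8-bit sums  */
    return (int)((x * 0x0101010101010101ULL) >> 56);         /* sum in top byte */
}
```

The largest possible result, 64, still fits comfortably in the top byte, so the multiply cannot overflow it.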

With full SIMD for wider vectors (e.g. counting a whole array)

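This bitwise SWAR algorithm could be parallelized across multiple vector elements at once, instead of running in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later).

However, the best way to use vector instructions for popcount is usually to use a variable shuffle to do a table lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16-entry table held in a vector register.)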
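The sketch below is a scalar model of that idea, not actual SIMD code: it performs, one byte at a time, the same two 16-entry lookups (low nibble and high nibble) that a variable byte shuffle would perform for every byte of a vector at once. The table and function names are invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of set bits in each 4-bit value 0..15. In a SIMD version this
   table would sit in one vector register and be indexed by a byte shuffle. */
static const uint8_t nibble_popcount[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar model: two table lookups per byte, summed over the buffer. */
static size_t popcount_bytes_nibble_lut(const uint8_t *data, size_t len)
{
    size_t total = 0;
    for (size_t k = 0; k < len; k++) {
        total += nibble_popcount[data[k] & 0x0F];  /* low 4 bits  */
        total += nibble_popcount[data[k] >> 4];    /* high 4 bits */
    }
    return total;
}
```

A real vector implementation would also need to accumulate the per-byte counts into wider elements before they overflow; that part is omitted here.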

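On Intel CPUs, the hardware 64-bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.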

State-of-the-art x86 SIMD popcount for SSSE3, AVX2, AVX512BW, AVX512VBMI, or AVX512 VPOPCNT, using Harley-Seal across vectors to defer the popcount within an element. (Also ARM NEON.)
Counting 1 bits (population count) on large data using AVX-512 or AVX-2
Related: mklarqvist's positional-popcount, which keeps separate counts for each bit position of multiple 8, 16, 32, or 64-bit integers. (Again, x86 SIMD including AVX-512 is really good at this, with vpternlogd making Harley-Seal very good.)
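Harley-Seal is built on a carry-save adder (CSA), which compresses three words into two so that fewer actual popcounts are needed. The sketch below shows only that building block on 32-bit scalars, under the assumption that a plain loop stands in for the real popcount; it is not the full Harley-Seal algorithm, and all names in it are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t high;  /* carry bits: each set bit is worth 2 */
    uint32_t low;   /* sum bits:   each set bit is worth 1 */
} csa_result;

/* Bitwise full adder applied to 32 independent bit positions at once. */
static csa_result carry_save_add(uint32_t a, uint32_t b, uint32_t c)
{
    csa_result r;
    uint32_t partial = a ^ b;
    r.high = (a & b) | (partial & c);
    r.low  = partial ^ c;
    return r;
}

/* Stand-in popcount (clears the lowest set bit per iteration). */
static int popcount32_ref(uint32_t x)
{
    int count = 0;
    while (x) { x &= x - 1; count++; }
    return count;
}

/* popcount(a) + popcount(b) + popcount(c) using two popcounts, not three. */
static int popcount_three_words(uint32_t a, uint32_t b, uint32_t c)
{
    csa_result r = carry_save_add(a, b, c);
    return 2 * popcount32_ref(r.high) + popcount32_ref(r.low);
}
```

Chaining such adders over many input vectors is what lets the vectorized versions defer the per-element popcount, and on AVX-512 each CSA output is a three-input boolean function of the kind vpternlogd evaluates.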

