ソートされた配列を処理する方が、ソートされていない配列を処理するよりも高速なのはなぜですか? 質問する

Question

あなたは分岐予測の失敗の被害者です。

分岐予測とは何ですか?

鉄道のジャンクションを考えてみましょう。

_{画像はMecanismoによるもので、Wikimedia Commonsから提供されています。CC -By-SA 3.0ライセンスに基づいて使用されています。}

さて、議論のために、これが長距離通信や無線通信が登場する前の 1800 年代に戻ったと仮定しましょう。

あなたは交差点の盲目のオペレーターで、電車が来る音が聞こえます。どの方向に進むべきか全く分かりません。電車を止めて運転手にどの方向に進むか尋ねます。そして、適切にスイッチを設定します。

電車は重くて慣性が大きいため、発進や減速に時間がかかります。

もっといい方法はありませんか？電車がどの方向に行くかを推測してください！

正解なら、続きです。
間違った推測をした場合、運転手は車を止めて後退し、スイッチを入れるように叫びます。その後、車は他の経路で再始動できます。

毎回正しく推測できれば、列車は停止する必要がなくなります。
間違った推測を頻繁に行うと、列車は停止、後退、再起動に多くの時間を費やすことになります。

if文を考えてみましょう。プロセッサレベルでは、これは分岐命令です。

if(x >= 128) は、jump-if-less-than プロセッサ命令にコンパイルされます。

あなたはプロセッサで、分岐を見ました。どちらの方向に進むかはわかりません。どうしますか? 実行を停止し、前の命令が完了するまで待ちます。その後、正しいパスに進みます。

現代のプロセッサは複雑で、パイプラインが長いです。つまり、「ウォームアップ」と「スローダウン」に非常に長い時間がかかります。

もっと良い方法はありますか? 枝がどの方向に行くかを推測してください!

正しく推測した場合は、実行を続けます。
間違った推測をした場合は、パイプラインをフラッシュしてブランチにロールバックする必要があります。その後、他のパスを再開できます。

毎回正しく推測できれば、実行を停止する必要がなくなります。
間違った推測が頻繁に発生すると、停止、ロールバック、再起動に多くの時間を費やすことになります。

これは分岐予測です。列車は旗で方向を知らせることができるので、これは最良の例えではないことは認めます。しかし、コンピューターでは、プロセッサは最後の瞬間まで分岐がどの方向に進むかわかりません。

列車が後退して別の経路を進む回数を最小限に抑えるために、戦略的に推測するにはどうすればよいでしょうか。過去の履歴を見てみましょう。列車が 99% の確率で左に進む場合は、左を推測します。交互に進む場合は、推測を交互に行います。列車が 3 回に 1 回、同じ方向に進む場合は、同じことを推測します...

言い換えれば、パターンを識別してそれに従うようにします。これは、分岐予測子が動作する仕組みとほぼ同じです。

ほとんどのアプリケーションでは、分岐は適切に行われます。そのため、最新の分岐予測器は通常、90% を超えるヒット率を達成します。しかし、認識可能なパターンのない予測不可能な分岐に直面した場合、分岐予測器は事実上役に立ちません。

さらに詳しい情報: Wikipedia の「分岐予測子」の記事。

上記から示唆されているように、原因は次の if ステートメントです。

if (data[c] >= 128)
    sum += data[c];

データが 0 から 255 の間で均等に分散されていることに注意してください。データがソートされると、反復処理の最初の半分は if ステートメントに入りません。その後は、すべての反復処理が if ステートメントに入ります。

This is very friendly to the branch predictor since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.

Quick visualization:

T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

However, when the data is completely random, the branch predictor is rendered useless, because it can't predict random data. Thus there will probably be around 50% misprediction (no better than random guessing).

data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T  ...

       = TTNTTTTNTNNTTT ...   (completely random - impossible to predict)

What can be done?

If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.

Replace:

if (data[c] >= 128)
    sum += data[c];

with:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

This eliminates the branch and replaces it with some bitwise operations.

_{(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of data[].)}

Benchmarks: Core i7 920 @ 3.5 GHz

C++ - Visual Studio 2010 - x64 Release

Scenario	Time (seconds)
Branching - Random data	11.777
Branching - Sorted data	2.352
Branchless - Random data	2.564
Branchless - Sorted data	2.587

Java - NetBeans 7.1.1 JDK 7 - x64

Scenario	Time (seconds)
Branching - Random data	10.93293813
Branching - Sorted data	5.643797077
Branchless - Random data	3.113581453
Branchless - Sorted data	3.186068823

Observations:

With the Branch: There is a huge difference between the sorted and unsorted data.
With the Hack: There is no difference between sorted and unsorted data.
In the C++ case, the hack is actually a tad slower than with the branch when the data is sorted.

A general rule of thumb is to avoid data-dependent branching in critical loops (such as in this example).

Update:

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move, so there is no difference between the sorted and unsorted data - both are fast. This is called "if-conversion" (to branchless) and is necessary for vectorization but also sometimes good for scalar.

(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2-cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. Not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Clang also vectorizes the if() version, as will GCC 5 and later with -O3, even though it takes quite a few instructions to sign-extend to the 64-bit sum on x86 without SSE4 or AVX2. (-march=x86-64-v2 or v3). See Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

This goes to show that even mature modern compilers can vary wildly in their ability to optimize code...

Answer 1

あなたは分岐予測の失敗の被害者です。

分岐予測とは何ですか?

鉄道のジャンクションを考えてみましょう。

_{画像はMecanismoによるもので、Wikimedia Commonsから提供されています。CC -By-SA 3.0ライセンスに基づいて使用されています。}

さて、議論のために、これが長距離通信や無線通信が登場する前の 1800 年代に戻ったと仮定しましょう。

あなたは交差点の盲目のオペレーターで、電車が来る音が聞こえます。どの方向に進むべきか全く分かりません。電車を止めて運転手にどの方向に進むか尋ねます。そして、適切にスイッチを設定します。

電車は重くて慣性が大きいため、発進や減速に時間がかかります。

もっといい方法はありませんか？電車がどの方向に行くかを推測してください！

正解なら、続きです。
間違った推測をした場合、運転手は車を止めて後退し、スイッチを入れるように叫びます。その後、車は他の経路で再始動できます。

毎回正しく推測できれば、列車は停止する必要がなくなります。
間違った推測を頻繁に行うと、列車は停止、後退、再起動に多くの時間を費やすことになります。

if文を考えてみましょう。プロセッサレベルでは、これは分岐命令です。

if(x >= 128) は、jump-if-less-than プロセッサ命令にコンパイルされます。

あなたはプロセッサで、分岐を見ました。どちらの方向に進むかはわかりません。どうしますか? 実行を停止し、前の命令が完了するまで待ちます。その後、正しいパスに進みます。

現代のプロセッサは複雑で、パイプラインが長いです。つまり、「ウォームアップ」と「スローダウン」に非常に長い時間がかかります。

もっと良い方法はありますか? 枝がどの方向に行くかを推測してください!

正しく推測した場合は、実行を続けます。
間違った推測をした場合は、パイプラインをフラッシュしてブランチにロールバックする必要があります。その後、他のパスを再開できます。

毎回正しく推測できれば、実行を停止する必要がなくなります。
間違った推測が頻繁に発生すると、停止、ロールバック、再起動に多くの時間を費やすことになります。

これは分岐予測です。列車は旗で方向を知らせることができるので、これは最良の例えではないことは認めます。しかし、コンピューターでは、プロセッサは最後の瞬間まで分岐がどの方向に進むかわかりません。

列車が後退して別の経路を進む回数を最小限に抑えるために、戦略的に推測するにはどうすればよいでしょうか。過去の履歴を見てみましょう。列車が 99% の確率で左に進む場合は、左を推測します。交互に進む場合は、推測を交互に行います。列車が 3 回に 1 回、同じ方向に進む場合は、同じことを推測します...

言い換えれば、パターンを識別してそれに従うようにします。これは、分岐予測子が動作する仕組みとほぼ同じです。

ほとんどのアプリケーションでは、分岐は適切に行われます。そのため、最新の分岐予測器は通常、90% を超えるヒット率を達成します。しかし、認識可能なパターンのない予測不可能な分岐に直面した場合、分岐予測器は事実上役に立ちません。

さらに詳しい情報: Wikipedia の「分岐予測子」の記事。

上記から示唆されているように、原因は次の if ステートメントです。

if (data[c] >= 128)
    sum += data[c];

データが 0 から 255 の間で均等に分散されていることに注意してください。データがソートされると、反復処理の最初の半分は if ステートメントに入りません。その後は、すべての反復処理が if ステートメントに入ります。

This is very friendly to the branch predictor since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.

Quick visualization:

T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

However, when the data is completely random, the branch predictor is rendered useless, because it can't predict random data. Thus there will probably be around 50% misprediction (no better than random guessing).

data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T  ...

       = TTNTTTTNTNNTTT ...   (completely random - impossible to predict)

What can be done?

If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.

Replace:

if (data[c] >= 128)
    sum += data[c];

with:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

This eliminates the branch and replaces it with some bitwise operations.

_{(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of data[].)}

Benchmarks: Core i7 920 @ 3.5 GHz

C++ - Visual Studio 2010 - x64 Release

Scenario	Time (seconds)
Branching - Random data	11.777
Branching - Sorted data	2.352
Branchless - Random data	2.564
Branchless - Sorted data	2.587

Java - NetBeans 7.1.1 JDK 7 - x64

Scenario	Time (seconds)
Branching - Random data	10.93293813
Branching - Sorted data	5.643797077
Branchless - Random data	3.113581453
Branchless - Sorted data	3.186068823

Observations:

With the Branch: There is a huge difference between the sorted and unsorted data.
With the Hack: There is no difference between sorted and unsorted data.
In the C++ case, the hack is actually a tad slower than with the branch when the data is sorted.

A general rule of thumb is to avoid data-dependent branching in critical loops (such as in this example).

Update:

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move, so there is no difference between the sorted and unsorted data - both are fast. This is called "if-conversion" (to branchless) and is necessary for vectorization but also sometimes good for scalar.

(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2-cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. Not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Clang also vectorizes the if() version, as will GCC 5 and later with -O3, even though it takes quite a few instructions to sign-extend to the 64-bit sum on x86 without SSE4 or AVX2. (-march=x86-64-v2 or v3). See Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

This goes to show that even mature modern compilers can vary wildly in their ability to optimize code...

ソートされた配列を処理する方が、ソートされていない配列を処理するよりも高速なのはなぜですか? 質問する

ベストアンサー1

分岐予測とは何ですか?

上記から示唆されているように、原因は次の if ステートメントです。

おすすめ記事