What is a "cache-friendly" code? Ask Question


What is the difference between "cache unfriendly" code and "cache friendly" code?

How can I make sure I write cache-efficient code?

Best Answer

Preliminaries

On modern computers, only the lowest level memory structures (the registers) can move data around in single clock cycles. However, registers are very expensive and most computer cores have fewer than a few dozen registers. At the other end of the memory spectrum (DRAM), memory is very cheap (i.e. literally millions of times cheaper) but takes hundreds of cycles after a request to deliver the data. Bridging this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3 in decreasing speed and cost. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in the L1 cache, it looks in the L2 cache. If not there, then the L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.

(The analogy is cache memory is to system memory, as system memory is to hard disk storage. Hard disk storage is super cheap but very slow).

Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cf. links below): increasing bandwidth is easy, but we can't buy our way out of latency.

Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance since every cache miss results in fetching data from RAM (or worse ...) which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.

In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or beyond). This will only get worse over time. Increases in processor frequency are no longer the main driver of performance; the problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of the die on caches and up to 99% on storing/moving data!

There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:

Main concepts of cache-friendly code

A very important aspect of cache-friendly code is the principle of locality, the goal of which is to place related data close together in memory to allow efficient caching. In terms of the CPU cache, it is important to be aware of cache lines to understand how this works: How do cache lines work?

The following particular aspects are of high importance to optimize caching:

  1. Temporal locality: when a given memory location is accessed, it is likely that the same location will be accessed again in the near future. Ideally, this information will still be cached at that point.
  2. Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than was specifically asked for, because very often the program will need that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important. (A minimal code sketch of spatial locality follows right after this list.)
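
To make spatial locality concrete, here is a minimal C++ sketch (the function names and the stride parameter are my own illustration, not part of the original answer) contrasting a contiguous access pattern against a strided one that touches a new cache line on almost every access:

    #include <cstddef>
    #include <vector>

    // Contiguous traversal: consecutive elements share cache lines, so most
    // accesses after the first one in a line are cache hits (spatial locality).
    double sum_with_good_locality(const std::vector<double>& data)
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < data.size(); ++i)
            sum += data[i];
        return sum;
    }

    // Strided traversal of the same data: once the stride exceeds the cache
    // line size (typically 64 bytes), almost every access touches a new line.
    double sum_with_poor_locality(const std::vector<double>& data, std::size_t stride)
    {
        double sum = 0.0;
        for (std::size_t start = 0; start < stride; ++start)
            for (std::size_t i = start; i < data.size(); i += stride)
                sum += data[i];
        return sum;
    }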

Use appropriate containers

A simple example of cache-friendly versus cache-unfriendly is C++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, so accessing them is much more cache-friendly than accessing elements of a std::list, which stores its contents all over the place. This is due to spatial locality.
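
As a rough illustration (a hypothetical micro-benchmark, not from the original answer; absolute timings depend heavily on hardware, compiler and optimization flags), summing both containers makes the difference visible:

    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <list>
    #include <numeric>
    #include <vector>

    int main()
    {
        const std::size_t n = 10'000'000;
        std::vector<int> vec(n, 1);
        std::list<int>   lst(n, 1);

        auto time_sum = [](const auto& container, const char* name) {
            auto start = std::chrono::steady_clock::now();
            long long sum = std::accumulate(container.begin(), container.end(), 0LL);
            auto stop = std::chrono::steady_clock::now();
            std::cout << name << ": sum = " << sum << ", took "
                      << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                      << " ms\n";
        };

        time_sum(vec, "std::vector");  // contiguous: cache hits, hardware prefetch is effective
        time_sum(lst, "std::list");    // pointer chasing: each node may sit on a different cache line
    }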

A very nice illustration of this is given by Bjarne Stroustrup in this YouTube clip (thanks to @Mohammad Ali Baydoun for the link!).

Don't neglect the cache in data structure and algorithm design

Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (see, for example, ATLAS).
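
A minimal sketch of cache blocking, here applied to a matrix transpose (the block size of 64 is an assumed tuning parameter, not a value from the original answer; a good choice depends on the cache sizes of the target machine):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Transpose an n x n matrix stored in row-major order, one B x B tile at a
    // time, so the cache lines a tile touches stay resident while it is processed.
    void transpose_blocked(const std::vector<double>& in, std::vector<double>& out,
                           std::size_t n, std::size_t B = 64)
    {
        for (std::size_t ii = 0; ii < n; ii += B)
            for (std::size_t jj = 0; jj < n; jj += B)
                for (std::size_t i = ii; i < std::min(ii + B, n); ++i)
                    for (std::size_t j = jj; j < std::min(jj + B, n); ++j)
                        out[j * n + i] = in[i * n + j];
    }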

Know and exploit the implicit structure of data

Another simple example, which many people in the field sometimes forget, is column-major (e.g. Fortran, MATLAB) versus row-major ordering (e.g. C, C++) for storing two-dimensional arrays. For example, consider the following matrix:

1 2
3 4

In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering, this would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). @MatteoItalia showed this example in more detail in his answer.

When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).

For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (let's call it M):

Exploiting the ordering (e.g. changing the column index first in C++):

M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses

Not exploiting the ordering (e.g. changing the row index first in C++):

M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses

In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice, the performance difference can be much larger.
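
In C++ (row-major), the two traversal orders above correspond to loops like the following (a minimal sketch; the array M and the size N are placeholders of my own choosing):

    #include <cstddef>

    const std::size_t N = 1024;
    double M[N][N];  // a C/C++ 2D array is laid out in row-major order

    // Exploits the ordering: the inner loop walks along a row, i.e. through
    // contiguous memory, so consecutive accesses hit the same cache line.
    double sum_row_by_row()
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                sum += M[i][j];
        return sum;
    }

    // Does not exploit the ordering: the inner loop jumps N * sizeof(double)
    // bytes between accesses, touching a different cache line almost every time.
    double sum_column_by_column()
    {
        double sum = 0.0;
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < N; ++i)
                sum += M[i][j];
        return sum;
    }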

Avoid unpredictable branches

Modern architectures feature pipelines and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.

This is explained very well here (thanks to @0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?
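
A minimal sketch in the spirit of the linked question (the function names are my own; whether the compiler really turns the second version into a conditional move is not guaranteed and should be checked in the generated assembly):

    #include <cstdint>
    #include <vector>

    // Data-dependent branch: on random input it is mispredicted roughly half
    // the time, which stalls the pipeline and hampers prefetching.
    long long sum_big_branchy(const std::vector<std::uint8_t>& data)
    {
        long long sum = 0;
        for (std::uint8_t v : data)
            if (v >= 128)
                sum += v;
        return sum;
    }

    // Branchless formulation: typically compiled to a conditional move, so
    // there is no branch to mispredict.
    long long sum_big_branchless(const std::vector<std::uint8_t>& data)
    {
        long long sum = 0;
        for (std::uint8_t v : data)
            sum += (v >= 128) ? v : 0;
        return sum;
    }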

Avoid virtual functions

In the context of C++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during lookup, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?
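
For illustration, a minimal sketch of such a virtual call (the Shape/Circle hierarchy is a made-up example, not something from the original answer):

    #include <memory>
    #include <vector>

    struct Shape {
        virtual ~Shape() = default;
        virtual double area() const = 0;   // dispatched through the vtable
    };

    struct Circle : Shape {
        double r = 1.0;
        double area() const override { return 3.14159265358979 * r * r; }
    };

    // Each call goes through the object's vtable pointer. If the call sites and
    // targets are hot, they stay cached and the overhead is small, but a mix of
    // rarely used overrides can add cache misses, and the indirection typically
    // prevents inlining.
    double total_area(const std::vector<std::unique_ptr<Shape>>& shapes)
    {
        double total = 0.0;
        for (const auto& s : shapes)
            total += s->area();
        return total;
    }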

Common problems

A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when different processors use logically independent data that happens to reside in the same cache line. Every write then causes the cache line -- which also contains data the other processor needs -- to be invalidated and reloaded again and again. Effectively, different threads make each other wait by inducing cache misses in this situation. See also (thanks to @Matt for the link): How and when should I align to cache line size?
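
A minimal sketch of the situation (the 64-byte alignment is an assumption about a typical cache line size; since C++17, std::hardware_destructive_interference_size in <new> provides a portable hint):

    #include <atomic>
    #include <thread>

    // Two counters that are logically independent but share a cache line:
    // the threads invalidate each other's copy of the line on every write.
    struct SharedCounters {
        std::atomic<long> a{0};   // written by thread 1
        std::atomic<long> b{0};   // written by thread 2, same cache line as `a`
    };

    // Aligning each counter to its own (assumed) 64-byte cache line removes
    // the false sharing.
    struct PaddedCounters {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <typename Counters>
    void hammer(Counters& c)
    {
        std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) c.a.fetch_add(1); });
        std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) c.b.fetch_add(1); });
        t1.join();
        t2.join();
    }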

An extreme symptom of poor caching in RAM memory (probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in the current page) which require disk access.
