multiprocessing: Understanding the logic behind `chunksize`

Question

Short Answer

Pool's chunksize-algorithm is a heuristic. It provides a simple solution for all imaginable problem scenarios you are trying to stuff into Pool's methods. As a consequence, it cannot be optimized for any specific scenario.

The algorithm arbitrarily divides the iterable into approximately four times more chunks than the naive approach. More chunks mean more overhead, but increased scheduling flexibility. As this answer will show, this leads to a higher worker-utilization on average, but without the guarantee of a shorter overall computation time for every case.

"That's nice to know" you might think, "but how does knowing this help me with my concrete multiprocessing problems?" Well, it doesn't. The more honest short answer is, "there is no short answer", "multiprocessing is complex" and "it depends". An observed symptom can have different roots, even for similar scenarios.

This answer tries to provide you with basic concepts that help you get a clearer picture of Pool's scheduling black box. It also tries to give you some basic tools for recognizing and avoiding potential cliffs as far as they are related to chunksize.

Table of Contents

Part I

1. Definitions
2. Parallelization Goals
3. Parallelization Scenarios
4. Risks of Chunksize > 1
5. Pool's Chunksize-Algorithm
6. Quantifying Algorithm Efficiency

6.1 Models

6.2 Parallel Schedule

6.3 Efficiencies

6.3.1 Absolute Distribution Efficiency (ADE)

6.3.2 Relative Distribution Efficiency (RDE)

Part II

7. Naive vs. Pool's Chunksize-Algorithm
8. Reality Check
9. Conclusion

It is necessary to clarify some important terms first.

1. Definitions

Chunk

A chunk here is a share of the iterable-argument specified in a pool-method call. How the chunksize gets calculated and what effects this can have is the topic of this answer.

Task

A task's physical representation in a worker-process in terms of data can be seen in the figure below.

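The figure shows an example call to pool.map(), displayed along a line of code taken from the multiprocessing.pool.worker function, where a task read from the inqueue gets unpacked. worker is the underlying main-function in the MainThread of a pool-worker-process. The func-argument specified in the pool-method will only match the func-variable inside the worker-function for single-call methods like apply_async and for imap with chunksize=1. For the rest of the pool-methods with a chunksize-parameter, the processing-function func will be a mapper-function (mapstar or starmapstar). This function maps the user-specified func-parameter on every element of the transmitted chunk of the iterable (--> "map-tasks"). The time this takes defines a task also as a unit of work.

Taskel

While the usage of the word "task" for the whole processing of one chunk is matched by code within multiprocessing.pool, there is no indication of how a single call to the user-specified func, with one element of the chunk as argument(s), should be referred to. To avoid confusion emerging from naming conflicts (think of the maxtasksperchild-parameter for Pool's __init__-method), this answer will refer to the single units of work within a task as taskels.

A taskel (from task + element) is the smallest unit of work within a task. It is the single execution of the function specified with the func-parameter of a Pool-method, called with arguments obtained from a single element of the transmitted chunk. A task consists of chunksize taskels.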
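The relationship between chunks, tasks and taskels can be sketched in a few lines. This is only an illustration of the terminology and runs in a single process; `iter_tasks`, `run_task` and `square` are hypothetical names chosen for this sketch and are not part of `multiprocessing`.

```python
from itertools import islice


def iter_tasks(func, iterable, chunksize):
    """Yield one (func, chunk) pair per task (hypothetical helper)."""
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, chunksize))
        if not chunk:
            return
        yield func, chunk


def run_task(task):
    """Process one task: every single call of func is one taskel."""
    func, chunk = task
    return [func(element) for element in chunk]


def square(x):
    return x * x


tasks = list(iter_tasks(square, range(7), chunksize=3))
chunks = [chunk for _, chunk in tasks]
results = [run_task(task) for task in tasks]

# Three tasks: two with 3 taskels each, the last one with a single taskel.
print(chunks)   # [(0, 1, 2), (3, 4, 5), (6,)]
print(results)  # [[0, 1, 4], [9, 16, 25], [36]]
```

Seven elements with a chunksize of 3 yield three tasks, and the last task holds fewer taskels than the others because the iterable runs out.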

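Parallelization Overhead (PO)

PO consists of Python-internal overhead and overhead for inter-process communication (IPC). The per-task overhead within Python comes with the code needed for packaging and unpacking the tasks and their results. IPC-overhead comes with the necessary synchronization of threads and the copying of data between different address spaces (two copy steps needed: parent -> queue -> child). The amount of IPC-overhead is OS-, hardware- and data-size dependent, which makes generalizations about the impact difficult.

2. Parallelization Goals

When using multiprocessing, our overall goal (obviously) is to minimize the total processing time for all tasks. To reach this overall goal, our technical goal needs to be optimizing the utilization of hardware resources.

Some important sub-goals for achieving the technical goal are:

minimize parallelization overhead (most famously, but not alone: IPC)
high utilization across all cpu-cores
keeping memory usage limited to prevent the OS from excessive paging (thrashing)

At first, the tasks need to be computationally heavy (intensive) enough to earn back the PO we have to pay for parallelization. The relevance of PO decreases with increasing absolute computation time per taskel. Or, to put it the other way around, the bigger the absolute computation time per taskel for your problem, the less relevant gets the need for reducing PO. If your computation will take hours per taskel, the IPC overhead will be negligible in comparison. The primary concern here is to prevent idling worker processes after all tasks have been distributed. Keeping all cores loaded means we are parallelizing as much as possible.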

3. Parallelization Scenarios

What factors determine an optimal chunksize argument to methods like multiprocessing.Pool.map()

The major factor in question is how much computation time may vary across our single taskels. To put a name on it, the choice of an optimal chunksize is determined by the Coefficient of Variation (CV) of the computation times per taskel.
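As a rough illustration, the CV is the standard deviation of the taskel durations divided by their mean. The sketch below assumes you already have measured or estimated durations; the function name and the sample durations are made up for this example, and the use of the population standard deviation is a choice of this sketch.

```python
from statistics import mean, pstdev


def coefficient_of_variation(durations):
    """Return the population standard deviation divided by the mean."""
    return pstdev(durations) / mean(durations)


dense_durations = [60, 60, 60, 60]     # every taskel takes one minute
wide_durations = [60, 60, 86400, 60]   # one taskel takes a full day

cv_dense = coefficient_of_variation(dense_durations)
cv_wide = coefficient_of_variation(wide_durations)

print(cv_dense)  # 0.0
print(cv_wide)   # well above 1 for this sample
```

A CV of zero means all taskels take the same time, while the single heavy taskel in the second sample pushes the CV far above zero.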

The two extreme scenarios on a scale, following from the extent of this variation are:

All taskels need exactly the same computation time.
A taskel could take seconds or days to finish.

For better memorability, I will refer to these scenarios as:

Dense Scenario
Wide Scenario

Dense Scenario

In a Dense Scenario it would be desirable to distribute all taskels at once, to keep necessary IPC and context switching at a minimum. This means we want to create only as many chunks as there are worker processes. As already stated above, the weight of PO increases with shorter computation times per taskel.

For maximal throughput, we also want all worker processes busy until all tasks are processed (no idling workers). For this goal, the distributed chunks should be of equal size or close to it.

Wide Scenario

The prime example for a Wide Scenario would be an optimization problem, where results either converge quickly or computation can take hours, if not days. Usually it is not predictable what mixture of "light taskels" and "heavy taskels" a task will contain in such a case, hence it's not advisable to distribute too many taskels in a task-batch at once. Distributing fewer taskels at once than possible means increasing scheduling flexibility. This is needed here to reach our sub-goal of high utilization of all cores.

If Pool methods were, by default, totally optimized for the Dense Scenario, they would increasingly create suboptimal timings for every problem located closer to the Wide Scenario.

4. Risks of Chunksize > 1

Consider this simplified pseudo-code example of a Wide Scenario-iterable, which we want to pass into a pool-method:

good_luck_iterable = [60, 60, 86400, 60, 86400, 60, 60, 86400]

Instead of the actual values, we pretend to see the needed computation time in seconds, for simplicity only 1 minute or 1 day. We assume the pool has four worker processes (on four cores) and chunksize is set to 2. Because the order will be kept, the chunks sent to the workers will be these:

[(60, 60), (86400, 60), (86400, 60), (60, 86400)]

Since we have enough workers and the computation time is high enough, we can say that every worker process will get a chunk to work on in the first place. (This does not have to be the case for fast completing tasks.) Further, we can say the whole processing will take about 86400+60 seconds, because that's the highest total computation time for a chunk in this artificial scenario and we distribute chunks only once.

Now consider this iterable, which has only one element switching its position compared to the previous iterable:

bad_luck_iterable = [60, 60, 86400, 86400, 60, 60, 60, 86400]

...and the corresponding chunks:

[(60, 60), (86400, 86400), (60, 60), (60, 86400)]

Just bad luck with the sorting of our iterable nearly doubled (86400+86400) our total processing time! The worker getting the vicious (86400, 86400)-chunk is blocking the second heavy taskel in its task from getting distributed to one of the idling workers already finished with their (60, 60)-chunks. We obviously would not risk such an unpleasant outcome if we set chunksize=1.

This is the risk of bigger chunksizes. With higher chunksizes we trade scheduling flexibility for less overhead and in cases like above, that's a bad deal.
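The effect can be checked with a small simulation. The sketch below assumes that chunks are handed out in order to whichever worker becomes free first and that overhead is zero; `makespan` is a hypothetical helper, and every heavy taskel is set to exactly one day.

```python
import heapq

MINUTE, DAY = 60, 86400


def makespan(durations, n_workers, chunksize):
    """Total time when chunks go, in order, to the next free worker.

    `durations` holds the computation time per taskel; overhead is ignored.
    """
    chunks = [durations[i:i + chunksize]
              for i in range(0, len(durations), chunksize)]
    free_at = [0] * n_workers  # min-heap of times at which workers get free
    for chunk in chunks:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + sum(chunk))
    return max(free_at)


good_luck = [MINUTE, MINUTE, DAY, MINUTE, DAY, MINUTE, MINUTE, DAY]
bad_luck = [MINUTE, MINUTE, DAY, DAY, MINUTE, MINUTE, MINUTE, DAY]

print(makespan(good_luck, n_workers=4, chunksize=2))  # 86460
print(makespan(bad_luck, n_workers=4, chunksize=2))   # 172800
print(makespan(bad_luck, n_workers=4, chunksize=1))   # 86520
```

With chunksize=1 the unlucky ordering costs only two extra minutes over a single day, because an idle worker can pick up the second heavy taskel.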

As we will see in chapter 6, Quantifying Algorithm Efficiency, bigger chunksizes can also lead to suboptimal results for Dense Scenarios.

5. Pool's Chunksize-Algorithm

Below you will find a slightly modified version of the algorithm inside the source code. As you can see, I cut off the lower part and wrapped it into a function for calculating the chunksize argument externally. I also replaced 4 with a factor parameter and outsourced the len() calls.

# mp_utils.py

def calc_chunksize(n_workers, len_iterable, factor=4):
    """Calculate chunksize argument for Pool-methods.

    Resembles source-code within `multiprocessing.pool.Pool._map_async`.
    """
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    return chunksize

To ensure we are all on the same page, here's what divmod does:

divmod(x, y) is a builtin function which returns (x//y, x%y). x // y is the floor division, returning the rounded-down quotient of x / y, while x % y is the modulo operation returning the remainder of x / y. Hence e.g. divmod(10, 3) returns (3, 1).

Now when you look at chunksize, extra = divmod(len_iterable, n_workers * 4), you will notice that n_workers is part of the divisor y in x / y. Multiplying it by 4, without the further adjustment through if extra: chunksize += 1 later on, leads to an initial chunksize at least four times smaller (for len_iterable >= n_workers * 4) than it would be otherwise.
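A few concrete values may help to see both steps at work. The sketch repeats calc_chunksize so that it runs on its own; the chosen lengths are arbitrary examples for a pool of four workers.

```python
def calc_chunksize(n_workers, len_iterable, factor=4):
    """Same calculation as above, repeated to keep the example standalone."""
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    return chunksize


n_workers = 4
rows = []
for len_iterable in (10, 16, 17, 500):
    quotient, extra = divmod(len_iterable, n_workers * 4)
    rows.append((len_iterable, quotient, extra,
                 calc_chunksize(n_workers, len_iterable)))

for row in rows:
    print(row)
# (10, 0, 10, 1)
# (16, 1, 0, 1)
# (17, 1, 1, 2)
# (500, 31, 4, 32)
```

Each row shows the length, the floor quotient, the remainder and the final chunksize. Any non-zero remainder bumps the chunksize by one, which also keeps it from ever being zero for short iterables.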

For viewing the effect of multiplication by 4 on the intermediate chunksize result consider this function:

def compare_chunksizes(len_iterable, n_workers=4):
    """Calculate naive chunksize, Pool's stage-1 chunksize and the chunksize
    for Pool's complete algorithm. Return chunksizes and the real factors by
    which naive chunksizes are bigger.
    """
    cs_naive = len_iterable // n_workers or 1  # naive approach
    cs_pool1 = len_iterable // (n_workers * 4) or 1  # incomplete pool algo.
    cs_pool2 = calc_chunksize(n_workers, len_iterable)

    real_factor_pool1 = cs_naive / cs_pool1
    real_factor_pool2 = cs_naive / cs_pool2

    return cs_naive, cs_pool1, cs_pool2, real_factor_pool1, real_factor_pool2

The function above calculates the naive chunksize (cs_naive) and the first-step chunksize of Pool's chunksize-algorithm (cs_pool1), as well as the chunksize for the complete Pool-algorithm (cs_pool2). Further, it calculates the real factors rf_pool1 = cs_naive / cs_pool1 and rf_pool2 = cs_naive / cs_pool2 (named real_factor_pool1 and real_factor_pool2 in the code), which tell us how many times the naively calculated chunksizes are bigger than Pool's internal version(s).

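Below are two figures created with output from this function. The left figure just shows the chunksizes for n_workers=4 up until an iterable length of 500. The right figure shows the values for rf_pool1. For iterable length 16, the real factor becomes >=4 (for len_iterable >= n_workers * 4) and its maximum value is 7 for iterable lengths 28-31. That's a massive deviation from the original factor 4 the algorithm converges to for longer iterables. "Longer" here is relative and depends on the number of specified workers.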
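Since the figures are not reproduced here, the peak of the stage-1 real factor can be recomputed directly. This is a minimal sketch for n_workers=4 and lengths 16 to 500; `real_factor_stage1` is a hypothetical helper that mirrors the cs_naive / cs_pool1 calculation.

```python
def real_factor_stage1(len_iterable, n_workers=4, factor=4):
    """Naive chunksize divided by Pool's chunksize before the extra step."""
    cs_naive = len_iterable // n_workers or 1
    cs_pool1 = len_iterable // (n_workers * factor) or 1
    return cs_naive / cs_pool1


factors = {n: real_factor_stage1(n) for n in range(16, 501)}
peak = max(factors.values())
peak_lengths = [n for n, rf in factors.items() if rf == peak]

print(peak, peak_lengths)     # 7.0 [28, 29, 30, 31]
print(min(factors.values()))  # 4.0
```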

Remember that chunksize cs_pool1 still lacks the extra-adjustment with the remainder from divmod, which is contained in cs_pool2 from the complete algorithm.

The algorithm goes on with:

if extra:
    chunksize += 1

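Now, in cases where there is a remainder (an extra from the divmod-operation), increasing the chunksize by 1 obviously cannot work out for every task. After all, if it did, there would not be a remainder to begin with.

As you can see in the figures below, the "extra-treatment" has the effect that the real factor rf_pool2 now converges towards 4 from below 4, and the deviation is somewhat smoother. The standard deviation for n_workers=4 and len_iterable=500 drops from 0.5233 for rf_pool1 to 0.4115 for rf_pool2.

Eventually, increasing chunksize by 1 has the effect that the last task transmitted only has a size of len_iterable % chunksize or chunksize.

The more interesting and, as we will see later, more consequential effect of the extra-treatment, however, can be observed for the number of generated chunks (n_chunks). For long enough iterables, Pool's completed chunksize-algorithm (n_pool2 in the figure below) will stabilize the number of chunks at n_chunks == n_workers * 4. In contrast, the naive algorithm (after an initial burp) keeps alternating between n_chunks == n_workers and n_chunks == n_workers + 1 as the length of the iterable grows.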
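The behaviour of n_chunks can be verified numerically. The sketch below uses two hypothetical helpers, `n_chunks_pool` and `n_chunks_naive`, with four workers; the starting length of 240 is simply a point after which the Pool variant has settled in this configuration.

```python
def n_chunks_pool(len_iterable, n_workers=4, factor=4):
    """Number of chunks produced by Pool's complete chunksize calculation."""
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    return -(-len_iterable // chunksize)  # ceiling division


def n_chunks_naive(len_iterable, n_workers=4):
    """Number of chunks when the iterable is split once per worker."""
    chunksize = len_iterable // n_workers or 1
    return -(-len_iterable // chunksize)


lengths = range(240, 501)
pool_counts = {n_chunks_pool(n) for n in lengths}
naive_counts = {n_chunks_naive(n) for n in lengths}

print(pool_counts)   # {16}
print(naive_counts)  # {4, 5}
```

For every length in that range the Pool variant yields exactly n_workers * 4 chunks, while the naive split yields either n_workers or n_workers + 1 chunks.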

Below you will find two enhanced info-functions for Pool's and the naive chunksize-algorithm. The output of these functions will be needed in the next chapter.

# mp_utils.py

from collections import namedtuple


Chunkinfo = namedtuple(
    'Chunkinfo', ['n_workers', 'len_iterable', 'n_chunks',
                  'chunksize', 'last_chunk']
)

def calc_chunksize_info(n_workers, len_iterable, factor=4):
    """Calculate chunksize numbers."""
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    # `+ (len_iterable % chunksize > 0)` exploits that `True == 1`
    n_chunks = len_iterable // chunksize + (len_iterable % chunksize > 0)
    # exploit `0 == False`
    last_chunk = len_iterable % chunksize or chunksize

    return Chunkinfo(
        n_workers, len_iterable, n_chunks, chunksize, last_chunk
    )

Don't be confused by the probably unexpected look of calc_naive_chunksize_info. The extra from divmod is not used for calculating the chunksize.

def calc_naive_chunksize_info(n_workers, len_iterable):
    """Calculate naive chunksize numbers."""
    chunksize, extra = divmod(len_iterable, n_workers)
    if chunksize == 0:
        chunksize = 1
        n_chunks = extra
        last_chunk = chunksize
    else:
        n_chunks = len_iterable // chunksize + (len_iterable % chunksize > 0)
        last_chunk = len_iterable % chunksize or chunksize

    return Chunkinfo(
        n_workers, len_iterable, n_chunks, chunksize, last_chunk
    )

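6. Quantifying Algorithm Efficiency

Now, after we have seen how the output of Pool's chunksize-algorithm looks different compared to the output from the naive algorithm...

How to tell if Pool's approach actually improves something?
And what exactly could this something be?

As shown in the previous chapter, for longer iterables (a bigger number of taskels), Pool's chunksize-algorithm approximately divides the iterable into four times more chunks than the naive method. Smaller chunks mean more tasks, and more tasks mean more Parallelization Overhead (PO), a cost which must be weighed against the benefit of increased scheduling flexibility (recall "Risks of Chunksize > 1").

For rather obvious reasons, Pool's basic chunksize-algorithm cannot weigh scheduling flexibility against PO for us. IPC-overhead is OS-, hardware- and data-size dependent. The algorithm cannot know on what hardware we run our code, nor does it have a clue how long a taskel will take to finish. It's a heuristic providing basic functionality for all possible scenarios. This means it cannot be optimized for any scenario in particular. As mentioned before, PO also becomes increasingly less of a concern with increasing computation times per taskel (negative correlation).

When you recall the Parallelization Goals from chapter 2, one bullet-point was:

high utilization across all cpu-cores

The previously mentioned something that Pool's chunksize-algorithm can try to improve is the minimization of idling worker-processes, respectively the utilization of cpu-cores.

A recurring question on SO regarding multiprocessing.Pool is asked by people wondering about unused cores / idling worker-processes in situations where you would expect all worker-processes to be busy. While this can have many reasons, idling worker-processes towards the end of a computation are an observation we can often make, even with Dense Scenarios (equal computation times per taskel), in cases where the number of workers is not a divisor of the number of chunks (n_chunks % n_workers > 0).

The question now is:

How can we practically translate our understanding of chunksizes into something which enables us to explain observed worker-utilization, or even compare the efficiency of different algorithms in that regard?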

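6.1 Models

For gaining deeper insights here, we need a form of abstraction of parallel computations which simplifies the overly complex reality down to a manageable degree of complexity, while preserving significance within defined boundaries. Such an abstraction is called a model. An implementation of such a "Parallelization Model" (PM) generates worker-mapped meta-data (timestamps) as real computations would, if the data were to be collected. The model-generated meta-data allows predicting metrics of parallel computations under certain constraints.

One of two sub-models within the PM defined here is the Distribution Model (DM). The DM explains how atomic units of work (taskels) are distributed over parallel workers and time, when no factors other than the respective chunksize-algorithm, the number of workers, the input-iterable (number of taskels) and their computation duration are considered. This means any form of overhead is not included.

For obtaining a complete PM, the DM is extended with an Overhead Model (OM), representing various forms of Parallelization Overhead (PO). Such a model needs to be calibrated for each node individually (hardware-, OS-dependencies). How many forms of overhead are represented in an OM is left open, and so multiple OMs with varying degrees of complexity can exist. Which level of accuracy the implemented OM needs is determined by the overall weight of PO for the specific computation. Shorter taskels lead to a higher weight of PO, which in turn requires a more precise OM if we were attempting to predict Parallelization Efficiencies (PE).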

6.2 Parallel Schedule (PS)

The Parallel Schedule is a two-dimensional representation of the parallel computation, where the x-axis represents time and the y-axis represents a pool of parallel workers. The number of workers and the total computation time mark the extent of a rectangle, in which smaller rectangles are drawn. These smaller rectangles represent atomic units of work (taskels).

Below you find the visualization of a PS drawn with data from the DM of Pool's chunksize-algorithm for the Dense Scenario.

The x-axis is sectioned into equal units of time, where each unit stands for the computation time a taskel requires.
The y-axis is divided into the number of worker-processes the pool uses.
A taskel here is displayed as the smallest cyan-colored rectangle, put into a timeline (a schedule) of an anonymized worker-process.
A task is one or multiple taskels in a worker-timeline continuously highlighted with the same hue.
Idling time units are represented through red colored tiles.
The Parallel Schedule is partitioned into sections. The last section is the tail-section.

The names for the composed parts can be seen in the picture below.

In a complete PM including an OM, the Idling Share is not limited to the tail, but also comprises space between tasks and even between taskels.

6.3 Efficiencies

The Models introduced above allow quantifying the rate of worker-utilization. We can distinguish:

Distribution Efficiency (DE) - calculated with help of a DM (or a simplified method for the Dense Scenario).
Parallelization Efficiency (PE) - either calculated with help of a calibrated PM (prediction) or calculated from meta-data of real computations.

It's important to note, that calculated efficiencies do not automatically correlate with faster overall computation for a given parallelization problem. Worker-utilization in this context only distinguishes between a worker having a started, yet unfinished taskel and a worker not having such an "open" taskel. That means, possible idling during the time span of a taskel is not registered.

All above mentioned efficiencies are basically obtained by calculating the quotient of the division Busy Share / Parallel Schedule. The difference between DE and PE comes with the Busy Share occupying a smaller portion of the overall Parallel Schedule for the overhead-extended PM.

This answer will only further discuss a simple method to calculate DE for the Dense Scenario. This is sufficient to compare different chunksize-algorithms, since...

... the DM is the part of the PM, which changes with different chunksize-algorithms employed.
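... the Dense Scenario with equal computation durations per taskel depicts a "stable state", for which these time spans drop out of the equation. Any other scenario would just lead to random results, since the ordering of taskels would matter.

6.3.1 Absolute Distribution Efficiency (ADE)

This basic efficiency can be calculated in general by dividing the Busy Share by the whole potential of the Parallel Schedule:

Absolute Distribution Efficiency (ADE) = Busy Share / Parallel Schedule

For the Dense Scenario, the simplified calculation-code looks like this: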

# mp_utils.py

def calc_ade(n_workers, len_iterable, n_chunks, chunksize, last_chunk):
    """Calculate Absolute Distribution Efficiency (ADE).

    `len_iterable` is not used, but contained to keep a consistent signature
    with `calc_rde`.
    """
    if n_workers == 1:
        return 1

    potential = (
        ((n_chunks // n_workers + (n_chunks % n_workers > 1)) * chunksize)
        + (n_chunks % n_workers == 1) * last_chunk
    ) * n_workers

    n_full_chunks = n_chunks - (chunksize > last_chunk)
    taskels_in_regular_chunks = n_full_chunks * chunksize
    real = taskels_in_regular_chunks + (chunksize > last_chunk) * last_chunk
    ade = real / potential

    return ade

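If there is no Idling Share, the Busy Share will be equal to the Parallel Schedule, hence we get an ADE of 100%. In our simplified model, this is a scenario where all available processes will be busy through the whole time needed for processing all tasks. In other words, the whole job gets effectively parallelized to 100 percent.

But why do I keep referring to PE as absolute PE here?

To comprehend that, we have to consider a possible case for the chunksize (cs) which ensures maximal scheduling flexibility (also, the number of Highlanders there can be. Coincidence?):

__________________________________~ ONE ~__________________________________

If we, for example, have four worker-processes and 37 taskels, there will be idling workers even with chunksize=1, just because n_workers=4 is not a divisor of 37. The remainder of dividing 37 by 4 is 1. This single remaining taskel will have to be processed by a sole worker, while the remaining three are idling.

Likewise, there will still be one idling worker with 39 taskels, as you can see pictured below.

When you compare the upper Parallel Schedule for chunksize=1 with the version below for chunksize=3, you will notice that the upper Parallel Schedule is smaller and the timeline on the x-axis is shorter. It should become obvious now how bigger chunksizes can unexpectedly also lead to increased overall computation times, even for Dense Scenarios.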
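The three cases just described can be reproduced with a tiny distribution simulation for the Dense Scenario. This is a sketch under the assumptions that chunks go in order to the next free worker, that every taskel takes one time unit and that overhead is zero; `worker_loads` and `ade_from_loads` are hypothetical helpers.

```python
import heapq


def worker_loads(n_workers, len_iterable, chunksize):
    """Return the sorted busy time per worker, measured in taskel durations."""
    free_at = [0] * n_workers  # min-heap: time at which each worker is free
    for start in range(0, len_iterable, chunksize):
        size = min(chunksize, len_iterable - start)  # last chunk may be short
        heapq.heappush(free_at, heapq.heappop(free_at) + size)
    return sorted(free_at)


def ade_from_loads(loads):
    """Busy Share divided by the full rectangle of the Parallel Schedule."""
    return sum(loads) / (max(loads) * len(loads))


for args in [(4, 37, 1), (4, 39, 1), (4, 39, 3)]:
    loads = worker_loads(*args)
    print(args, loads, ade_from_loads(loads))
# (4, 37, 1) [9, 9, 9, 10] 0.925
# (4, 39, 1) [9, 10, 10, 10] 0.975
# (4, 39, 3) [9, 9, 9, 12] 0.8125
```

For 39 taskels the schedule is 10 time units long with chunksize=1 but 12 units long with chunksize=3, and the larger tail shows up as a lower ADE.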

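But why not just use the length of the x-axis for efficiency calculations?

Because the overhead is not contained in this model. It will be different for both chunksizes, hence the x-axis is not really directly comparable. The overhead can still lead to a longer total computation time, as shown in case 2 from the figure below.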

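6.3.2 Relative Distribution Efficiency (RDE)

The ADE value does not contain the information whether a better distribution of taskels is possible with chunksize set to 1. Better here still means a smaller Idling Share.

To get a DE value adjusted for the maximum possible DE, we have to divide the considered ADE by the ADE we get for chunksize=1.

Relative Distribution Efficiency (RDE) = ADE_cs_x / ADE_cs_1

Here is how this looks in code: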

# mp_utils.py

def calc_rde(n_workers, len_iterable, n_chunks, chunksize, last_chunk):
    """Calculate Relative Distribution Efficiency (RDE)."""
    ade_cs1 = calc_ade(
        n_workers, len_iterable, n_chunks=len_iterable,
        chunksize=1, last_chunk=1
    )
    ade = calc_ade(n_workers, len_iterable, n_chunks, chunksize, last_chunk)
    rde = ade / ade_cs1

    return rde

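RDE, as defined here, is in essence a tale about the tail of a Parallel Schedule. RDE is influenced by the maximum effective chunksize contained in the tail. (This tail can be of x-axis length chunksize or last_chunk.) This has the consequence that RDE naturally converges to 100% (even) for all sorts of "tail-looks" like those shown in the figure below.

A low RDE ...

is a strong hint for optimization potential.
naturally gets less likely for longer iterables, because the relative tail-portion of the overall Parallel Schedule shrinks.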
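Two sample points illustrate the shrinking tail. The sketch below derives ADE by simulating the Dense Scenario rather than by the closed formula, so `dense_ade` and `dense_rde` are hypothetical stand-ins for this example only. The chunksizes 3 and 32 are what the chunksize calculation from chapter 5 yields for 39 and 500 taskels with four workers.

```python
import heapq


def dense_ade(n_workers, len_iterable, chunksize):
    """ADE for equally long taskels, found by simulating the distribution."""
    free_at = [0] * n_workers  # min-heap: time at which each worker is free
    for start in range(0, len_iterable, chunksize):
        size = min(chunksize, len_iterable - start)
        heapq.heappush(free_at, heapq.heappop(free_at) + size)
    return len_iterable / (max(free_at) * n_workers)


def dense_rde(n_workers, len_iterable, chunksize):
    """ADE for the given chunksize relative to the ADE for chunksize=1."""
    return (dense_ade(n_workers, len_iterable, chunksize)
            / dense_ade(n_workers, len_iterable, 1))


print(round(dense_rde(4, 39, 3), 4))  # 0.8333
print(dense_rde(4, 500, 32))          # 0.9765625
```

The short iterable loses about a sixth of its achievable efficiency to the tail, while the longer one loses only a little over two percent. Two points are of course no proof of the general trend.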

Please find Part II of this answer here.

Answer 1

Short Answer

Pool's chunksize-algorithm is a heuristic. It provides a simple solution for all imaginable problem scenarios you are trying to stuff into Pool's methods. As a consequence, it cannot be optimized for any specific scenario.

The algorithm arbitrarily divides the iterable in approximately four times more chunks than the naive approach. More chunks mean more overhead, but increased scheduling flexibility. How this answer will show, this leads to a higher worker-utilization on average, but without the guarantee of a shorter overall computation time for every case.

"That's nice to know" you might think, "but how does knowing this help me with my concrete multiprocessing problems?" Well, it doesn't. The more honest short answer is, "there is no short answer", "multiprocessing is complex" and "it depends". An observed symptom can have different roots, even for similar scenarios.

This answer tries to provide you with basic concepts helping you to get a clearer picture of Pool's scheduling black box. It also tries to give you some basic tools at hand for recognizing and avoiding potential cliffs as far they are related to chunksize.

Table of Contents

Part I

Definitions
Parallelization Goals
Parallelization Scenarios
Risks of Chunksize > 1
Pool's Chunksize-Algorithm
Quantifying Algorithm Efficiency

6.1 Models

6.2 Parallel Schedule

6.3 Efficiencies

6.3.1 Absolute Distribution Efficiency (ADE)

6.3.2 Relative Distribution Efficiency (RDE)

Part II

Naive vs. Pool's Chunksize-Algorithm
Reality Check
Conclusion

It is necessary to clarify some important terms first.

1. Definitions

Chunk

A chunk here is a share of the iterable-argument specified in a pool-method call. How the chunksize gets calculated and what effects this can have, is the topic of this answer.

Task

A task's physical representation in a worker-process in terms of data can be seen in the figure below.

pool.map()この図は、関数から取得したコード行に沿って表示されるの呼び出し例を示しています。このコードmultiprocessing.pool.worker行では、から読み取られたタスクがinqueue展開されます。は、プールワーカープロセスのworkerの基になるメイン関数です。プールメソッドで指定された引数は、を使用したやなどの単一呼び出しメソッドの場合のみ、関数内の変数と一致します。パラメータを持つその他のプールメソッドの場合、処理関数はマッパー関数 (または) になります。この関数は、反復可能オブジェクトの送信されたチャンクのすべての要素に、ユーザーが指定したパラメータをマップします (-->「map-tasks」)。この処理にかかる時間によって、MainThreadfuncfuncworkerapply_asyncimapchunksize=1chunksizefuncmapstarstarmapstarfuncタスクまた、作業単位。

タスクル

「タスク」という言葉の使用は全体1つのチャンクの処理は、コード内のコードによって一致しますがmultiprocessing.pool、単一通話funcチャンクの1つの要素を引数として持つ、ユーザー指定のを参照する必要があります。名前の競合による混乱を避けるため（ maxtasksperchildPoolの__init__-methodの -parameterを考えてみてください）、この回答ではタスク内の単一の作業単位を次のように参照します。タスク。

あタスク（からタスク + エルメントは、タスクこれは、メソッドfuncのパラメータで指定された関数Poolを、単一の要素送信されたかたまり.あタスク構成chunksize タスク。

並列化オーバーヘッド (PO)

POPython 内部のオーバーヘッドとプロセス間通信 (IPC) のオーバーヘッドで構成されます。Python 内のタスクごとのオーバーヘッドには、タスクとその結果のパッケージ化とアンパックに必要なコードが伴います。IPC オーバーヘッドには、必要なスレッドの同期と、異なるアドレス空間間でのデータのコピー (必要なコピー手順は 2 つ: 親 -> キュー -> 子) が伴います。IPC オーバーヘッドの量は OS、ハードウェア、およびデータサイズに依存するため、影響について一般化することは困難です。

2. 並列化の目標

マルチプロセッシングを使用する場合、全体的な目標は（当然ですが）すべてのタスクの合計処理時間を最小化することです。この全体的な目標を達成するために、技術的目標必要となるハードウェアリソースの利用を最適化する。

技術目標を達成するための重要なサブ目標は次のとおりです。

並列化のオーバーヘッドを最小限に抑える（最も有名なのは、しかしそれだけではない：国際PC）
すべてのCPUコアで高い使用率
OSの過剰なページングを防ぐためにメモリ使用量を制限します（破壊する）

まず、タスクは計算的に重い（集中的）である必要があり、取り戻す並列化のために支払う必要があるPO。タスクあたりの絶対計算時間が長くなると、POの重要性は低下します。逆に言えば、絶対計算時間が長くなると、タスクあたり for your problem, the less relevant gets the need for reducing PO. If your computation will take hours per taskel, the IPC overhead will be negligible in comparison. The primary concern here is to prevent idling worker processes after all tasks have been distributed. Keeping all cores loaded means, we are parallelizing as much as possible.

3. Parallelization Scenarios

What factors determine an optimal chunksize argument to methods like multiprocessing.Pool.map()

The major factor in question is how much computation time may vary across our single taskels. To name it, the choice for an optimal chunksize is determined by the Coefficient of Variation (CV) for computation times per taskel.

The two extreme scenarios on a scale, following from the extent of this variation are:

All taskels need exactly the same computation time.
A taskel could take seconds or days to finish.

For better memorability, I will refer to these scenarios as:

Dense Scenario
Wide Scenario

Dense Scenario

In a Dense Scenario it would be desirable to distribute all taskels at once, to keep necessary IPC and context switching at a minimum. This means we want to create only as much chunks, as much worker processes there are. How already stated above, the weight of PO increases with shorter computation times per taskel.

For maximal throughput, we also want all worker processes busy until all tasks are processed (no idling workers). For this goal, the distributed chunks should be of equal size or close to.

Wide Scenario

The prime example for a Wide Scenario would be an optimization problem, where results either converge quickly or computation can take hours, if not days. Usually it is not predictable what mixture of "light taskels" and "heavy taskels" a task will contain in such a case, hence it's not advisable to distribute too many taskels in a task-batch at once. Distributing less taskels at once than possible, means increasing scheduling flexibility. This is needed here to reach our sub-goal of high utilization of all cores.

If Pool methods, by default, would be totally optimized for the Dense Scenario, they would increasingly create suboptimal timings for every problem located closer to the Wide Scenario.

4. Risks of Chunksize > 1

Consider this simplified pseudo-code example of a Wide Scenario-iterable, which we want to pass into a pool-method:

good_luck_iterable = [60, 60, 86400, 60, 86400, 60, 60, 84600]

Instead of the actual values, we pretend to see the needed computation time in seconds, for simplicity only 1 minute or 1 day. We assume the pool has four worker processes (on four cores) and chunksize is set to 2. Because the order will be kept, the chunks send to the workers will be these:

[(60, 60), (86400, 60), (86400, 60), (60, 84600)]

Since we have enough workers and the computation time is high enough, we can say, that every worker process will get a chunk to work on in the first place. (This does not have to be the case for fast completing tasks). Further we can say, the whole processing will take about 86400+60 seconds, because that's the highest total computation time for a chunk in this artificial scenario and we distribute chunks only once.

Now consider this iterable, which has only one element switching its position compared to the previous iterable:

bad_luck_iterable = [60, 60, 86400, 86400, 60, 60, 60, 84600]

...and the corresponding chunks:

[(60, 60), (86400, 86400), (60, 60), (60, 84600)]

Just bad luck with the sorting of our iterable nearly doubled (86400+86400) our total processing time! The worker getting the vicious (86400, 86400)-chunk is blocking the second heavy taskel in its task from getting distributed to one of the idling workers already finished with their (60, 60)-chunks. We obviously would not risk such an unpleasant outcome if we set chunksize=1.

This is the risk of bigger chunksizes. With higher chunksizes we trade scheduling flexibility for less overhead and in cases like above, that's a bad deal.

How we will see in chapter 6. Quantifying Algorithm Efficiency, bigger chunksizes can also lead to suboptimal results for Dense Scenarios.

5. Pool's Chunksize-Algorithm

Below you will find a slightly modified version of the algorithm inside the source code. As you can see, I cut off the lower part and wrapped it into a function for calculating the chunksize argument externally. I also replaced 4 with a factor parameter and outsourced the len() calls.

# mp_utils.py

def calc_chunksize(n_workers, len_iterable, factor=4):
    """Calculate chunksize argument for Pool-methods.

    Resembles source-code within `multiprocessing.pool.Pool._map_async`.
    """
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    return chunksize

To ensure we are all on the same page, here's what divmod does:

divmod(x, y) is a builtin function which returns (x//y, x%y). x // y is the floor division, returning the down rounded quotient from x / y, while x % y is the modulo operation returning the remainder from x / y. Hence e.g. divmod(10, 3) returns (3, 1).

Now when you look at chunksize, extra = divmod(len_iterable, n_workers * 4), you will notice n_workers here is the divisor y in x / y and multiplication by 4, without further adjustment through if extra: chunksize +=1 later on, leads to an initial chunksize at least four times smaller (for len_iterable >= n_workers * 4) than it would be otherwise.

For viewing the effect of multiplication by 4 on the intermediate chunksize result consider this function:

def compare_chunksizes(len_iterable, n_workers=4):
    """Calculate naive chunksize, Pool's stage-1 chunksize and the chunksize
    for Pool's complete algorithm. Return chunksizes and the real factors by
    which naive chunksizes are bigger.
    """
    cs_naive = len_iterable // n_workers or 1  # naive approach
    cs_pool1 = len_iterable // (n_workers * 4) or 1  # incomplete pool algo.
    cs_pool2 = calc_chunksize(n_workers, len_iterable)

    real_factor_pool1 = cs_naive / cs_pool1
    real_factor_pool2 = cs_naive / cs_pool2

    return cs_naive, cs_pool1, cs_pool2, real_factor_pool1, real_factor_pool2

The function above calculates the naive chunksize (cs_naive) and the first-step chunksize of Pool's chunksize-algorithm (cs_pool1), as well as the chunksize for the complete Pool-algorithm (cs_pool2). Further it calculates the real factors rf_pool1 = cs_naive / cs_pool1 and rf_pool2 = cs_naive / cs_pool2, which tell us how many times the naively calculated chunksizes are bigger than Pool's internal version(s).

以下に、この関数の出力で作成された 2 つの図を示します。左の図は、n_workers=4反復可能の長さまでのチャンクサイズを示しています500。右の図はの値を示していますrf_pool1。反復可能の長さの場合16、実数係数は>=4( の場合) となり、反復可能の長さの場合、len_iterable >= n_workers * 4最大値はとなります。これは、より長い反復可能オブジェクトに対してアルゴリズムが収束する元の係数からの大きな偏差です。ここでの「長い」は相対的であり、指定されたワーカーの数によって異なります。728-314

chunksize には、完全なアルゴリズムに含まれる残りの部分との調整がcs_pool1まだ欠けていることに留意してください。extradivmodcs_pool2

アルゴリズムは次のように続きます:

if extra:
    chunksize += 1

今、ケースではは剰余 ( extradivmod 演算から) があるため、チャンクサイズを 1 増やしても、すべてのタスクでうまくいくとは限りません。結局のところ、うまくいくとしたら、そもそも剰余は存在しないはずです。

下の図からわかるように、「追加治療「効果は、本当の要因今のところはからrf_pool2に向かって収束する4下に 4偏差はいくぶん滑らかになります。およびの標準偏差はn_workers=4、ではからではまでlen_iterable=500下がります。0.5233rf_pool10.4115rf_pool2

最終的に、chunksize1 増加すると、最後に送信されるタスクのサイズはだけになりますlen_iterable % chunksize or chunksize。

より興味深いのは、後でより重大な効果がどのように現れるかです。追加治療しかし、生成されたチャンクの数（n_chunks）。十分に長い反復可能オブジェクトの場合、Pool の完了したチャンクサイズアルゴリズム（n_pool2下の図の）は、チャンクの数をで安定させます。対照的に、単純なアルゴリズムは（最初のバープの後）、反復可能オブジェクトの長さが長くなるにつれて、n_chunks == n_workers * 4とを交互に繰り返します。n_chunks == n_workersn_chunks == n_workers + 1

以下に、Pool と単純なチャンクサイズアルゴリズムの 2 つの拡張情報関数を示します。これらの関数の出力は次の章で必要になります。

# mp_utils.py

from collections import namedtuple


Chunkinfo = namedtuple(
    'Chunkinfo', ['n_workers', 'len_iterable', 'n_chunks',
                  'chunksize', 'last_chunk']
)

def calc_chunksize_info(n_workers, len_iterable, factor=4):
    """Calculate chunksize numbers."""
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    # `+ (len_iterable % chunksize > 0)` exploits that `True == 1`
    n_chunks = len_iterable // chunksize + (len_iterable % chunksize > 0)
    # exploit `0 == False`
    last_chunk = len_iterable % chunksize or chunksize

    return Chunkinfo(
        n_workers, len_iterable, n_chunks, chunksize, last_chunk
    )

の予想外の見た目に惑わされないでくださいcalc_naive_chunksize_info。extrafrom はdivmodチャンクサイズの計算には使用されません。

def calc_naive_chunksize_info(n_workers, len_iterable):
    """Calculate naive chunksize numbers."""
    chunksize, extra = divmod(len_iterable, n_workers)
    if chunksize == 0:
        chunksize = 1
        n_chunks = extra
        last_chunk = chunksize
    else:
        n_chunks = len_iterable // chunksize + (len_iterable % chunksize > 0)
        last_chunk = len_iterable % chunksize or chunksize

    return Chunkinfo(
        n_workers, len_iterable, n_chunks, chunksize, last_chunk
    )

6. アルゴリズムの効率を定量化する

Poolさて、の chunksize アルゴリズムの出力が、単純なアルゴリズムの出力と比べてどのように異なるかを確認した後で...

プールのアプローチが実際に改善する何か？
そしてこれは一体何なのだろうか何かなれ？

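As shown in the previous chapter, for longer iterables (a bigger number of taskels), Pool's chunksize-algorithm divides the iterable into approximately four times more chunks than the naive method. Smaller chunks mean more tasks, and more tasks mean more Parallelization Overhead (PO), a cost which must be weighed against the benefit of increased scheduling flexibility (recall "Risks of Chunksize > 1").

For rather obvious reasons, Pool's basic chunksize-algorithm cannot weigh scheduling flexibility against PO for us. IPC-overhead is OS-, hardware- and data-size dependent. The algorithm cannot know on what hardware we run our code, nor does it have a clue how long a taskel will take to finish. It's a heuristic providing basic functionality for all possible scenarios. This means it cannot be optimized for any scenario in particular. As mentioned before, PO also becomes increasingly less of a concern with increasing computation times per taskel (negative correlation).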

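When you recall the Parallelization Goals from chapter 2, one bullet-point was:

high utilization across all cpu-cores

The previously mentioned something that Pool's chunksize-algorithm can try to improve is the minimization of idling worker-processes, respectively the utilization of cpu-cores.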

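A recurring question on SO regarding multiprocessing.Pool is asked by people wondering about unused cores / idling worker-processes in situations where you would expect all worker-processes to be busy. While this can have many reasons, idling worker-processes towards the end of a computation are an observation we can often make, even with Dense Scenarios (equal computation times per taskel), in cases where the number of workers is not a divisor of the number of chunks (n_chunks % n_workers > 0).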
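The divisor condition can be turned into a small check. This is a minimal sketch under simplifying assumptions: a Dense Scenario, no overhead, and chunks processed in lockstep rounds of n_workers chunks each. The function name idle_workers_in_last_round is hypothetical.

```python
# Hypothetical helper for illustration; assumes a Dense Scenario without
# overhead, where chunks are processed in rounds of n_workers chunks.

def idle_workers_in_last_round(n_workers, n_chunks):
    """Return how many workers receive no chunk in the final round."""
    leftover = n_chunks % n_workers
    return 0 if leftover == 0 else n_workers - leftover


print(idle_workers_in_last_round(4, 16))  # n_workers divides n_chunks
print(idle_workers_in_last_round(4, 5))   # one chunk left for the last round
```

With 16 chunks on four workers nobody idles in the last round, whereas 5 chunks leave three of the four workers without work at the end.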

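The question now is:

How can we practically translate our understanding of chunksizes into something which enables us to explain observed worker-utilization, or even compare the efficiency of different algorithms in that regard?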

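6.1 Models

For gaining deeper insights here, we need a form of abstraction of parallel computations which simplifies the overly complex reality down to a manageable degree of complexity, while preserving significance within defined boundaries. Such an abstraction is called a model. An implementation of such a "Parallelization Model" (PM) generates worker-mapped meta-data (timestamps) as real computations would, if the data were to be collected. The model-generated meta-data allows predicting metrics of parallel computations under certain constraints.

One of two sub-models within the here defined PM is the Distribution Model (DM). The DM explains how atomic units of work (taskels) are distributed over parallel workers and time, when no other factors than the respective chunksize-algorithm, the number of workers, the input-iterable (number of taskels) and their computation duration are considered. This means any form of overhead is not included.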
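To make the idea of a DM more tangible, here is a minimal sketch of one for the Dense Scenario. It is an assumption-laden illustration, not the model implementation used for this answer: every taskel takes the same time (taskel_time), there is no overhead, and each chunk goes to whichever worker becomes free first. The names distribute and taskel_time are hypothetical.

```python
import heapq

# Hypothetical Distribution Model sketch: equal taskel durations, no overhead.

def distribute(n_workers, len_iterable, chunksize, taskel_time=1):
    """Return one list of (start, end) task intervals per worker."""
    # Heap of (time at which the worker becomes free, worker id).
    free_at = [(0, worker) for worker in range(n_workers)]
    heapq.heapify(free_at)
    timelines = [[] for _ in range(n_workers)]
    for first in range(0, len_iterable, chunksize):
        # The last chunk may hold fewer taskels than chunksize.
        n_taskels = min(chunksize, len_iterable - first)
        start, worker = heapq.heappop(free_at)
        end = start + n_taskels * taskel_time
        timelines[worker].append((start, end))
        heapq.heappush(free_at, (end, worker))
    return timelines


for worker, timeline in enumerate(distribute(2, 5, 2)):
    print(worker, timeline)
```

The returned intervals play the role of the worker-mapped timestamps mentioned above. Anything an Overhead Model would add, such as gaps between tasks, is deliberately absent.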

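For obtaining a complete PM, the DM is extended with an Overhead Model (OM), representing various forms of Parallelization Overhead (PO). Such a model needs to be calibrated for each node individually (hardware- and OS-dependencies). How many forms of overhead are represented in an OM is left open, so multiple OMs with varying degrees of complexity can exist. Which level of accuracy the implemented OM needs is determined by the overall weight of PO for the specific computation. Shorter taskels lead to a higher weight of PO, which in turn requires a more precise OM if we were attempting to predict Parallelization Efficiencies (PE).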

6.2 Parallel Schedule (PS)

The Parallel Schedule is a two-dimensional representation of the parallel computation, where the x-axis represents time and the y-axis represents a pool of parallel workers. The number of workers and the total computation time mark the extent of a rectangle, into which smaller rectangles are drawn. These smaller rectangles represent atomic units of work (taskels).

Below you find the visualization of a PS drawn with data from the DM of Pool's chunksize-algorithm for the Dense Scenario.

The x-axis is sectioned into equal units of time, where each unit stands for the computation time a taskel requires.
The y-axis is divided into the number of worker-processes the pool uses.
A taskel here is displayed as the smallest cyan-colored rectangle, put into a timeline (a schedule) of an anonymized worker-process.
A task is one or multiple taskels in a worker-timeline continuously highlighted with the same hue.
Idling time units are represented through red colored tiles.
The Parallel Schedule is partitioned into sections. The last section is the tail-section.

The names for the composed parts can be seen in the picture below.

In a complete PM including an OM, the Idling Share is not limited to the tail, but also comprises space between tasks and even between taskels.

6.3 Efficiencies

The Models introduced above allow quantifying the rate of worker-utilization. We can distinguish:

Distribution Efficiency (DE) - calculated with help of a DM (or a simplified method for the Dense Scenario).
Parallelization Efficiency (PE) - either calculated with help of a calibrated PM (prediction) or calculated from meta-data of real computations.

It's important to note that calculated efficiencies do not automatically correlate with faster overall computation for a given parallelization problem. Worker-utilization in this context only distinguishes between a worker having a started, yet unfinished taskel and a worker not having such an "open" taskel. That means possible idling during the time span of a taskel is not registered.

All above mentioned efficiencies are basically obtained by calculating the quotient of the division Busy Share / Parallel Schedule. The difference between DE and PE comes with the Busy Share occupying a smaller portion of the overall Parallel Schedule for the overhead-extended PM.
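The quotient can be sketched in a few lines. The example assumes the schedule is available as one list of (start, end) task intervals per worker, which is a hypothetical data layout chosen for illustration; the function name distribution_efficiency is hypothetical as well.

```python
# Hypothetical helper: expects one list of (start, end) intervals per worker.

def distribution_efficiency(timelines):
    """Return Busy Share / Parallel Schedule for the given timelines."""
    busy_share = sum(end - start for tl in timelines for start, end in tl)
    # The x-axis extent is set by the task that finishes last.
    total_time = max(end for tl in timelines for _, end in tl)
    parallel_schedule = len(timelines) * total_time
    return busy_share / parallel_schedule


# Two workers, three time units: 5 of the 6 available units are busy.
example = [[(0, 2), (2, 3)], [(0, 2)]]
print(distribution_efficiency(example))
```

Fed with intervals from a model without overhead this gives a DE. Fed with timestamps from real computations, or from a calibrated PM, the same quotient gives a PE.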

From here on, this answer only discusses a simple method to calculate DE for the Dense Scenario. This is adequate for comparing different chunksize-algorithms, since...

... the DM is the part of the PM, which changes with different chunksize-algorithms employed.
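... the Dense Scenario with equal computation durations per taskel depicts a "stable state", for which these time spans drop out of the equation. Any other scenario would just lead to random results, since the ordering of taskels would matter.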

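6.3.1 Absolute Distribution Efficiency (ADE)

This basic efficiency can be calculated in general by dividing the Busy Share by the whole potential of the Parallel Schedule:

Absolute Distribution Efficiency (ADE) = Busy Share / Parallel Schedule

For the Dense Scenario, the simplified calculation-code looks like this: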

# mp_utils.py

def calc_ade(n_workers, len_iterable, n_chunks, chunksize, last_chunk):
    """Calculate Absolute Distribution Efficiency (ADE).

    `len_iterable` is not used, but contained to keep a consistent signature
    with `calc_rde`.
    """
    if n_workers == 1:
        return 1

    potential = (
        ((n_chunks // n_workers + (n_chunks % n_workers > 1)) * chunksize)
        + (n_chunks % n_workers == 1) * last_chunk
    ) * n_workers

    n_full_chunks = n_chunks - (chunksize > last_chunk)
    taskels_in_regular_chunks = n_full_chunks * chunksize
    real = taskels_in_regular_chunks + (chunksize > last_chunk) * last_chunk
    ade = real / potential

    return ade

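If there is no Idling Share, the Busy Share will be equal to the Parallel Schedule, hence we get an ADE of 100%. In our simplified model, this is a scenario where all available processes are busy through the whole time needed for processing all tasks. In other words, the whole job gets effectively parallelized to 100 percent.

But why do I keep referring to PE as absolute PE here?

To comprehend that, we have to consider a possible case for the chunksize (cs) which ensures maximal scheduling flexibility (also, the number of Highlanders there can be. Coincidence?):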

__________________________________~ 1 ~__________________________________

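If we, for example, have four worker-processes and 37 taskels, there will be idling workers even with chunksize=1, just because n_workers=4 is not a divisor of 37. The remainder of dividing 37 by 4 is 1. This single remaining taskel will have to be processed by a sole worker, while the remaining three are idling.

Likewise, there will still be one idling worker with 39 taskels, as pictured below.

When you compare the upper Parallel Schedule for chunksize=1 with the lower version for chunksize=3, you will notice that the upper Parallel Schedule is smaller, its timeline on the x-axis shorter. It should become obvious now how bigger chunksizes can, unexpectedly, also lead to increased overall computation times, even for Dense Scenarios.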
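The two schedules compared above can be approximated as text rows. This is a rough sketch under the assumptions of the Dense Scenario without overhead, in which handing chunks out round-robin is equivalent to giving each chunk to the next free worker. The function name render_schedule is hypothetical.

```python
# Hypothetical helper: text rendering of a Parallel Schedule for the
# Dense Scenario without overhead ('#' = busy time unit, '.' = idle).

def render_schedule(n_workers, len_iterable, chunksize):
    """Return one text row per worker."""
    busy = [0] * n_workers
    for i, first in enumerate(range(0, len_iterable, chunksize)):
        # The last chunk may hold fewer taskels than chunksize.
        busy[i % n_workers] += min(chunksize, len_iterable - first)
    width = max(busy)  # length of the x-axis in taskel time units
    return ['#' * units + '.' * (width - units) for units in busy]


for chunksize in (1, 3):
    print(f"chunksize={chunksize}")
    print("\n".join(render_schedule(4, 39, chunksize)))
```

For 39 taskels on four workers, chunksize=1 gives rows of 10 time units with a single idle unit, while chunksize=3 stretches the schedule to 12 time units with three idle units on each of three workers.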

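But why not just use the length of the x-axis for efficiency calculations?

Because the overhead is not contained in this model. It will be different for both chunksizes, hence the x-axis is not really directly comparable. The overhead can still lead to a longer total computation time, as shown in case 2 of the figure below.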

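6.3.2 Relative Distribution Efficiency (RDE)

The ADE value does not contain the information whether a better distribution of taskels is possible with chunksize set to 1. Better here still means a smaller Idling Share.

To get a DE value adjusted for the maximum possible DE, we have to divide the considered ADE by the ADE we get for chunksize=1.

Relative Distribution Efficiency (RDE) = ADE_cs_x / ADE_cs_1

Here is how this looks in code: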

# mp_utils.py

def calc_rde(n_workers, len_iterable, n_chunks, chunksize, last_chunk):
    """Calculate Relative Distribution Efficiency (RDE)."""
    ade_cs1 = calc_ade(
        n_workers, len_iterable, n_chunks=len_iterable,
        chunksize=1, last_chunk=1
    )
    ade = calc_ade(n_workers, len_iterable, n_chunks, chunksize, last_chunk)
    rde = ade / ade_cs1

    return rde

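RDE, as defined here, is in essence a tale about the tail of a Parallel Schedule. RDE is influenced by the maximum effective chunksize contained in the tail. (This tail can be of x-axis length chunksize or last_chunk.) This has the consequence that RDE naturally converges to 100% (even) for all sorts of "tail-looks", as shown in the figure below.

A low RDE ...

is a strong hint for optimization potential.
naturally gets less likely for longer iterables, because the relative tail-portion of the overall Parallel Schedule shrinks.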
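The shrinking influence of the tail can be checked with a small, self-contained sketch. It does not reuse the functions from mp_utils.py; instead it derives the ADE from per-worker busy time units, assuming a Dense Scenario without overhead. The names ade and rde are hypothetical stand-ins for illustration.

```python
# Hypothetical stand-ins for illustration; Dense Scenario, no overhead.

def ade(n_workers, len_iterable, chunksize):
    """Return the ADE derived from per-worker busy time units."""
    busy = [0] * n_workers
    for i, first in enumerate(range(0, len_iterable, chunksize)):
        # Equal taskel durations let chunks be handed out round-robin.
        busy[i % n_workers] += min(chunksize, len_iterable - first)
    return sum(busy) / (n_workers * max(busy))


def rde(n_workers, len_iterable, chunksize):
    """Return the ADE for chunksize relative to the ADE for chunksize=1."""
    return (ade(n_workers, len_iterable, chunksize)
            / ade(n_workers, len_iterable, 1))


for len_iterable in (39, 399):
    print(len_iterable, round(rde(4, len_iterable, 3), 3))
```

With four workers and chunksize=3, the RDE is roughly 0.83 for 39 taskels and roughly 0.98 for 399 taskels: the tail is the same size in both cases, but it weighs less in the longer schedule.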

Please find Part II of this answer here.
