How to create a pivot table on an extremely large DataFrame in pandas

Question

You can do the appending with HDF5/pytables. This keeps the data out of RAM.

import pandas as pd

store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)  # written to disk, not accumulated in memory

Now you can read it in as a DataFrame in one go (assuming this DataFrame fits in memory):

df = store['df']

You can also query the store to get only subsections of the DataFrame.
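As a rough illustration of such a query, here is a minimal sketch. It assumes PyTables is installed; the column names `key` and `value`, the sample data, and the temporary file path are made up for the example. Columns listed in `data_columns` at append time can be filtered on disk with `where`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "key": rng.choice(["a", "b", "c"], size=1000),
    "value": rng.random(1000),
})

path = os.path.join(tempfile.mkdtemp(), "store.h5")
with pd.HDFStore(path) as store:
    # Append in pieces; data_columns makes "key" queryable on disk.
    for start in range(0, len(frame), 250):
        store.append("df", frame.iloc[start:start + 250], data_columns=["key"])

    # Only the rows matching the condition are read into memory.
    subset = store.select("df", where="key == 'a'")
    n_rows = store.get_storer("df").nrows
```

Here `subset` holds only the rows whose `key` is `'a'`, while `n_rows` reports the total number of rows stored on disk.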

As an aside, you should also buy more RAM. It's cheap.

Edit: you can groupby/sum from the store iteratively, since this "map-reduces" over the chunks:

# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()

Edit 2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2). Instead you can use reduce with add:

reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))

In Python 3 you must import reduce from functools.
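To see why `add` with `fill_value=0` matters independently of the pandas version, here is a small self-contained sketch. The two in-memory chunks and the column names `key` and `value` are hypothetical stand-ins for what `store.select('df', chunksize=...)` would yield.

```python
from functools import reduce

import pandas as pd

# Hypothetical chunks standing in for store.select('df', chunksize=...).
chunks = [
    pd.DataFrame({"key": ["a", "b"], "value": [1.0, 2.0]}),
    pd.DataFrame({"key": ["b", "c"], "value": [3.0, 4.0]}),
]
partials = [chunk.groupby("key").sum() for chunk in chunks]

# Plain + aligns on the index and yields NaN for keys missing on either side.
naive = partials[0] + partials[1]

# add(fill_value=0) treats a key that is missing from one side as 0.
total = reduce(lambda x, y: x.add(y, fill_value=0), partials)
```

With plain `+`, the groups `a` and `c` each appear in only one chunk and end up as NaN in `naive`, whereas `total` keeps all three groups with their correct sums.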

Perhaps it's more pythonic/readable to write this as:

chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks)  # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)

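If performance is poor, or if there are a large number of new groups, it may be preferable to start res as zeros of the correct size (by getting the unique group keys, e.g. by looping through the chunks), and then add in place.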
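One way this preallocation idea might look is sketched below. It is only an illustration under assumptions: `iter_chunks` is a hypothetical stand-in for `store.select('df', chunksize=...)`, and the column names `key` and `value` are invented. The first pass collects the unique group keys, the second accumulates into a zero-filled frame.

```python
import pandas as pd


def iter_chunks():
    """Hypothetical stand-in for store.select('df', chunksize=...)."""
    yield pd.DataFrame({"key": ["a", "b"], "value": [1.0, 2.0]})
    yield pd.DataFrame({"key": ["b", "c"], "value": [3.0, 4.0]})
    yield pd.DataFrame({"key": ["a", "c"], "value": [5.0, 6.0]})


# Pass 1: collect the unique group keys across all chunks.
keys = set()
for chunk in iter_chunks():
    keys.update(chunk["key"].unique())

# Preallocate the result as zeros with one row per group key.
res = pd.DataFrame(0.0, index=pd.Index(sorted(keys), name="key"), columns=["value"])

# Pass 2: add each chunk's partial sums into the matching rows.
for chunk in iter_chunks():
    partial = chunk.groupby("key").sum()
    res.loc[partial.index, "value"] += partial["value"]
```

This reads the data twice, so it trades extra I/O for avoiding repeated index realignment and reallocation as new groups appear.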

Answer 1