jsonコンテンツインデックスの一意のIDを生成する

Question

スクリプトが長すぎるのは、uuidgen各行で実行しているからです。cksum各プロセスを開始するだけで多くの時間が無駄になります。

5Mライン形式を{"name": "John%d", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}tmpfsファイルシステム上のファイルに配置するには、次のPythonスクリプトが数秒で完了します。

#! /usr/bin/env python3

import hashlib
import sys
for line in sys.stdin:
    print(hashlib.md5(line.rstrip('\n').encode('utf-8')).hexdigest())

実装する：

$ time ./foo.py < input > output
./foo.py < input > output  6.00s user 0.13s system 99% cpu 6.135 total
% wc -l input output
  5000000 input
  5000000 output
 10000000 total

Pythonなので、行をJSONにデコードして各行にIDを挿入することもできます。次のような非効率的なコードもあります。

#! /usr/bin/env python3

import hashlib
import json
import sys
for line in sys.stdin:
    l = line.rstrip('\n').encode('utf-8')
    o = json.loads(line)
    o["id"] = hashlib.md5(l).hexdigest()
    print(json.dumps(o))

1分で完了：

% time ./foo.py < input > output
./foo.py < input > output  42.11s user 0.42s system 99% cpu 42.600 total

% head output 
{"name": "John1", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "2dc573ccb15679f58abfc44ec8169e52"}
{"name": "John2", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "ee0583acaf8ad0e502bf5abd29f37edb"}
{"name": "John3", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "a7352ebb79db8c8fc2cc8758eadd9ea3"}
{"name": "John4", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "2062ad1b67ccdce55663bfd523ce1dfb"}
{"name": "John5", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "5f81325c104c01c3e82abd2190f14bcf"}
{"name": "John6", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "493e0c9656f74ec3616e60886ee38e6a"}
{"name": "John7", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "19af9ef2e20466d0fb0efcf03f56d3f6"}
{"name": "John8", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "2348bd47b20ac6445213254c6a8aa80b"}
{"name": "John9", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "090a521b4a858705dc69bf9c8dca6c19"}
{"name": "John10", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "fc3c699323cbe399e210e4a191f04003"}

私の仕様：

インテル®Core™i7-8700 CPU @ 3.20GHz×12
2666MHz DDR4メモリ

uuidgen私はあなたのスクリプトに基づいて4分で500,000行をかろうじて管理しました。出力を保存するように変更します。

#!/usr/bin/bash

while IFS= read -r line
do
   uuidgen -s --namespace @dns --name "$line"
done < input > uuid

実装する：

% timeout 240 ./foo.sh
% wc -l uuid
522160 uuid

Answer 1

スクリプトが長すぎるのは、uuidgen各行で実行しているからです。cksum各プロセスを開始するだけで多くの時間が無駄になります。

5Mライン形式を{"name": "John%d", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}tmpfsファイルシステム上のファイルに配置するには、次のPythonスクリプトが数秒で完了します。

#! /usr/bin/env python3

import hashlib
import sys
for line in sys.stdin:
    print(hashlib.md5(line.rstrip('\n').encode('utf-8')).hexdigest())

実装する：

$ time ./foo.py < input > output
./foo.py < input > output  6.00s user 0.13s system 99% cpu 6.135 total
% wc -l input output
  5000000 input
  5000000 output
 10000000 total

Pythonなので、行をJSONにデコードして各行にIDを挿入することもできます。次のような非効率的なコードもあります。

#! /usr/bin/env python3

import hashlib
import json
import sys
for line in sys.stdin:
    l = line.rstrip('\n').encode('utf-8')
    o = json.loads(line)
    o["id"] = hashlib.md5(l).hexdigest()
    print(json.dumps(o))

1分で完了：

% time ./foo.py < input > output
./foo.py < input > output  42.11s user 0.42s system 99% cpu 42.600 total

% head output 
{"name": "John1", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "2dc573ccb15679f58abfc44ec8169e52"}
{"name": "John2", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "ee0583acaf8ad0e502bf5abd29f37edb"}
{"name": "John3", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "a7352ebb79db8c8fc2cc8758eadd9ea3"}
{"name": "John4", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "2062ad1b67ccdce55663bfd523ce1dfb"}
{"name": "John5", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "5f81325c104c01c3e82abd2190f14bcf"}
{"name": "John6", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "493e0c9656f74ec3616e60886ee38e6a"}
{"name": "John7", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "19af9ef2e20466d0fb0efcf03f56d3f6"}
{"name": "John8", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "2348bd47b20ac6445213254c6a8aa80b"}
{"name": "John9", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "090a521b4a858705dc69bf9c8dca6c19"}
{"name": "John10", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "fc3c699323cbe399e210e4a191f04003"}

私の仕様：

インテル®Core™i7-8700 CPU @ 3.20GHz×12
2666MHz DDR4メモリ

uuidgen私はあなたのスクリプトに基づいて4分で500,000行をかろうじて管理しました。出力を保存するように変更します。

#!/usr/bin/bash

while IFS= read -r line
do
   uuidgen -s --namespace @dns --name "$line"
done < input > uuid

実装する：

% timeout 240 ./foo.sh
% wc -l uuid
522160 uuid

jsonコンテンツインデックスの一意のIDを生成する

ベストアンサー1

おすすめ記事