指定された列に基づいてCSVから重複エントリを削除する

2024-06-24 • tag-icon

私は次のCSVデータセットで作業しています。

year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK- DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,

これには、削除する必要がある重複エントリがたくさん含まれています（記録の1つのインスタンスを保持）。に基づいてCSVファイルから重複エントリを削除する私はこれを使ってきましたが、sort -u file.csv --o deduped-file.csvこの例では非常に効果的です。

2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,

しかし、同様の例は捉えられていません。

2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,

データは不完全ですが、同じ内容を表します。

指定されたフィールド（年、メーカー、ブランド、シリーズ、バリアントなど）に基づいて重複項目を削除できますか？

ベストアンサー1

最初の5つのフィールドの「キー」を作成し、キーが最初に表示されたときにのみ1行を印刷します。

awk -F, '
  {key = $1 FS $2 FS $3 FS $4 FS $5}
  !seen[key]++ 
' file

year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,

ベストアンサー1

おすすめ記事