Preserving line numbers when processing text with grep

2024-06-11

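I have a demanding and fairly complex procedure for preprocessing text before it is sent to machine-learning software.

In short:

A Bash script goes into a folder where thousands of text files are waiting, opens each one with cat, cleans it up by deleting unwanted lines, and writes a CSV with some of the information to disk for later manual scanning, before the files are sent on to the machine-learning process.

Besides the content, the order in which the words appear is central to the machine-learning process, so preserving the line numbers is very important.

So my approach was to prepend the line number to every line, like this (a single one-liner with many piped commands):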

for file in *.txt
do

cat -v "$file" | nl -nrz -w4 -s$'\t' | .......
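
One detail worth checking at this step: by default nl numbers only non-empty lines, so the numbers can drift away from the real line numbers when a file contains blank lines. The following is a minimal sketch, assuming GNU coreutils nl; the helper name number_lines is hypothetical and not part of the original script. The -ba option asks nl to number every line.

```shell
#!/usr/bin/env bash
# number_lines is a hypothetical helper, not part of the original script.
# -ba   number all lines, including empty ones (the default, -bt, skips them)
# -nrz  right-justified, zero-padded numbers
# -w4   four-digit width
# -s    separator placed between the number and the text (a tab here)
number_lines() {
    nl -ba -nrz -w4 -s$'\t'
}

# Example: the blank second line still consumes number 0002.
printf 'first\n\nthird\n' | number_lines
```

With -ba, a blank input line becomes a line holding only the number and the tab, so a later blank-line filter would need to match that form rather than a completely empty line.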

Then I delete the unwanted lines like this:

 ...... | sed '/^$/d'| grep -vEi 'unsettling|aforementioned|ruled'

Finally, I keep two lines (the match and the line after it) for further processing, like this:

........ | grep -A 1 -Ei 'university|institute|trust|college'

The output looks like this (a sample from two files):

file 1.txt
0098  university of Goteborg is downtown and is one of the
0099  most beautiful building you can visit

0123  the institute of Oslo for advanced investigation
0124  is near the central station and keeps

0234  most important college of Munich
0235  and the most acclaimed teachers are

file 2.txt
0023  there is no trust or confidence
0024  in the counselor to accomplish the new

0182  usually the college is visited
0183  every term for the president but
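
Up to this point the numbers survive because each one is part of the line being filtered. Here is a small sketch of the steps so far combined into one function, assuming GNU tools; the name filter_numbered and the sample input are illustrative only.

```shell
#!/usr/bin/env bash
# filter_numbered is a hypothetical name for the steps shown so far:
# number the lines, drop unwanted ones, keep each match plus one line after.
filter_numbered() {
    nl -nrz -w4 -s$'\t' \
        | sed '/^$/d' \
        | grep -vEi 'unsettling|aforementioned|ruled' \
        | grep -A 1 -Ei 'university|institute|trust|college'
}

# Example: line 2 is dropped by grep -v, line 3 matches, line 4 is context.
printf 'intro\nthe ruled part\nold college here\nnext line\nend\n' | filter_numbered
```

Note that GNU grep prints a line containing only -- between non-adjacent groups of context lines, so whatever consumes this output has to tolerate that separator.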

[EDIT] I left out this step; it was on the wrong line. Sorry.

After that, the text is stacked together as a "paragraph", like this:

tr '\n\r' ' '| grep -Eio '.{0,0}university.{0,25}|.{0,0}college.{0,25}'
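
To see where the numbers go missing, it may help to run this step on its own. This is a small sketch, assuming GNU tr and GNU grep; the two numbered lines are copied from the file 1.txt sample, and the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Two numbered lines from the sample output (number, tab, text).
sample=$'0098\tuniversity of Goteborg is downtown and is one of the\n0099\tmost beautiful building you can visit'

# tr replaces every newline and carriage return with a space, so the
# input becomes one long line; grep -o then prints only the matched text.
# The number 0098 sits before the match and is therefore not printed.
flattened=$(printf '%s\n' "$sample" | tr '\n\r' ' ' \
    | grep -Eio '.{0,0}university.{0,25}|.{0,0}college.{0,25}')

printf '%s\n' "$flattened"
```

The printed text starts at the keyword, which suggests that this step, not the loop, is where the line number is discarded.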

[END EDIT]

This output is stored in the variable CLEANED_TXT and fed to a while loop like this:

while read everyline; do

    if [[ -n "${everyline// }" ]]; then
        echo "$file;$linenumber;$everyline" >> output.csv
    fi

done <<< "$CLEANED_TXT"

done  # for every text file

Desired final output

file 1.txt;0098;university of Goteborg
file 1.txt;0123;the institute of Oslo
file 1.txt;0234;college of Munich

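My problem is that the line numbers are lost. The grep runs right before the loop, so it is the last step, and because I need the original line numbers I cannot simply renumber the lines inside the loop.

I am stuck. Any help would be greatly appreciated.

Regards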

Best Answer 1

Update 2: Remove the entire tr ... | grep line (it only messes things up) and change the while loop to this:

while read -r linenumber everyline; do
        # keep only the keyword and up to 25 characters after it
        everyline=$(printf '%s\n' "$everyline" | grep -Eio '.{0,0}university.{0,25}|.{0,0}college.{0,25}')
        if [[ -n "$everyline" ]]; then
            echo "$file;$linenumber;$everyline" >> output.csv
        fi
done

This fills $linenumber with the correct value and matches the words in the right place.

file1.txt;0098;university of Goteborg is downtown
file1.txt;0234;college of Munich
file1.txt;0182;college is visited
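
For experimenting with the loop in isolation, it can be wrapped in a function that reads numbered lines on standard input. This is only a sketch: the name extract_matches is hypothetical, it prints to standard output instead of appending to output.csv, and the .{0,0} prefix of the pattern is left out because it matches nothing.

```shell
#!/usr/bin/env bash
# extract_matches is a hypothetical wrapper around the loop above.
# It reads "number<TAB>text" lines on stdin and prints
# "file;number;match" for every line containing a keyword.
extract_matches() {
    local file=$1 linenumber everyline match
    # read splits on the first run of whitespace: the number goes to
    # linenumber and the rest of the line to everyline.
    while read -r linenumber everyline; do
        match=$(printf '%s\n' "$everyline" \
            | grep -Eio 'university.{0,25}|college.{0,25}')
        if [[ -n "$match" ]]; then
            printf '%s;%s;%s\n' "$file" "$linenumber" "$match"
        fi
    done
}

# Example input shaped like the output of grep -A 1, including the
# "--" separator that grep prints between groups of context lines.
printf '0234\tmost important college of Munich\n0235\tand the most acclaimed teachers are\n--\n' \
    | extract_matches 'file1.txt'
```

The separator line ends up with an empty text part, so it produces no match and is skipped. A line containing two keywords would yield a multi-line match, which this sketch does not handle.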

But the whole thing is so messy that you would be better off rewriting it in a language like perl or awk.
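
As a rough illustration of what such a rewrite could look like, here is a minimal awk sketch. It is a simplification under stated assumptions, not the answerer's code: it covers only the university/college extraction plus the blank-line and unwanted-word filters, it ignores the one-line context of grep -A 1, and the function name extract_with_awk is hypothetical. awk's built-in FNR supplies the original line number, so no nl step is needed.

```shell
#!/usr/bin/env bash
# extract_with_awk is a hypothetical name; the keyword lists are copied
# from the grep commands in the question.
extract_with_awk() {
    awk '
        { line = tolower($0) }                             # copy used for case-insensitive tests
        line ~ /^[[:space:]]*$/ { next }                   # skip blank lines
        line ~ /unsettling|aforementioned|ruled/ { next }  # skip unwanted lines
        match(line, /university|college/) {
            # FNR is the line number within the current input file.
            # Keep the keyword plus at most 25 following characters.
            printf "%s;%04d;%s\n", FILENAME, FNR, substr($0, RSTART, RLENGTH + 25)
        }
    ' "$@"
}

# Possible usage (assumption): extract_with_awk *.txt >> output.csv
```

Because the filters run inside one awk program, the line number never has to travel through a pipeline as text, which is the part that went wrong in the original approach.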
