Find the n most common words in a file using a stop word list on the command line

Question

Consider the following test file:

$ cat text.txt
this file has "many" words, some
with punctuation.  some repeat,
many do not.

To get the word counts:

$ grep -oE '[[:alpha:]]+' text.txt | sort | uniq -c | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 this
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

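How does it work?

grep -oE '[[:alpha:]]+' text.txt

This returns every word, one per line, with spaces and punctuation stripped.

sort

This sorts the words alphabetically.

uniq -c

This counts the occurrences of each word. (uniq only collapses adjacent identical lines, so its input must be sorted first.)

sort -nr

This sorts the output numerically in descending order, so the most common words appear at the top.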
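The pipeline above lists every word, while the title asks for only the n most common ones. A minimal sketch of one way to add that limit is to append head to the same pipeline. The function name top_words and the default of 10 are assumptions made for illustration, not part of the original answer.

```shell
# top_words FILE [N]: print the N most frequent words in FILE.
# Hypothetical helper; N defaults to 10 (an arbitrary choice).
top_words() {
    grep -oE '[[:alpha:]]+' "$1" |  # one word per line
        sort |                      # bring identical words together
        uniq -c |                   # count each run of identical words
        sort -nr |                  # highest count first
        head -n "${2:-10}"          # keep only the first N lines
}
```

For example, `top_words text.txt 3` would print the three most frequent words. When several words share the same count, which of them makes the cut depends on how sort orders the ties.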

Handling mixed case

Consider the following mixed-case test file:

$ cat Text.txt
This file has "many" words, some
with punctuation.  Some repeat,
many do not.

If we want some and Some to be counted as the same word:

$ grep -oE '[[:alpha:]]+' Text.txt | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 This
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

Here we added the -f option to sort so that it ignores case, and the -i option to uniq so that it ignores case as well.
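With sort -f and uniq -i, each group is reported under whichever spelling uniq happens to see first, and that can depend on the sort order in effect. If lowercase output is acceptable, an alternative sketch is to fold the words to lowercase with tr before counting. This is a swapped-in technique rather than the approach shown above, and the function name count_words_ci is an assumption.

```shell
# count_words_ci FILE: count words in FILE, treating case variants as one word.
# Hypothetical helper; every word is reported in lowercase.
count_words_ci() {
    grep -oE '[[:alpha:]]+' "$1" |
        tr '[:upper:]' '[:lower:]' |  # fold to lowercase before grouping
        sort |
        uniq -c |
        sort -nr
}
```

Because the case is normalized up front, plain sort and uniq -c are enough, and This would be reported as this.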

Excluding stop words

Suppose we want to exclude the following stop words from the count:

$ cat stopwords 
with
not
has
do

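To remove those words, we add a grep -v step to the pipeline. Here -f stopwords reads the patterns from the file, -F treats them as fixed strings, -w matches whole words only, and -v inverts the match so that matching lines are dropped: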

$ grep -oE '[[:alpha:]]+' Text.txt | grep -vwFf stopwords | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 This
      1 repeat
      1 punctuation
      1 file
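Putting the pieces together, the following is a rough sketch of a single helper that filters stop words, ignores case, and keeps only the top n words. Two things here go beyond the commands above and should be read as assumptions: the function name top_words_filtered with its default of 10, and the extra -i on the filtering grep, which is added so that a capitalized With is also dropped when the stop word file lists with.

```shell
# top_words_filtered FILE STOPWORDS [N]
# Hypothetical helper: print the N most frequent words in FILE (default 10),
# skipping any word listed in STOPWORDS (one word per line).
top_words_filtered() {
    grep -oE '[[:alpha:]]+' "$1" |
        grep -viwFf "$2" |   # drop stop words; -i is an addition to ignore case
        sort -f |            # case-insensitive sort so variants are adjacent
        uniq -ic |           # case-insensitive count
        sort -nr |           # highest count first
        head -n "${3:-10}"   # keep only the first N lines
}
```

For example, `top_words_filtered Text.txt stopwords 2` would print the two most frequent words that are not in the stop word list.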
