A single shell command to find all n-grams in a text

Answer 1

A (mostly) sed solution:

cat "$@" |
    tr -cs -- '._[:alpha:]' '[\n*]' |
    sed -n  -e 'h; :ms' \
            -e 'p; :ss' \
                -e 's/\([[:lower:]]\)[[:upper:]][[:lower:]]*$/\1/p; t ss' \
                -e 's/\([[:lower:]]\)[[:upper:]][[:upper:]]*$/\1/p; t ss' \
                -e 's/\([[:upper:]]\)[[:upper:]][[:lower:]]\+$/\1/p; t ss' \
                -e 's/[._][[:alpha:]][[:lower:]]*$//p; t ss' \
                -e 's/[._][[:upper:]]\+$//p; t ss' \
            -e 'g' \
            -e 's/^[[:upper:]]\?[[:lower:]]\+\([[:upper:]]\)/\1/; t mw' \
            -e 's/^[[:upper:]]\+\([[:upper:]][[:lower:]]\)/\1/; t mw' \
            -e 's/^[[:alpha:]][[:lower:]]*[._]//; t mw' \
            -e 's/^[[:upper:]]\+[._]//; t mw' \
            -e 'b' \
            -e ':mw; h; b ms'

The algorithm is:

for each compound word (e.g., “FOO_BAR_test”) in the input
do
    repeat
        print what you’ve got
        repeat
            remove a small word from the end (e.g., “FOO_BAR_test” → “FOO_BAR”) and print what’s left
        until you’re down to the last one (e.g., “FOO_BAR_test” → “FOO”)
        go back to what you had at the beginning of the above loop
          and remove a small word from the beginning
          (e.g., “FOO_BAR_test” → “BAR_test”) ... but don’t print anything
    until you’re down to the last one (e.g., “FOO_BAR_test” → “test”)
end for loop
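
As a cross-check of the pseudocode above, here is a minimal sketch of the same two nested loops written as plain shell loops, with sed used only to strip one small word at a time. The names `strip_last`, `strip_first`, and `ngrams` are hypothetical helpers invented for this illustration, and GNU sed is assumed because the patterns use `\+` and `\?`. It starts one sed process per step, so it is far slower than the single sed script and is meant only to make the control flow visible.

```shell
# Assumes GNU sed (\+ and \? are GNU extensions to basic regular expressions).
# strip_last, strip_first and ngrams are hypothetical helper names.

# Remove one small word from the END of the line on stdin.
# A bare "t" jumps to the end of the script after the first rule that matches.
strip_last() {
    sed -e 's/\([[:lower:]]\)[[:upper:]][[:lower:]]*$/\1/; t' \
        -e 's/\([[:lower:]]\)[[:upper:]][[:upper:]]*$/\1/; t' \
        -e 's/\([[:upper:]]\)[[:upper:]][[:lower:]]\+$/\1/; t' \
        -e 's/[._][[:alpha:]][[:lower:]]*$//; t' \
        -e 's/[._][[:upper:]]\+$//'
}

# Remove one small word from the BEGINNING of the line on stdin.
strip_first() {
    sed -e 's/^[[:upper:]]\?[[:lower:]]\+\([[:upper:]]\)/\1/; t' \
        -e 's/^[[:upper:]]\+\([[:upper:]][[:lower:]]\)/\1/; t' \
        -e 's/^[[:alpha:]][[:lower:]]*[._]//; t' \
        -e 's/^[[:upper:]]\+[._]//'
}

# Print every run of consecutive small words in one compound word.
ngrams() {
    word=$1
    while :; do
        cur=$word
        while :; do                      # inner loop: shrink from the end
            printf '%s\n' "$cur"
            next=$(printf '%s\n' "$cur" | strip_last)
            [ "$next" = "$cur" ] && break
            cur=$next
        done
        next=$(printf '%s\n' "$word" | strip_first)
        [ "$next" = "$word" ] && break   # only one small word left
        word=$next                       # outer loop: shrink from the start
    done
}

ngrams FOO_BAR_test
```

Each helper leaves its input unchanged when no rule matches, which is how the loops detect that they are down to the last small word.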

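Details:

`cat "$@"` is a UUOC (useless use of `cat`). I generally try to avoid those, but while you can write `tr args < file`, you can't pass multiple files directly to `tr`.
`tr -cs -- '._[:alpha:]' '[\n*]'` breaks a line containing several compound words into separate lines; for example,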
```
I_amAManTest you_haveAHouse FOO_BAR_test
```
becomes
```
I_amAManTest
you_haveAHouse
FOO_BAR_test
```
so that sed can process one compound word at a time.
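
To illustrate why `cat "$@"` is there, the following sketch feeds two files through the same `cat | tr` stage. The function name `split_words` and the temporary file names are assumptions made up for this example; only the `cat` and `tr` invocations come from the script.

```shell
# split_words is a hypothetical helper; it is the first two pipeline stages
# of the script. With no arguments, cat reads standard input instead.
split_words() {
    cat "$@" | tr -cs -- '._[:alpha:]' '[\n*]'
}

# Two throwaway input files (names are arbitrary).
tmpdir=$(mktemp -d)
printf 'I_amAManTest you_haveAHouse\n' > "$tmpdir/a.txt"
printf 'FOO_BAR_test\n' > "$tmpdir/b.txt"

# tr cannot take file names itself, so cat concatenates them first.
split_words "$tmpdir/a.txt" "$tmpdir/b.txt"

rm -r "$tmpdir"
```

Every run of characters other than letters, `.`, and `_` is squeezed into a single newline, so digits and punctuation also act as separators between compound words.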
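`sed -n`: don't print anything automatically; print only when a command says so.
`-e`: specifies that what follows is an **e**xpression that is part of the sed script.
`h`: copy the pattern space to the hold space.
`:ms`: a label (main loop start).
`p`: print.
`:ss`: a label (secondary loop start).
The following commands remove a small word from the end of the compound word and, if successful, print the result and jump back to the start of the secondary loop.
- `s/\([[:lower:]]\)[[:upper:]][[:lower:]]*$/\1/p; t ss`: changes “nTest” to “n”.
- `s/\([[:lower:]]\)[[:upper:]][[:upper:]]*$/\1/p; t ss`: changes “mOK” to “m”.
- `s/\([[:upper:]]\)[[:upper:]][[:lower:]]\+$/\1/p; t ss`: changes “AMan” to “A”.
- `s/[._][[:alpha:]][[:lower:]]*$//p; t ss`: deletes “_am” (replaces it with nothing).
- `s/[._][[:upper:]]\+$//p; t ss`: deletes “_BAR” (replaces it with nothing).

That's the end of the secondary loop.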
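
The secondary loop can be tried on its own. The sketch below keeps only the `p; :ss` part and the five end-stripping substitutions; the wrapper name `strip_suffixes` is a hypothetical label for this demonstration, and GNU sed is assumed (for `\+`).

```shell
# strip_suffixes is a hypothetical name for the secondary loop in isolation.
# Assumes GNU sed. For each input line: print it, then keep removing one
# small word from the end, printing after every successful removal.
strip_suffixes() {
    sed -n  -e 'p; :ss' \
            -e 's/\([[:lower:]]\)[[:upper:]][[:lower:]]*$/\1/p; t ss' \
            -e 's/\([[:lower:]]\)[[:upper:]][[:upper:]]*$/\1/p; t ss' \
            -e 's/\([[:upper:]]\)[[:upper:]][[:lower:]]\+$/\1/p; t ss' \
            -e 's/[._][[:alpha:]][[:lower:]]*$//p; t ss' \
            -e 's/[._][[:upper:]]\+$//p; t ss'
}

printf 'I_amAManTest\n' | strip_suffixes
# I_amAManTest
# I_amAMan
# I_amA
# I_am
# I
```

Because every substitution is followed by `t ss`, the first rule that matches wins and the list is retried from the top, so at most one small word is removed per pass.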
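`g`: copy the hold space to the pattern space (go back to what we had at the start of the loop above).
The following commands remove a small word from the beginning of the compound word and, if successful, jump to the end of the main loop (mw = main loop wrap-up).
- `s/^[[:upper:]]\?[[:lower:]]\+\([[:upper:]]\)/\1/; t mw`: changes “amA” to “A” and “ManT” to “T”.
- `s/^[[:upper:]]\+\([[:upper:]][[:lower:]]\)/\1/; t mw`: changes “AMa” to “Ma”.
- `s/^[[:alpha:]][[:lower:]]*[._]//; t mw`: deletes “I_” and “you_” (replaces them with nothing).
- `s/^[[:upper:]]\+[._]//; t mw`: deletes “FOO_” (replaces it with nothing).

Each of the substitute commands above, if successful (that is, if it found something to match), jumps to the main loop wrap-up (below). If none of them matches and we get here, the pattern space contains only a single small word, so we're done.
`b`: branch (jump) to the end of the sed script; that is, finish processing the current compound word.
`:mw`: the label for the main loop wrap-up.
`h`: copy the pattern space to the hold space, ready for the next iteration of the main loop.
`b ms`: jump to the start of the main loop.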
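
The main loop can be exercised separately as well. This sketch is a simplification rather than the script's exact structure: with no secondary loop to modify the pattern space, the `h`/`g` save and restore steps are unnecessary, so each substitution simply jumps straight back to `:ms`. The name `strip_prefixes` is hypothetical, and GNU sed is assumed (for `\+` and `\?`).

```shell
# strip_prefixes is a hypothetical name for the main loop in isolation.
# Assumes GNU sed. For each input line: print it, remove one small word
# from the beginning, and repeat until only one small word is left.
strip_prefixes() {
    sed -n  -e ':ms' \
            -e 'p' \
            -e 's/^[[:upper:]]\?[[:lower:]]\+\([[:upper:]]\)/\1/; t ms' \
            -e 's/^[[:upper:]]\+\([[:upper:]][[:lower:]]\)/\1/; t ms' \
            -e 's/^[[:alpha:]][[:lower:]]*[._]//; t ms' \
            -e 's/^[[:upper:]]\+[._]//; t ms'
}

printf 'I_amAManTest\n' | strip_prefixes
# I_amAManTest
# amAManTest
# AManTest
# ManTest
# Test
```

These five lines are the starting points from which the full script runs the secondary loop, which is why it saves each of them in the hold space first.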

This produces the requested output. Unfortunately, it's in a different order; I can fix that if it matters.

$ echo "I_amAManTest you_haveAHouse FOO_BAR_test" | ./myscript
I_amAManTest
I_amAMan
I_amA
I_am
I
amAManTest
amAMan
amA
am
AManTest
AMan
A
ManTest
Man
Test
you_haveAHouse
you_haveA
you_have
you
haveAHouse
haveA
have
AHouse
A
House
FOO_BAR_test
FOO_BAR
FOO
BAR_test
BAR
test
