Search and replace (excluding quotes)

Question

Edit: My solution was complete overkill. I don't know what I was thinking. This problem can be solved with a very simple regular expression. See the solution submitted by Joao.

The Python shlex library almost does this out of the box. Here is an example script:

#!/usr/bin/env python2
# -*- coding: ascii -*-
"""tokens.py"""

import sys
import shlex

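# Read the whole file and split it into shell-like tokens.
# posix=False keeps the quote characters in each quoted token.
with open(sys.argv[1], 'r') as textfile:
    text = textfile.read()
    for token in shlex.split(text, posix=False):
        print(token)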

For example, if your data is in the file data.txt, you could run it like this:

python tokens.py data.txt

Here is the output it produces:

This
is
one
is
text
and
some
spaces.
This
should
be
2nd
line.
However,
this
space
〜サイ
"quote should not change"
.
Last
line
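
To see where shlex draws the token boundary around a quoted string, a one-line input is enough. This is a minimal sketch; the sample string is made up for illustration and is not the data from the question.

```python
import shlex

# Hypothetical sample line, not the original data.
sample = 'this space "quote should not change". Last'

# In non-POSIX mode the quotes are kept, and a quoted token
# ends at its closing quote, so the period is split off.
tokens = shlex.split(sample, posix=False)
for token in tokens:
    print(token)
```

The quoted phrase comes back as a single token with its quotes intact, and the trailing period becomes a token of its own.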

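Notice that the period ends up on a separate line. This is because shlex ends the token at the closing quotation mark. The example you gave doesn't seem to require multi-line strings or escape characters, so a small hand-written tokenizer is enough. This is what I came up with: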

#!/usr/bin/env python2
# -*- coding: ascii -*-
"""tokens.py"""

import sys

def tokenize(string):
    """Break a string into tokens using white-space as the only delimiter
    while respecting double-quoted substrings and keeping the double-quote
    characters in the resulting token."""

    # List to store the resulting list of tokens
    tokens = []

    # List to store characters as we build the current token
    token = []

    # Flag to keep track of whether or not
    # we're currently in a quoted substring
    quoted = False

    # Iterate through the string one character at a time
    for character in string:

        # If the character is a space then we either end the current
        # token (if quoted is False) or add the space to the current
        # token (if quoted is True)
        if character == ' ':
            if quoted:
                token.append(character)
            elif token:
                tokens.append(''.join(token))
                token = []

        # A double-quote character is always added to the token
        # It also toggles the 'quoted' flag
        elif character == '"':
            token.append(character)
            quoted = not quoted

        # All other characters are added to the token
        else:
            token.append(character)

    # Whatever is left at the end becomes another token
    if token:
        tokens.append(''.join(token))

    # Return the resulting list of strings
    return tokens

if __name__ == "__main__":
    # Read in text from a file and print out the resulting tokens.
    # Newlines are replaced with spaces so that they also separate tokens.
    with open(sys.argv[1], 'r') as textfile:
        text = textfile.read().replace("\n", " ")
        for token in tokenize(text):
            print(token)

This produces exactly the result you asked for. You could easily implement this algorithm in another language such as Perl. I just happen to like Python.
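
The regular-expression approach mentioned in the edit note at the top is not shown in this answer. As a rough sketch of what such a pattern could look like (the pattern, the name `tokenize_re`, and the sample string are my own assumptions, not necessarily Joao's solution), a token can be described as a run of double-quoted segments and non-whitespace characters:

```python
import re

# Assumed pattern: one or more of either a complete "..." segment
# or a single non-whitespace character. The quoted alternative is
# tried first, so spaces inside balanced quotes stay in the token.
TOKEN_RE = re.compile(r'(?:"[^"]*"|\S)+')

def tokenize_re(text):
    """Return whitespace-separated tokens, keeping quoted substrings whole."""
    return TOKEN_RE.findall(text)

# Hypothetical sample line, not the original data.
sample = 'However, this space "quote should not change". Last line'
for token in tokenize_re(sample):
    print(token)
```

With balanced quotes this keeps the period attached to the quoted token. An unbalanced quote is handled differently from the character-by-character tokenizer, so the two are not interchangeable on malformed input.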

Answer 1