Is there a generator version of `string.split()` in Python?

Answer 1

It is highly probable that re.finditer uses fairly minimal memory overhead.

import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

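Assuming my testing methodology was correct, I have confirmed that this takes constant memory in Python 3.2.1. I created a very large string (1 GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have allocated extra memory). This did not result in a noticeable growth of memory (that is, if memory grew at all, it grew by far less than the 1 GB string).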
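The memory test described above can be approximated on a much smaller scale. This is only a sketch, not the measurement the answer was based on: it assumes the standard-library tracemalloc module (which counts allocations made by Python itself) is an acceptable stand-in for watching process memory, and peak_while_iterating is a hypothetical helper name.

```python
import re
import tracemalloc

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

def peak_while_iterating(text):
    """Count the fragments of `text` and report the peak traced allocation in bytes."""
    tracemalloc.start()              # only allocations made after this call are traced
    count = 0
    for _ in split_iter(text):       # plain for loop: fragments are discarded one by one
        count += 1
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return count, peak

text = "word " * 200_000             # about 1 MB; the answer used roughly 1 GB
count, peak = peak_while_iterating(text)
print(count, peak, len(text))
```

If the iteration really runs in constant memory, the reported peak should stay far below len(text) however large the input is made.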

More general version:

In reply to the comment "I fail to see the connection with str.split", here is a more general version:

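import re

def splitStr(string, sep=r"\s+"):
    # sep is interpreted as a regular expression, not as a literal string
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))

    # alternatively, more verbosely (use this INSTEAD of the return above, in a
    # function of its own: a `yield` in this body would turn splitStr into a
    # generator function, and the `return` branches would then produce nothing):
    #
    #     regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
    #     for match in re.finditer(regex, string):
    #         fragment = match.group(1)
    #         yield fragment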

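The idea is that ((?!pat).)* "negates" a group by making it match greedily until the pattern would start to match (lookaheads do not consume the string in the regex finite-state machine). In pseudocode: repeatedly consume (begin-of-string xor {sep}) + as much as possible until we would be able to begin again (or hit end of string).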
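The "negated group" trick can be tried in isolation. The following is a minimal sketch; text_before is a hypothetical helper name that does not appear in the answer, and it assumes pat is a valid regular expression.

```python
import re

def text_before(pat, string):
    """Return the longest prefix of `string` in which `pat` never starts to match."""
    # (?!pat) checks, without consuming anything, that pat does not match here;
    # the following `.` then consumes exactly one character.
    return re.match(f'((?:(?!{pat}).)*)', string).group(1)

print(text_before('END', 'abcENDdef'))            # abc
print(text_before(r'\s+', 'hello   world'))       # hello
print(text_before(',', 'no separator here'))      # no separator here
```

Note that `.` does not match a newline unless re.DOTALL is passed, so a fragment built this way also stops at line breaks.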

Demo:

>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>

>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']

>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']

>>> list(splitStr('.......A...b...c....', r'\.\.\.'))
['', '', '.A', 'b', 'c', '.']

>>> list(splitStr('   A  b  c. '))
['', 'A', 'b', 'c.', '']

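(One should note that str.split has an ugly behavior: it special-cases sep=None by first doing the equivalent of str.strip to remove leading and trailing whitespace. The above purposefully does not do that; see the last demo, which uses the default sep=r"\s+".)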
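The difference is easy to see with the standard library alone. This sketch uses the eager re.split purely for comparison, since it treats leading and trailing separators the same way the regex-based generator above does.

```python
import re

text = '   A  b  c. '

no_sep = text.split()                  # sep=None: runs of whitespace, ends stripped
single_space = text.split(' ')         # explicit sep: every single space splits
by_pattern = re.split(r'\s+', text)    # pattern split: empty strings kept at the ends

print(no_sep)          # ['A', 'b', 'c.']
print(single_space)    # ['', '', '', 'A', '', 'b', '', 'c.', '']
print(by_pattern)      # ['', 'A', 'b', 'c.', '']
```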

(I ran into various bugs (including an internal re.error) when trying to implement this... Negative lookbehind will restrict you to fixed-length delimiters so we don't use that. Almost anything besides the above regex seemed to result in errors with the beginning-of-string and end-of-string edge-cases (e.g. r'(.*?)($|,)' on ',,,a,,b,c' returns ['', '', '', 'a', '', 'b', 'c', ''] with an extraneous empty string at the end; one can look at the edit history for another seemingly-correct regex that actually has subtle bugs.)
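The end-of-string edge case mentioned above can be reproduced directly. This is a small sketch assuming Python 3.7 or later (earlier versions handle empty matches differently); naive_split is a hypothetical name for a wrapper around the flawed pattern.

```python
import re

def naive_split(string):
    # Flawed pattern quoted in the text: after 'c' is matched up to `$`,
    # one more empty match succeeds at the very end of the string.
    return [m.group(1) for m in re.finditer(r'(.*?)($|,)', string)]

got = naive_split(',,,a,,b,c')
expected = ',,,a,,b,c'.split(',')

print(got)         # one trailing '' more than expected
print(expected)
```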

(If you want to implement this yourself for higher performance (although they are heavyweight, regexes most importantly run in C), you'd write some code (with ctypes? not sure how to get generators working with it?), following this pseudocode for fixed-length delimiters: Hash your delimiter of length L. Keep a running hash of the last L characters as you scan the string, using a rolling hash algorithm with O(1) update time. Whenever the hash equals that of your delimiter, manually check whether the past L characters were the delimiter; if so, yield the substring since the last yield. Special-case the beginning and end of the string. This would be a generator version of the textbook algorithm for O(N) text search. Multiprocessing versions are also possible. They might seem overkill, but the question implies that one is working with really huge strings... At that point you might consider crazy things like caching byte offsets if there are few of them, or working from disk with some disk-backed bytestring view object, buying more RAM, etc. etc.)
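The pseudocode above can be sketched in pure Python. This is an illustration of the idea only, not a faster replacement for the regex version: the function name split_rolling_hash, the base 257 and the modulus 2**61 - 1 are all assumptions chosen for the example, and the separator is treated as a literal, non-empty string.

```python
def split_rolling_hash(string, sep):
    """Lazily split `string` on the literal separator `sep` (Rabin-Karp style)."""
    if not sep:
        raise ValueError("empty separator")
    L = len(sep)
    BASE, MOD = 257, (1 << 61) - 1       # arbitrary choices for this sketch
    high = pow(BASE, L - 1, MOD)         # weight of the character leaving the window

    target = 0                           # hash of the separator
    for ch in sep:
        target = (target * BASE + ord(ch)) % MOD

    window = 0                           # hash of the last (up to) L characters
    start = 0                            # where the current fragment begins
    for i, ch in enumerate(string):
        if i >= L:                       # drop the character that slides out
            window = (window - ord(string[i - L]) * high) % MOD
        window = (window * BASE + ord(ch)) % MOD
        begin = i - L + 1                # start index of the current window
        # begin >= start: the window is full and does not overlap the
        # previous separator; the slice comparison rules out hash collisions.
        if begin >= start and window == target and string[begin:i + 1] == sep:
            yield string[start:begin]
            start = i + 1
    yield string[start:]                 # final fragment (possibly empty)

print(list(split_rolling_hash('.......A...b...c....', '...')))
```

For literal separators this should produce the same fragments as str.split with an explicit sep, one at a time. In practice a loop around str.find would likely be quicker in CPython, since the hashing here runs in interpreted code.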
