Learning Regular Expressions [closed]

Question

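The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep other considerations in mind if you're using regular expressions in a C program.

If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn not only how to write and debug your own patterns but also how to understand patterns written by others.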

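Start simple

Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.

Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.

If you've ever used grep on Unix (even if only to search for ordinary-looking strings), you've already been using regular expressions! (The re in grep refers to regular expressions.)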

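Order from the menu

Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.

The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].

Think of character classes as menus: pick just one.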
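As a minimal sketch of the "menu" idea, the snippet below tries the two character classes from this section against a few sample strings. It assumes Python's built-in re module as the host language (the answer itself is language-neutral), and the sample words are made up for illustration.

```python
import re

# [Nn]ick: the class [Nn] picks exactly one of 'N' or 'n', then the literal "ick".
nick = re.compile(r"[Nn]ick")
samples = ["Nick", "nick", "Mick", "NNick"]  # hypothetical test words
nick_hits = [w for w in samples if nick.fullmatch(w)]

# [a-c] is a range: it matches a single character out of 'a', 'b', 'c'.
range_hits = re.findall(r"[a-c]", "abcdef")

print(nick_hits)   # ['Nick', 'nick']
print(range_hits)  # ['a', 'b', 'c']
```

Note that 'NNick' is rejected by fullmatch because the class consumes exactly one character, not one or more.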

Helpful shortcuts

Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).

The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
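The shortcuts can be tried out directly. This is a small sketch assuming Python's re module; the input strings are invented for the example, and on plain ASCII text \d behaves the same as [0-9].

```python
import re

text = "Room 42, floor 7"  # hypothetical input

digits = re.findall(r"\d", text)          # shortcut for a digit
digits_class = re.findall(r"[0-9]", text) # the longhand character class
non_space = re.findall(r"\S+", text)      # \S is the complement of \s
words = re.findall(r"\w+", "snake_case var1!")  # letters, digits, underscore

print(digits)     # ['4', '2', '7']
print(non_space)  # ['Room', '42,', 'floor', '7']
print(words)      # ['snake_case', 'var1']
```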

Once is not enough

From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are

* (zero or more times)
+ (one or more times)
{n} (exactly n times)
{n,} (at least n times)
{n,m} (at least n times but no more than m times)

Putting some of these blocks together, the pattern [Nn]*ick matches all of

ick
Nick
nick
Nnick
nNick
nnick
(and so on)

The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.

A few other useful examples:

[0-9]+ (and its equivalent \d+) matches any non-negative integer
\d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01
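The quantifier examples above can be checked with a short script. This is a sketch assuming Python's re module; the extra candidate strings ('abbc', 'Mick', '2019-1-1') are added here only as counterexamples.

```python
import re

# ? makes the preceding 'b' optional.
optional_hits = [s for s in ["abc", "ac", "abbc"] if re.fullmatch(r"ab?c", s)]

# [Nn]* allows zero or more of 'N' or 'n' before "ick".
candidates = ["ick", "Nick", "nick", "Nnick", "nNick", "nnick", "Mick"]
star_hits = [s for s in candidates if re.fullmatch(r"[Nn]*ick", s)]

# * can match zero times, so this succeeds with an empty match.
empty_match = re.match(r"x*", "abc").group()

# {n} repeats exactly n times; this checks the shape of a date, not its validity.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
date_ok = bool(date_pattern.fullmatch("2019-01-01"))
date_short = bool(date_pattern.fullmatch("2019-1-1"))

print(optional_hits)  # ['abc', 'ac']
print(star_hits)      # every candidate except 'Mick'
print(repr(empty_match), date_ok, date_short)  # '' True False
```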

Grouping

A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.

To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
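To see the difference grouping makes, the sketch below compares 0abc+0 with 0(abc)+0 and then uses capture groups to pull fields out of a date. It assumes Python's re module; the "released ..." string is a made-up input.

```python
import re

without_group = re.compile(r"0abc+0")   # + applies only to 'c'
with_group = re.compile(r"0(abc)+0")    # + applies to the whole group

repeated_c = bool(without_group.fullmatch("0abccc0"))
repeated_abc_plain = bool(without_group.fullmatch("0abcabc0"))
repeated_abc_group = bool(with_group.fullmatch("0abcabc0"))

# Parentheses also capture what they matched, so no index arithmetic is needed.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "released 2019-01-01")
year, month, day = m.groups()

print(repeated_c, repeated_abc_plain, repeated_abc_group)  # True False True
print(year, month, day)  # 2019 01 01
```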

Alternation

Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).

For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
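The scope of | is easiest to see with some surrounding text. In this sketch (Python's re module assumed; the "Hi " prefix is a hypothetical addition), the unparenthesized pattern splits into the two alternatives "Hi Nick" and "nick", which is rarely what is intended.

```python
import re

unscoped = re.compile(r"Hi Nick|nick")   # means: "Hi Nick" OR "nick"
scoped = re.compile(r"Hi (Nick|nick)")   # means: "Hi " then "Nick" OR "nick"

unscoped_bare = bool(unscoped.fullmatch("nick"))
unscoped_greeting = bool(unscoped.fullmatch("Hi nick"))
scoped_greeting = bool(scoped.fullmatch("Hi nick"))
scoped_bare = bool(scoped.fullmatch("nick"))

print(unscoped_bare, unscoped_greeting)  # True False
print(scoped_greeting, scoped_bare)      # True False
```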

Escaping

Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
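A quick sketch of escaping, assuming Python's re module. Raw string literals (the r prefix) are used so that Python itself leaves the backslashes alone and the regex engine sees them as written; the sample sentences are invented.

```python
import re

# Unescaped: \d+ means "one or more digits".
digit_runs = re.findall(r"\d+", "a1 b22")

# Escaped: \\ is a literal backslash, d is a literal 'd', \+ is a literal plus.
literal = re.compile(r"\\d\+")
found = literal.search(r"the pattern \d+ matches digits").group()
no_digits = literal.search("12345")

print(digit_runs)  # ['1', '22']
print(found)       # \d+
print(no_digits)   # None
```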

Greediness

Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.

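For example, say the input is

"Hello," she said, "How are you?"

You might expect ".+" to match only 'Hello,' and will then be surprised to see that it matched from 'Hello' all the way through 'you?'.

To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question, works. It matches the sequence of a literal left parenthesis, followed by one or more characters, and terminated by a right parenthesis.

If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.

(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)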
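The greedy and non-greedy behaviours can be compared side by side. This sketch assumes Python's re module and uses a quoted English sentence of the shape discussed above, plus the '(123) (456)' input.

```python
import re

line = '"Hello," she said, "How are you?"'

# Greedy: .+ runs to the end of the line, then backs off to the last quote.
greedy = re.search(r'".+"', line).group()

# Non-greedy: .+? stops at the first closing quote that lets the pattern succeed.
lazy = re.search(r'".+?"', line).group()

# Escaped parentheses are literal; the inner pair is the capture group.
first_lazy = re.search(r"\((.+?)\)", "(123) (456)").group(1)
first_greedy = re.search(r"\((.+)\)", "(123) (456)").group(1)

print(greedy == line)  # True
print(lazy)            # "Hello,"
print(first_lazy)      # 123
print(first_greedy)    # 123) (456
```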

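Anchors

Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns, where you say "I know what's at the front and back, but give me everything between", is a useful technique.

Say you want to match comments of the form

-- This is a comment --

you'd write ^--\s+(.+)\s+--$.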
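The bookend pattern can be exercised as follows. This is a sketch assuming Python's re module; the two rejected inputs are hypothetical counterexamples showing what the anchors rule out.

```python
import re

comment = re.compile(r"^--\s+(.+)\s+--$")

# The capture group holds everything between the bookends.
body = comment.search("-- This is a comment --").group(1)

# ^ rejects leading text and $ rejects trailing text.
leading = comment.search("x -- This is a comment --")
trailing = comment.search("-- This is a comment -- x")

print(body)               # This is a comment
print(leading, trailing)  # None None
```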

Build your own

Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.

Tools for writing and debugging regexes:

RegExr (for JavaScript)
Perl: YAPE: Regex Explain
Regex Coach (engine backed by CL-PPCRE)
RegexPal (for JavaScript)
Regular Expressions Online Tester
Regex Buddy
Regex 101 (for PCRE, JavaScript, Python, Golang, Java 8)
I Hate Regex
Visual RegExp
Expresso (for .NET)
Rubular (for Ruby)
Regular Expression Library (predefined regexes for common scenarios)
Txt2RE
Regex Tester (for JavaScript)
Regex Storm (for .NET)
Debuggex (visual regex tester and helper)

Books

Free resources

Footnote

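†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java has Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.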
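The footnote's point can be checked in a few lines. This sketch assumes Python's re module, whose re.DOTALL flag plays the same role as Perl's /s switch and Java's Pattern.DOTALL; the two-line input is made up.

```python
import re

text = "first\nsecond"  # hypothetical two-line input

# By default, dot stops at the newline.
default_dot = re.match(r".+", text).group()

# With DOTALL, dot matches the newline too.
dotall = re.match(r".+", text, re.DOTALL).group()

# [\s\S] matches whitespace or non-whitespace, i.e. anything, without a flag.
either_class = re.match(r"[\s\S]+", text).group()

print(repr(default_dot))   # 'first'
print(repr(dotall))        # 'first\nsecond'
print(repr(either_class))  # 'first\nsecond'
```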

