Groups, Backreferences, and Forward References in Regular Expressions

03 June 2022

Reference: https://www.regular-expressions.info/backref.html

Backreferences

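Using a backreference to match a pair of HTML tags
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
matches <B><I>bold italic</I></B> in Testing <B><I>bold italic</I></B> text.
The group captures the tag name, and \1 requires the closing tag to repeat it.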
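A minimal Python sketch of the tag-matching regex, to make the role of \1 concrete. The sample string and the names TAG_PAIR, sample, and match are illustrative assumptions; note that the pattern as written only accepts uppercase tag names.

```python
import re

# Group 1 captures the tag name; \1 forces the closing tag to repeat it.
TAG_PAIR = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>")

sample = "Testing <B><I>bold italic</I></B> text"  # illustrative input
match = TAG_PAIR.search(sample)

print(match.group(0))  # <B><I>bold italic</I></B>
print(match.group(1))  # B
```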

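Backreferences to repeated groups
Each new iteration of a repeated group overwrites the previously saved match, so the position of the quantifier matters. ([abc]+) and ([abc])+ behave very differently on cab: the first regex stores cab in the first backreference, whereas the second stores only b. As a result, ([abc]+)=\1 matches cab=cab, while ([abc])+=\1 does not.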
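The difference between the two quantifier placements can be checked with a short Python sketch. The variable names are illustrative; re.fullmatch is used so that each pattern has to cover the whole string.

```python
import re

whole = re.fullmatch(r"([abc]+)", "cab")      # quantifier inside the group
last_only = re.fullmatch(r"([abc])+", "cab")  # quantifier outside the group

print(whole.group(1))      # cab
print(last_only.group(1))  # b (each iteration overwrote the previous capture)

# Only the first pattern can repeat the full text after the equals sign.
same_text = re.fullmatch(r"([abc]+)=\1", "cab=cab")
repeated_group = re.fullmatch(r"([abc])+=\1", "cab=cab")
last_char = re.fullmatch(r"([abc])+=\1", "cab=b")

print(same_text is not None)       # True
print(repeated_group is not None)  # False
print(last_char is not None)       # True, because \1 holds only "b"
```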

Checking for doubled words: \b(\w+)\s+\1\b matches the the in When editing text, doubled words such as “the the” easily creep in.
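A small Python sketch of the doubled-word check. Collapsing the repetition with sub is an added illustration rather than something the note itself does, and the names DOUBLED, found, and fixed are assumptions.

```python
import re

DOUBLED = re.compile(r"\b(\w+)\s+\1\b")

text = 'When editing text, doubled words such as "the the" easily creep in.'

found = DOUBLED.search(text)
print(found.group(0))  # the the

# Replace each doubled word with a single copy of the captured word.
fixed = DOUBLED.sub(r"\1", text)
print(fixed)
```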

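Backreferences to failed groups
(q?)b\1 matches b: the group takes part in the match with an empty string, so \1 matches the empty string as well.
(q)?b\1 does not match b: the group did not take part in the match at all, so in most flavors \1 fails (JavaScript is an exception).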
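Python's re module follows the behavior described here, which a short sketch can confirm. The variable names are illustrative.

```python
import re

# The group participates and captures an empty string, so \1 matches empty.
empty_capture = re.fullmatch(r"(q?)b\1", "b")

# The optional group is skipped entirely, so \1 has nothing to refer to.
no_capture = re.fullmatch(r"(q)?b\1", "b")

# When the group does capture q, the backreference needs a second q.
with_q = re.fullmatch(r"(q)?b\1", "qbq")

print(repr(empty_capture.group(1)))  # ''
print(no_capture)                    # None
print(with_q.group(1))               # q
```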

Backreferences to nonexistent groups
(one)\7 is an error.
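In Python the error surfaces when the pattern is compiled, not when it is applied. A minimal sketch; the names compiled and message are illustrative, and the exact wording of the message may vary between Python versions.

```python
import re

try:
    re.compile(r"(one)\7")  # only one group exists, so \7 is invalid
    compiled = True
    message = ""
except re.error as exc:
    compiled = False
    message = str(exc)

print(compiled)
print(message)
```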

Forward references (not supported by all flavors)
(\2two|(one))+ matches oneonetwo

Nested references (not supported by all flavors)
(\1two|(one))+ or (?>(\1two|(one)))+ matches oneonetwo
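Python's re module is one of the flavors that accepts neither form: both patterns are rejected at compile time. The sketch below only demonstrates that limitation; the helper compiles is a hypothetical name introduced for illustration.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Forward reference: \2 appears before group 2 has been defined.
forward_ok = compiles(r"(\2two|(one))+")

# Nested reference: \1 appears inside group 1 while it is still open.
nested_ok = compiles(r"(\1two|(one))+")

print(forward_ok)  # False
print(nested_ok)   # False
```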

Backreferences and named groups
Defining a named group: (?<name>group) or (?'name'group)
Referencing a named group: \k<name> or \g<name> or \k'name' or \g'name' or \k{name} or \g{name}
For example, the regex that matches an HTML tag can be written in any of the following ways:
<(?<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</\k<tag>>
<(?<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</\g<tag>>
<(?'tag'[A-Z][A-Z0-9]*)\b[^>]*>.*?</\k<tag>>
<(?'tag'[A-Z][A-Z0-9]*)\b[^>]*>.*?</\g<tag>>
<(?'tag'[A-Z][A-Z0-9]*)\b[^>]*>.*?</\k'tag'>
<(?<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</\k{tag}>
<(?'tag'[A-Z][A-Z0-9]*)\b[^>]*>.*?</\g{tag}>
<(?'tag'[A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>

In Python:
<(?P<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</(?P=tag)>
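A runnable sketch of the Python form, using (?P<tag>...) to define the group and (?P=tag) to refer back to it. The sample string and the names NAMED_TAG and m are illustrative assumptions.

```python
import re

NAMED_TAG = re.compile(r"<(?P<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</(?P=tag)>")

m = NAMED_TAG.search("Testing <B><I>bold italic</I></B> text")

print(m.group(0))      # <B><I>bold italic</I></B>
print(m.group("tag"))  # B
print(m.group(1))      # B (a named group still has a number)
```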

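Multiple groups with the same name
Suppose you want to match a followed by a digit 0..5, or b followed by a digit 4..7, and you only care about the digit. To require that the match be followed by c and exactly the same digit, you can use (?:a(?<digit>[0-5])|b(?<digit>[4-7]))c\k<digit>, which matches a2c2 or b5c5.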
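Python's re module rejects a pattern that defines the same group name twice, so the regex above cannot be used there as written. The sketch below is a workaround under that assumption, not an equivalent feature: it uses two numbered groups and tries both backreferences, relying on the fact that in Python a backreference to a group that did not participate fails. The names SAME_DIGIT and shared_digit are hypothetical.

```python
import re

# Workaround: two numbered groups stand in for the shared name "digit".
SAME_DIGIT = re.compile(r"(?:a([0-5])|b([4-7]))c(?:\1|\2)")

def shared_digit(text):
    """Return the repeated digit if text matches, otherwise None."""
    m = SAME_DIGIT.fullmatch(text)
    if m is None:
        return None
    # Exactly one of the two groups participated in the match.
    return m.group(1) or m.group(2)

print(shared_digit("a2c2"))  # 2
print(shared_digit("b5c5"))  # 5
print(shared_digit("a2c3"))  # None
```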

Relative backreferences
Syntax: \k<-n> or \g<-n> or \k'-n' or \g'-n' or \k-n or \g-n or \k{-n} or \g{-n}
Each of the following matches abcc:
(a)(b)(c)\k<-1>
(a)(b)(c)\g<-1>
(a)(b)(c)\k'-1'
(a)(b)(c)\g'-1'
(a)(b)(c)\k-1
(a)(b)(c)\g-1
(a)(b)(c)\k{-1}
(a)(b)(c)\g{-1}

(a)(b)(c)\k<-3> matches abca
Nested (not supported by all flavors): (a)(b)(c\k<-2>) matches abcb
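Python's re module has no relative backreference syntax, so the sketch below resolves each relative number to an absolute one by hand, only to show which group each reference points at. The translation and the variable names are assumptions made for illustration.

```python
import re

# Three groups are open or closed before the reference:
# -1 counts back to group 3, and -3 counts back to group 1.
last_group = re.fullmatch(r"(a)(b)(c)\3", "abcc")   # stands in for \k<-1>
first_group = re.fullmatch(r"(a)(b)(c)\1", "abca")  # stands in for \k<-3>

# Nested case: inside group 3, -2 counts back to group 2.
nested = re.fullmatch(r"(a)(b)(c\2)", "abcb")       # stands in for \k<-2>

print(last_group is not None)   # True
print(first_group is not None)  # True
print(nested.group(3))          # cb
```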

Lookahead

The negative lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point.
To match a q not followed by a u, negative lookahead provides the solution: q(?!u).

The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign. Positive lookahead works just the same. q(?=u) matches a q that is followed by a u, without making the u part of the match.

If you want to store the match of the regex inside a lookahead, you have to put capturing parentheses around the regex inside the lookahead, like this: (?=(regex)).
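The three lookahead cases above can be tried in Python. This is a minimal sketch; the test words and variable names are illustrative assumptions.

```python
import re

# Negative lookahead: a q that is not followed by a u.
not_before_u = re.search(r"q(?!u)", "Iraq")
no_match = re.search(r"q(?!u)", "quit")

# Positive lookahead: the u must be there but is not part of the match.
before_u = re.search(r"q(?=u)", "quit")

# A capturing group inside the lookahead stores what the lookahead saw.
captured = re.search(r"q(?=(u\w*))", "quit")

print(not_before_u.group(0))  # q
print(no_match)               # None
print(before_u.group(0))      # q
print(captured.group(1))      # uit
```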

Lookbehind

The construct for positive lookbehind is (?<=text): a pair of parentheses, with the opening parenthesis followed by a question mark, “less than” symbol, and an equals sign.
Negative lookbehind is written as (?<!text), using an exclamation point instead of an equals sign.

(?<!a)b matches a “b” that is not preceded by an “a”, using negative lookbehind. It doesn’t match cab, but matches the b (and only the b) in bed or debt.

If you want to find a word not ending with an “s”, you could use \b\w+(?<!s)\b. This is definitely not the same as \b\w+[^s]\b. When applied to John’s, the former matches John and the latter matches John’ (including the apostrophe). The correct regex without using lookbehind is \b\w*[^s\W]\b (star instead of plus, and \W in the character class).
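A short Python sketch that reproduces the lookbehind examples, including the three ways of looking for a word that does not end in s. The variable names are illustrative, and a plain ASCII apostrophe is assumed in the test string.

```python
import re

# Negative lookbehind: b not preceded by a.
in_cab = re.findall(r"(?<!a)b", "cab")
in_bed = re.findall(r"(?<!a)b", "bed")
in_debt = re.findall(r"(?<!a)b", "debt")
print(in_cab, in_bed, in_debt)  # [] ['b'] ['b']

word = "John's"

with_lookbehind = re.findall(r"\b\w+(?<!s)\b", word)
naive_class = re.findall(r"\b\w+[^s]\b", word)
fixed_class = re.findall(r"\b\w*[^s\W]\b", word)

print(with_lookbehind)  # ['John']
print(naive_class)      # ["John'"] (the apostrophe is swallowed)
print(fixed_class)      # ['John']
```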

Lookaround

The fact that lookaround is zero-length automatically makes it atomic.

(?=(\d+))\w+\1 does match 56x56 in 456x56.

Find any word between 6 and 12 letters long containing either “cat”, “dog” or “mouse”: \b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*.
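Both lookaround examples can be checked in Python. In the first, the lookahead is not re-entered during backtracking, so at position 0 the capture stays 456 and the attempt fails; the match is found one position later. The word list in the second part is an illustrative assumption.

```python
import re

atomic = re.search(r"(?=(\d+))\w+\1", "456x56")
print(atomic.group(0))  # 56x56
print(atomic.group(1))  # 56

# Words of 6 to 12 letters that contain cat, dog, or mouse.
WORD = re.compile(r"\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*")

text = "cat doghouse concatenate mousetrap bird"
words = [m.group(0) for m in WORD.finditer(text)]
print(words)  # ['doghouse', 'concatenate', 'mousetrap']
```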

Keep Text out of The Match

To overcome the limitations of lookbehind, Perl 5.10, PCRE 7.2, Ruby 2.0, and Boost 1.42 introduced a new feature that can be used instead of lookbehind for its most common purpose. \K keeps the text matched so far out of the overall regex match.

h\Kd matches only the second d in adhd, so it can be used instead of the lookbehind (?<=h)d.

[^a]\Kb is the same as (?<=[^a])b, which are both different from (?<!a)b.
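As far as I know, Python's standard re module does not implement \K, so the sketch below uses the lookbehind equivalents named above. It also shows where (?<=[^a])b and (?<!a)b differ: at the start of the string there is no character before the b, so the positive lookbehind fails while the negative one succeeds. The test strings and variable names are illustrative.

```python
import re

second_d = re.search(r"(?<=h)d", "adhd")
print(second_d.span())  # (3, 4)

text = "b cab xb"

# Positive lookbehind needs some character other than a before the b.
positive = [m.start() for m in re.finditer(r"(?<=[^a])b", text)]

# Negative lookbehind only needs the absence of an a before the b.
negative = [m.start() for m in re.finditer(r"(?<!a)b", text)]

print(positive)  # [7]
print(negative)  # [0, 7]
```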