 03 June 2022

# Groups, Backreferences, and Forward References in Regular Expressions

## Backreferences

`<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>`

This matches `<B><I>bold italic</I></B>` in `Testing <B><I>bold italic</I></B> text`.
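A minimal sketch of the same match using Python's `re` module. The variable names are only illustrative.

```python
import re

# Group 1 captures the tag name; \1 requires the closing tag to repeat it.
TAG_PAIR = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>")

text = "Testing <B><I>bold italic</I></B> text"
match = TAG_PAIR.search(text)

matched_text = match.group(0)  # the whole element, opening tag through closing tag
tag_name = match.group(1)      # the text that \1 had to repeat

print(matched_text)
print(tag_name)
```

The lazy `.*?` skips past `</I>` because `\1` holds `B`, so only `</B>` can close the match.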

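`(q?)b\1` matches `b`: the group always participates and captures an empty string, so `\1` matches the empty string.
`(q)?b\1` does not match `b`: the optional group never participates, so in most flavors `\1` fails.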
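The difference between an empty capture and a group that did not participate can be checked with a short sketch in Python's `re` (names are illustrative):

```python
import re

inner_optional = re.compile(r"(q?)b\1")  # group always participates, may capture ""
outer_optional = re.compile(r"(q)?b\1")  # group may not participate at all

b_inner = inner_optional.fullmatch("b")    # \1 is "", so this matches
b_outer = outer_optional.fullmatch("b")    # \1 refers to an unset group
qbq_outer = outer_optional.fullmatch("qbq")  # group captured "q", \1 repeats it

print(b_inner is not None, b_outer is not None, qbq_outer is not None)
```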

`(one)\7` is an error, because group 7 does not exist.

Forward reference: `(\2two|(one))+` matches `oneonetwo`.

Nested reference: `(\1two|(one))+` and `(?>(\1two|(one)))+` match `oneonetwo` in flavors that support nested references.
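The three cases above depend on the regex flavor. As far as I can tell, Python's built-in `re` rejects all three patterns at compile time, so the `oneonetwo` results describe other engines. The following sketch only records whether compilation fails; `compile_error` is a hypothetical helper written for this note.

```python
import re

def compile_error(pattern):
    """Return the re.error message for a pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

nonexistent_group = compile_error(r"(one)\7")        # group 7 does not exist
forward_reference = compile_error(r"(\2two|(one))+")  # \2 appears before group 2
nested_reference = compile_error(r"(\1two|(one))+")   # \1 appears inside group 1

for name, message in [
    ("nonexistent group", nonexistent_group),
    ("forward reference", forward_reference),
    ("nested reference", nested_reference),
]:
    print(name, "->", message)
```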

`<(?<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</\k{tag}>`
`<(?'tag'[A-Z][A-Z0-9]*)\b[^>]*>.*?</\g{tag}>`
`<(?'tag'[A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>`
`<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>`

In Python:
`<(?P<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</(?P=tag)>`
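A short sketch of the Python named-group syntax applied to the earlier test string. A named group is still numbered, so it can be read by name or by index.

```python
import re

# (?P<tag>...) names the group; (?P=tag) is the named backreference.
NAMED_TAG_PAIR = re.compile(r"<(?P<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</(?P=tag)>")

match = NAMED_TAG_PAIR.search("Testing <B><I>bold italic</I></B> text")

whole = match.group(0)
named_capture = match.group("tag")
numbered_capture = match.group(1)  # same group, addressed by number

print(whole, named_capture, numbered_capture)
```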

`(a)(b)(c)\g<-1>`
`(a)(b)(c)\k'-1'`
`(a)(b)(c)\g'-1'`
`(a)(b)(c)\k-1`
`(a)(b)(c)\g-1`
`(a)(b)(c)\k{-1}`
`(a)(b)(c)\g{-1}`

`(a)(b)(c)\k<-3>` matches `abca`.
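To my knowledge, Python's built-in `re` has no relative backreference syntax, so the sketch below uses the absolute equivalents. After three groups, `-1` counts back to group 3 and `-3` counts back to group 1.

```python
import re

# Absolute equivalents of the relative references above:
# \g{-1} after (a)(b)(c) means group 3; \k<-3> means group 1.
last_group = re.compile(r"(a)(b)(c)\3")
first_group = re.compile(r"(a)(b)(c)\1")

abcc = last_group.fullmatch("abcc")        # \3 repeats "c"
abca = first_group.fullmatch("abca")       # \1 repeats "a"
abca_wrong = last_group.fullmatch("abca")  # \3 needs "c", finds "a"

print(abcc is not None, abca is not None, abca_wrong is not None)
```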

### Lookahead

The negative lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point.
To match a `q` that is not followed by a `u`, use negative lookahead: `q(?!u)`.

The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign. It works the same way: `q(?=u)` matches a `q` that is followed by a `u`, without making the `u` part of the match.
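Both lookahead forms can be tried with a small Python sketch. The sample words are my own choice.

```python
import re

no_u_after_q = re.compile(r"q(?!u)")
u_after_q = re.compile(r"q(?=u)")

iraq = no_u_after_q.search("Iraq")      # q at the end, nothing follows it
quit_neg = no_u_after_q.search("quit")  # q is followed by u, so no match
quit_pos = u_after_q.search("quit")     # matches q only, u is not consumed

print(iraq.span(), quit_neg, quit_pos.group(0))
```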

If you want to store the match of the regex inside a lookahead, you have to put capturing parentheses around the regex inside the lookahead, like this: `(?=(regex))`.
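As a sketch of capturing inside a lookahead, the pattern below uses `\d+` as a stand-in for `regex`. The overall match is empty because the lookahead consumes nothing, but the group still holds the text it saw.

```python
import re

peek_digits = re.compile(r"(?=(\d+))")
match = peek_digits.search("abc123")

overall = match.group(0)   # zero-length: lookahead does not consume text
captured = match.group(1)  # text seen by the capturing group inside it
position = match.start()   # where the lookahead first succeeded

print(repr(overall), captured, position)
```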

### Lookbehind

The construct for positive lookbehind is `(?<=text)`: a pair of parentheses, with the opening parenthesis followed by a question mark, “less than” symbol, and an equals sign.
Negative lookbehind is written as `(?<!text)`, using an exclamation point instead of an equals sign.

`(?<!a)b` matches a “b” that is not preceded by an “a”, using negative lookbehind. It doesn’t match `cab`, but matches the b (and only the b) in `bed` or `debt`.
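The three words from the paragraph above, checked with a short Python sketch:

```python
import re

b_not_after_a = re.compile(r"(?<!a)b")

cab = b_not_after_a.search("cab")    # b is preceded by a
bed = b_not_after_a.search("bed")    # b at the start, nothing precedes it
debt = b_not_after_a.search("debt")  # b is preceded by e

print(cab, bed.span(), debt.span())
```

The spans are one character wide, which shows that only the `b` is part of the match.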

If you want to find a word not ending with an “s”, you could use `\b\w+(?<!s)\b`. This is definitely not the same as `\b\w+[^s]\b`. When applied to John’s, the former matches John and the latter matches John’ (including the apostrophe). The correct regex without using lookbehind is `\b\w*[^s\W]\b` (star instead of plus, and \W in the character class).
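The comparison can be reproduced with the following sketch, which applies the three patterns to `John's` (written with a plain ASCII apostrophe here).

```python
import re

word = "John's"

with_lookbehind = re.search(r"\b\w+(?<!s)\b", word).group(0)
naive_class = re.search(r"\b\w+[^s]\b", word).group(0)      # [^s] consumes the apostrophe
fixed_class = re.search(r"\b\w*[^s\W]\b", word).group(0)    # last char: a word char other than s

print(with_lookbehind, naive_class, fixed_class)
```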

### Lookaround

The fact that lookaround is zero-length automatically makes it atomic.

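`(?=(\d+))\w+\1` matches `56x56` in `456x56`. At the start of the string the lookahead captures `456`; because lookaround is atomic, the engine cannot backtrack into it to try `45` or `4`, so that attempt fails and the match begins at the second character instead.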
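A brief Python sketch of this example, showing where the match starts and what the group inside the lookahead captured:

```python
import re

match = re.search(r"(?=(\d+))\w+\1", "456x56")

matched_text = match.group(0)
captured_digits = match.group(1)  # captured by the lookahead at the match start
start_index = match.start()

print(matched_text, captured_digits, start_index)
```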

Find any word between 6 and 12 letters long containing either “cat”, “dog” or “mouse”: `\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*`.
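A sketch of this pattern on a sample sentence of my own. The lookahead checks the word length first, then the rest of the pattern looks for one of the three substrings.

```python
import re

PET_WORD = re.compile(r"\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*")

# "cat" is too short (3 letters); "concatenation" is too long (13 letters).
text = "bobcat cat dormouse hotdogs concatenation"

# finditer is used because findall would return only the captured group.
words = [m.group(0) for m in PET_WORD.finditer(text)]

print(words)
```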

### Keep Text out of The Match

To overcome the limitations of lookbehind, Perl 5.10, PCRE 7.2, Ruby 2.0, and Boost 1.42 introduced a new feature that can be used instead of lookbehind for its most common purpose. `\K` keeps the text matched so far out of the overall regex match.

`h\Kd` matches only the second `d` in `adhd`. It can be used instead of the lookbehind `(?<=h)d`.

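`[^a]\Kb` is the same as `(?<=[^a])b`, and both differ from `(?<!a)b`: the first two require some character other than `a` before the `b`, while the negative lookbehind also matches a `b` at the very start of the string.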
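Python's built-in `re` module does not support `\K`, as far as I know, so this sketch uses the lookbehind equivalents named above to show the same behavior.

```python
import re

# Lookbehind equivalent of h\Kd: only the d after h is matched.
second_d = re.search(r"(?<=h)d", "adhd").span()

positive = re.compile(r"(?<=[^a])b")  # needs a non-a character before b
negative = re.compile(r"(?<!a)b")     # only forbids an a before b

# A lone "b" has no preceding character at all.
lone_b_positive = positive.search("b")
lone_b_negative = negative.search("b")

# With a preceding non-a character, both forms agree.
cb_positive = positive.search("cb")
cb_negative = negative.search("cb")

print(second_d, lone_b_positive, lone_b_negative is not None)
```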