Character Classes, Boundaries, Optional Items, and Quantifiers in Regular Expressions

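Character Matching
[A-Za-z_][A-Za-z_0-9]* matches an identifier in a programming language;
0[xX][A-Fa-f0-9]+ matches a C-style hexadecimal number;
[^0-9\r\n] matches any character that is neither a digit nor a line break character;
q[^u] matches a q followed by one character that is not u; that character is part of the match, so a q at the very end of the string does not match;
q(?!u) matches a q that is not followed by u; only the q itself is matched;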
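The difference between q[^u] and q(?!u) is easiest to see by running both. The following is a minimal sketch using Python's standard re module; the constant names and sample strings are made up for illustration.

```python
import re

# Identifier: a letter or underscore, then letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

# q followed by a consumed character that is not u.
Q_NOT_U_CLASS = re.compile(r"q[^u]")
# q not followed by u; the lookahead consumes nothing.
Q_NOT_U_LOOKAHEAD = re.compile(r"q(?!u)")

print(IDENTIFIER.fullmatch("_tmp1") is not None)   # True
print(IDENTIFIER.fullmatch("1tmp") is not None)    # False

print(Q_NOT_U_CLASS.findall("Iraq"))        # [] (no character after q)
print(Q_NOT_U_LOOKAHEAD.findall("Iraq"))    # ['q']
print(Q_NOT_U_CLASS.findall("qatar"))       # ['qa']
print(Q_NOT_U_LOOKAHEAD.findall("qatar"))   # ['q']
```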

Special syntax: everything between \Q and \E is taken literally, so \Q*\d+*\E matches the text *\d+*
(\Q*\d+*\E)+ matches *\d+**\d+*
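Python's re module has no \Q...\E syntax. As a swapped-in technique, re.escape plays a similar role by escaping every metacharacter in a literal string. A minimal sketch, with variable names chosen only for illustration:

```python
import re

literal = r"*\d+*"                # the text to find verbatim
pattern = re.escape(literal)      # escapes *, \ and + so they lose their meaning

# One or more repetitions of the literal, like (\Q*\d+*\E)+.
repeated = re.compile("(?:" + pattern + ")+")

print(re.fullmatch(pattern, r"*\d+*") is not None)      # True
print(repeated.fullmatch(r"*\d+**\d+*") is not None)    # True
print(re.fullmatch(pattern, "*123*") is not None)       # False
```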

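In most regex flavors, the only characters that are special inside a character class are the closing bracket ], the backslash \, the caret ^, and the hyphen -. Other metacharacters need no extra escaping there. []x] matches a closing bracket or an x; [^]x] matches any character that is not a closing bracket or an x; [-x] and [x-] match an x or a hyphen; [^-x] and [^x-] match any character that is not an x or a hyphen; [.*$] matches a dot, an asterisk, or a dollar sign;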
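These placement rules can be checked quickly in Python, whose re module follows the same conventions for ] and - inside a class. The helper name class_hits is hypothetical.

```python
import re

def class_hits(pattern, text):
    # Return every single character in text that the class matches.
    return re.findall(pattern, text)

print(class_hits(r"[]x]", "a]xb"))       # [']', 'x']
print(class_hits(r"[^]x]", "a]xb"))      # ['a', 'b']
print(class_hits(r"[-x]", "a-xb"))       # ['-', 'x']
print(class_hits(r"[x-]", "a-xb"))       # ['-', 'x']
print(class_hits(r"[.*$]", "a.b*c$d"))   # ['.', '*', '$']
```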

([0-9])\1+ matches a digit repeated two or more times in a row, such as 222;
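A short Python sketch of the backreference at work; the sample string is invented for the example.

```python
import re

# Group 1 captures one digit; \1+ requires that same digit at least once more.
REPEATED_DIGIT = re.compile(r"([0-9])\1+")

runs = [m.group(0) for m in REPEATED_DIGIT.finditer("12 33 4555 67")]
print(runs)   # ['33', '555']
```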

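Character Class Subtraction (supported only by XPath, .NET, and JGsoft)
[a-z-[aeiuo]] matches a single consonant; it is equivalent to [b-df-hj-np-tv-z];
Nested: [0-9-[0-6-[0-3]]] matches any of 0123789; it is equivalent to [0-37-9];
[^1234-[3456]] is read as "(not 1234) minus 3456", so it matches any character other than the digits 1, 2, 3, 4, 5, and 6;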

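Character Class Intersection (supported only by Java, Ruby, and JGsoft)
[a-z&&[^aeiuo]] matches a single consonant; it is equivalent to [b-df-hj-np-tv-z];
Negated: [^1234&&[3456]] matches 5 or 6 when the negation is applied before the intersection, in which case it is equivalent to [56] (flavors differ on this precedence);
[1234&&[^3456]] matches 1 or 2; it is equivalent to [12];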
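Flavors without subtraction or intersection can often get the same set with a negative lookahead placed in front of the base class. This is a swapped-in technique rather than the syntax described above; the sketch below uses Python's re module, and the constant names are hypothetical.

```python
import re

# Same set as [a-z-[aeiou]] or [a-z&&[^aeiou]]: a lowercase letter that is not a vowel.
CONSONANT = re.compile(r"(?![aeiou])[a-z]")

# Same set as [0-9-[0-6-[0-3]]]: the digits minus 4, 5, and 6.
DIGIT_NOT_4_TO_6 = re.compile(r"(?![4-6])[0-9]")

print(CONSONANT.findall("regex"))                # ['r', 'g', 'x']
print(DIGIT_NOT_4_TO_6.findall("0123456789"))    # ['0', '1', '2', '3', '7', '8', '9']
```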

Shorthand Character Classes
\d is short for [0-9];
\w stands for “word character”; in every flavor it matches at least the ASCII characters [A-Za-z0-9_];
\s stands for “whitespace character”; it includes [ \t\r\n\f];

Negated Shorthand Character Classes
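\D is the same as [^\d], \W is short for [^\w], and \S is the equivalent of [^\s];
[\D\S] is not the same as [^\d\s]. [^\d\s] matches any character that is neither a digit nor whitespace, whereas [\D\S] matches any character that is not a digit or is not whitespace. Since a digit is not whitespace and whitespace is not a digit, [\D\S] matches any character at all: digit, whitespace, or otherwise;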
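The contrast between [\D\S] and [^\d\s] shows up immediately on a small sample. A minimal Python sketch, with a made-up sample string:

```python
import re

sample = "a1 \t-"   # letter, digit, space, tab, hyphen

any_char = re.findall(r"[\D\S]", sample)     # non-digit OR non-whitespace
neither = re.findall(r"[^\d\s]", sample)     # NOT digit AND NOT whitespace

print(any_char)   # ['a', '1', ' ', '\t', '-']
print(neither)    # ['a', '-']
```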

\h matches horizontal whitespace, which includes the tab and all characters in the “space separator” Unicode category;
\v matches “vertical whitespace”, which includes all characters treated as line breaks in the Unicode standard; it is the same as [\n\cK\f\r\x85\x{2028}\x{2029}];

XML Character Classes (supported only by XML Schema, XPath, and JGsoft V2)
\i matches any character that may be the first character of an XML name.
\c matches any character that may occur after the first character in an XML name.
\i\c* matches an XML name like xml:schema.
<\i\c*\s*> matches an opening XML tag without any attributes.
</\i\c*\s*> matches any closing tag.
<\i\c*(\s+\i\c*\s*=\s*("[^"]*"|'[^']*'))*\s*> matches an opening tag with any number of attributes.
<(\i\c*(\s+\i\c*\s*=\s*("[^"]*"|'[^']*'))*|/\i\c*)\s*> matches either an opening tag with attributes or a closing tag.

Matching XML Names in Flavors Without XML Character Classes
If your XML files are plain ASCII, you can use [_:A-Za-z] for \i and [-._:A-Za-z0-9] for \c.

If you want to allow all Unicode characters that the XML standard allows, then instead of \i you would use: [:A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]
Instead of \c you would use: [-.0-9:A-Z_a-z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F\u2040\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]
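As a rough illustration of the ASCII-only substitution, the sketch below builds an XML name pattern and a simple opening-tag pattern in Python. It assumes plain ASCII input; NAME_START, NAME_CHAR, XML_NAME, and OPEN_TAG are hypothetical names introduced for the example.

```python
import re

NAME_START = r"[_:A-Za-z]"        # ASCII stand-in for \i
NAME_CHAR = r"[-._:A-Za-z0-9]"    # ASCII stand-in for \c

# Equivalent of \i\c* for ASCII documents.
XML_NAME = re.compile(NAME_START + NAME_CHAR + "*")
# Equivalent of <\i\c*\s*>: an opening tag without attributes.
OPEN_TAG = re.compile("<" + NAME_START + NAME_CHAR + r"*\s*>")

print(XML_NAME.fullmatch("xml:schema") is not None)   # True
print(XML_NAME.fullmatch("1abc") is not None)         # False
print(OPEN_TAG.fullmatch("<body >") is not None)      # True
print(OPEN_TAG.fullmatch("</body>") is not None)      # False
```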

Use The Dot Sparingly
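[01]\d[- /.][0-3]\d[- /.]\d\d matches a date in mm/dd/yy format;
"[^"\r\n]*" matches a double-quoted string on a single line; it is more precise than ".*", whose dot can run past the closing quote to a later quote on the same line;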

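To see why a negated character class is usually safer than the dot for quoted strings, compare the two on a line with two quoted items. A minimal Python sketch with an invented sample string:

```python
import re

text = 'say "one" and "two"'

greedy_dot = re.findall(r'".*"', text)            # dot runs to the last quote
negated_class = re.findall(r'"[^"\r\n]*"', text)  # stops at the first closing quote

print(greedy_dot)      # ['"one" and "two"']
print(negated_class)   # ['"one"', '"two"']

# The date pattern names its separators explicitly instead of using a dot.
DATE = re.compile(r"[01]\d[- /.][0-3]\d[- /.]\d\d")
print(DATE.fullmatch("02/12/03") is not None)   # True
print(DATE.fullmatch("02512703") is not None)   # False
```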
Anchors
^\d+$ matches only if the entire string consists of digits; ^\s+ matches leading whitespace and \s+$ matches trailing whitespace;

\A only ever matches at the start of the string. Likewise, \Z only ever matches at the end of the string (in many flavors also just before a final line break, with \z reserved for the very end). These two tokens never match at line breaks in the middle of the string. The GNU extensions to POSIX regular expressions use \` (backtick) to match the start of the string, and \' (single quote) to match the end of the string.
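The anchors behave as follows in Python's re module, used here only as one example flavor. Note two Python specifics: $ also matches just before a trailing line feed, and \Z matches only at the very end of the string. The names below are hypothetical.

```python
import re

DIGITS_ONLY = re.compile(r"^\d+$")
STRICT_DIGITS = re.compile(r"\A\d+\Z")

def strip_whitespace(text):
    # Remove leading whitespace (^\s+) and trailing whitespace (\s+$).
    return re.sub(r"^\s+|\s+$", "", text)

print(DIGITS_ONLY.search("12345") is not None)      # True
print(DIGITS_ONLY.search("123a") is not None)       # False
print(DIGITS_ONLY.search("123\n") is not None)      # True: $ matches before the final \n
print(STRICT_DIGITS.search("123\n") is not None)    # False
print(repr(strip_whitespace("  hello world \n")))   # 'hello world'

# In multi-line mode ^ matches after each line break, \A still only at the start.
lines = "one\ntwo"
print(re.findall(r"^\w+", lines, re.MULTILINE))     # ['one', 'two']
print(re.findall(r"\A\w+", lines, re.MULTILINE))    # ['one']
```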

Word Boundaries
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length. \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b.
The characters matched by \w are the ones that word boundaries treat as word characters. \B is the negated version of \b. \B matches at any position between two word characters as well as at any position between two non-word characters.
In addition, GNU uses its own syntax for start-of-word and end-of-word boundaries. \< matches at the start of a word, like Tcl’s \m. \> matches at the end of a word, like Tcl’s \M.
The POSIX standard defines [[:<:]] as a start-of-word boundary, and [[:>:]] as an end-of-word boundary.
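A small Python sketch of a “whole words only” search with \b, and of \B finding the same letters inside a longer word. The sample text is made up.

```python
import re

text = "cat concat cats cat."

whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]
inside_word = [m.start() for m in re.finditer(r"\Bcat\b", text)]

print(whole_words)   # [0, 16]: the standalone "cat" at the start and before the period
print(inside_word)   # [7]: the "cat" at the end of "concat"
```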

Optional Items
colou?r matches both colour and color.
Nov(ember)? matches Nov and November.

You can also use curly braces to make something optional. colou{0,1}r is the same as colou?r.

If you apply the regex Feb 23(rd)? to the string Today is Feb 23rd, 2003, the match is always Feb 23rd and not Feb 23, because the question mark is greedy. You can make the question mark lazy (i.e. turn off the greediness) by putting a second question mark after the first: Feb 23(rd)?? then matches Feb 23.
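The greedy and lazy question marks can be compared directly. A minimal Python sketch using the same subject string; the variable names are only illustrative.

```python
import re

text = "Today is Feb 23rd, 2003"

greedy = re.search(r"Feb 23(rd)?", text).group(0)
lazy = re.search(r"Feb 23(rd)??", text).group(0)

print(greedy)   # Feb 23rd
print(lazy)     # Feb 23

# Optional single character and optional group.
print(re.findall(r"colou?r", "color colour"))         # ['color', 'colour']
print(re.findall(r"colou{0,1}r", "color colour"))     # ['color', 'colour']
print(re.findall(r"Nov(?:ember)?", "Nov November"))   # ['Nov', 'November']
```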

Non-Printable Characters
Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A). More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), and \f (form feed, 0x0C). Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.

In some flavors, \v matches the vertical tab (ASCII 0x0B). In other flavors, \v is a shorthand that matches any vertical whitespace character.

In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use \xA9.

\R is a special escape that matches any line break, including Unicode line breaks. It treats a CRLF pair as one line break, and it can also match a lone CR or a lone LF.

Unicode Characters
Depending on the flavor, you can use \uFFFF or \x{FFFF} to insert a Unicode character;
\x{00E0} matches à; \x{0061} matches a

You can match a single character belonging to the “letter” category with \p{L}. You can match a single character not belonging to that category with \P{L}.

Reference: https://www.regular-expressions.info/unicode.html

Mode Modifiers
You can add the following mode modifiers to the start of the regex. To specify multiple modes, simply put them together as in (?ismx).

(?i) makes the regex case insensitive. (?x) turns on free-spacing mode. (?xx) turns on free-spacing mode, also in character classes. (?s) for “single line mode” makes the dot match all characters, including line breaks. (?m) for “multi-line mode” makes the caret and dollar match at the start and end of each line in the subject string. In most flavors (?m) does not affect the dot; Ruby is an exception, where (?m) makes the dot match line breaks.

You can turn off modes by preceding them with a minus sign. All modes after the minus sign will be turned off. E.g. (?i-sm) turns on case insensitivity, and turns off both single-line mode and multi-line mode. In flavors that allow modifiers in the middle of the regex, (?i)te(?-i)st matches test and TEst, but not teST or TEST.

Instead of using two modifiers, one to turn an option on and one to turn it off, you can use a modifier span. (?i)caseless(?-i)cased(?i)caseless is equivalent to (?i)caseless(?-i:cased)caseless; both match CaseLESScasedCASELess. This syntax resembles that of the non-capturing group (?:group).
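Python's re module is one flavor that does not accept a bare (?-i) in the middle of a pattern, but it does accept modifier spans, so the same effect can be sketched there. The constant names below are hypothetical.

```python
import re

# Span form: only "te" is case insensitive, "st" must be lowercase.
TE_ST = re.compile(r"(?i:te)st")

# Global (?i) at the start, switched off for "cased" by a span.
SPAN = re.compile(r"(?i)caseless(?-i:cased)caseless")

for word in ["test", "TEst", "teST", "TEST"]:
    print(word, TE_ST.fullmatch(word) is not None)
# test True, TEst True, teST False, TEST False

print(SPAN.fullmatch("CaseLESScasedCASELess") is not None)   # True
print(SPAN.fullmatch("CaseLESSCASEDCASELess") is not None)   # False
```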

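Possessive Quantifiers
A possessive quantifier does not have to remember any backtracking positions, so it can be considerably faster than its greedy or lazy counterpart, especially when a match fails. You make a quantifier possessive by placing an extra + after it. * is greedy, *? is lazy, and *+ is possessive. ++, ?+, and {n,m}+ are possessive as well.

"[^"]*+" matches "abc"
A possessive quantifier can change the outcome of a match attempt. For example, ".*" matches "abc" in "abc"x, but ".*+" cannot match "abc"x at all: .*+ consumes everything through the end of the string and never gives back the closing quote.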
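Python 3.11 and later accept possessive quantifiers directly. To keep the sketch runnable on older versions too, it emulates them with a swapped-in technique: a lookahead that captures the greedy run, followed by a backreference, written (?=(X*))\1. The lookahead is never re-entered on backtracking, which gives the possessive behavior. The constant names are hypothetical.

```python
import re

subject = '"abc"x'

GREEDY = re.compile(r'".*"')
# Emulates ".*+": the dot run is captured once and cannot be shortened.
POSSESSIVE_DOT = re.compile(r'"(?=(.*))\1"')
# Emulates "[^"]*+": the run stops before the closing quote by itself.
POSSESSIVE_CLASS = re.compile(r'"(?=([^"]*))\1"')

print(GREEDY.search(subject).group(0))             # "abc"
print(POSSESSIVE_DOT.search(subject))              # None
print(POSSESSIVE_CLASS.search(subject).group(0))   # "abc"
```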

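Using Atomic Groups Instead of Possessive Quantifiers
Basically, use (?>X*) in place of X*+. (?:a|b)*+ is equivalent to (?>(?:a|b)*) but not to (?>a|b)*.
(?:a|b)*+b and (?>(?:a|b)*)b both fail to match b. a|b matches the b, the star is satisfied, and the possessive quantifier or atomic group discards all backtracking positions, so the final b in the regex has nothing left to match. (?>a|b)*b, on the other hand, can match b.
In the regex (?>a|b)*b, the atomic group forces the alternation to give up its backtracking positions. This means that if a matches and the rest of the regex fails, the engine will not go back to try b. Since the star is outside the group, it is a normal, greedy star. When the final b fails, the greedy star backtracks to zero iterations. Then the final b matches the b in the subject string.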
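The difference between an atomic group around the star and an atomic group inside the star can be reproduced in Python. As an assumption made for portability, the sketch emulates (?>X) with the lookahead-plus-backreference form (?=(X))\1 instead of native atomic groups, which Python only supports from 3.11 on. The constant names are hypothetical.

```python
import re

# Emulates (?>(?:a|b)*)b, i.e. (?:a|b)*+b: the whole run is locked in.
ATOMIC_STAR = re.compile(r"(?=((?:a|b)*))\1b")

# Emulates (?>a|b)*b: each alternation is locked in, but the star
# outside the group is still free to give back iterations.
ATOMIC_ALT = re.compile(r"(?:(?=(a|b))\1)*b")

print(ATOMIC_STAR.fullmatch("b"))               # None
print(ATOMIC_ALT.fullmatch("b") is not None)    # True
print(ATOMIC_ALT.fullmatch("aab") is not None)  # True
```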

Branch Reset Groups
The syntax is (?|regex) where (?| opens the group and regex is any regular expression.

The regex (?|(a)|(b)|(c)) consists of a single branch reset group with three alternatives. This regex matches either a, b, or c. The regex has only a single capturing group with number 1 that is shared by all three alternatives. After the match, $1 holds a, b, or c. For example, (?|(a)|(b)|(c))\1 matches aa, bb, or cc.

The alternatives in the branch reset group don’t need to have the same number of capturing groups. (?|abc|(d)(e)(f)|g(h)i) has three capturing groups. When this regex matches abc, all three groups are empty. When def is matched, $1 holds d, $2 holds e and $3 holds f. When ghi is matched, $1 holds h while the other two are empty.

(x)(?|abc|(d)(e)(f)|g(h)i)(y) defines five capturing groups. (x) is group 1, (d) and (h) are group 2, (e) is group 3, (f) is group 4, and (y) is group 5.

(?'before'x)(?|abc|(?'left'd)(?'middle'e)(?'right'f)|g(?'left'h)i)(?'after'y) is the same as the previous regex. It names the five groups “before”, “left”, “middle”, “right”, and “after”. Notice that because the third alternative has only one capturing group, its name must be the name of the first group in the other alternatives.

If you omit the names in some alternatives, the groups will still share the names with the other alternatives. In the regex (?'before'x)(?|abc|(?'left'd)(?'middle'e)(?'right'f)|g(h)i)(?'after'y) the group (h) is still named “left” because the branch reset group makes it share the name and number of (?'left'd).

Free-Spacing Regular Expressions
You can put (?x) at the very start of the regex to make the remainder of the regex free-spacing.
In free-spacing mode, whitespace between regular expression tokens is ignored. Whitespace includes spaces, tabs, and line breaks. a b c is the same as abc in free-spacing mode. But \ d and \d are not the same.
Likewise, grouping modifiers cannot be broken up. (?>atomic) is the same as (?> ato mic ) and as ( ?>ato mic). They all match the same atomic group. They are not the same as (? >atomic), which is a syntax error.
The ?> grouping modifier is a single element in the regex syntax, and must stay together.
Perl 5.26 and PCRE 10.30 also add a new mode modifier (?xx) which enables free-spacing both inside and outside character classes. (?x) turns on free-spacing outside character classes like before, but also turns off free-spacing inside character classes. (?-x) and (?-xx) both completely turn off free-spacing.
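Free-spacing mode is mainly useful for laying out and commenting a longer pattern. The sketch below uses Python's re module, where (?x) at the start of the pattern is equivalent to the re.VERBOSE flag and whitespace inside a character class stays literal. The DATE name and sample strings are invented for the example.

```python
import re

DATE = re.compile(r"""(?x)
    [01]\d      # month
    [- /.]      # separator; the space inside the class is still literal
    [0-3]\d     # day
    [- /.]      # separator
    \d\d        # two-digit year
""")

print(DATE.fullmatch("02/12/03") is not None)   # True
print(DATE.fullmatch("02 12 03") is not None)   # True

# Whitespace between tokens is ignored, but "\ d" is an escaped space plus d.
print(re.fullmatch(r"(?x) a b c", "abc") is not None)   # True
print(re.fullmatch(r"(?x)\ d", " d") is not None)       # True
print(re.fullmatch(r"(?x)\ d", "7") is not None)        # False
```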

All rights reserved. Please credit the source, luowei.github.io, when reposting.