===IEEE POSIX Standard <span class="anchor" id="POSIX"></span>===
The [[Institute of Electrical and Electronics Engineers|IEEE]] [[POSIX]] standard has three sets of compliance: '''BRE''' (Basic Regular Expressions),<ref>ISO/IEC 9945-2:1993 ''Information technology – Portable Operating System Interface (POSIX) – Part 2: Shell and Utilities'', successively revised as ISO/IEC 9945-2:2002 ''Information technology – Portable Operating System Interface (POSIX) – Part 2: System Interfaces'', ISO/IEC 9945-2:2003, and currently ISO/IEC/IEEE 9945:2009 ''Information technology – Portable Operating System Interface (POSIX) Base Specifications, Issue 7''</ref> '''ERE''' (Extended Regular Expressions), and '''SRE''' (Simple Regular Expressions). SRE is [[deprecation|deprecated]],<ref>The Single Unix Specification (Version 2)</ref> in favor of BRE, as both provide [[backward compatibility]]. The subsection below covering the ''character classes'' applies to both BRE and ERE.

BRE and ERE work together. ERE adds <code>?</code>, <code>+</code>, and <code>|</code>, and it removes the need to escape the metacharacters <code>( )</code> and <code>{ }</code>, whose escaping is ''required'' in BRE. Furthermore, as long as the POSIX standard syntax for regexes is adhered to, there can be, and often is, additional syntax to serve specific (yet POSIX-compliant) applications. Although POSIX.2 leaves some implementation specifics undefined, BRE and ERE provide a "standard" which has since been adopted as the default syntax of many tools, where the choice of BRE or ERE modes is usually a supported option. For example, [[GNU]] <code>grep</code> has the following options: "<code>grep -E</code>" for ERE, "<code>grep -G</code>" for BRE (the default), and "<code>grep -P</code>" for [[Perl]] regexes.

Perl regexes have become a de facto standard, having a rich and powerful set of atomic expressions. Perl has no "basic" or "extended" levels.
As in POSIX EREs, <code>( )</code> and <code>{ }</code> are treated as metacharacters unless escaped; other metacharacters are known to be literal or symbolic based on context alone. Additional functionality includes [[#Lazy matching|lazy matching]], [[#backreferences|backreferences]], named capture groups, and [[recursion (computer science)|recursive]] patterns.

====POSIX basic and extended====
In the [[POSIX]] standard, the Basic Regular Expression ('''BRE''') syntax requires that the [[metacharacter]]s <code>( )</code> and <code>{ }</code> be designated <code>\(\)</code> and <code>\{\}</code>, whereas the Extended Regular Expression ('''ERE''') syntax does not.

{| class="wikitable"
|-
! Metacharacter
! Description
|- valign="top"
!<code>^</code>
|Matches the starting position within the string. In line-based tools, it matches the starting position of any line.
|- valign="top"
!<code>.</code>
|Matches any single character (many applications exclude [[newline]]s, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, <code>a.c</code> matches "abc", etc., but <code>[a.c]</code> matches only "a", ".", or "c".
|- valign="top"
!<code>[ ]</code>
|A bracket expression. Matches a single character that is contained within the brackets. For example, <code>[abc]</code> matches "a", "b", or "c". <code>[a-z]</code> specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: <code>[abcx-z]</code> matches "a", "b", "c", "x", "y", or "z", as does <code>[a-cx-z]</code>.
The <code>-</code> character is treated as a literal character if it is the last or the first (after the <code>^</code>, if present) character within the brackets: <code>[abc-]</code>, <code>[-abc]</code>, <code>[^-abc]</code>. Backslash escapes are not allowed.
The <code>]</code> character can be included in a bracket expression if it is the first (after the <code>^</code>, if present) character: <code>[]abc]</code>, <code>[^]abc]</code>.
|- valign="top"
!<code>[^ ]</code>
|Matches a single character that is not contained within the brackets. For example, <code>[^abc]</code> matches any character other than "a", "b", or "c". <code>[^a-z]</code> matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed.
|- valign="top"
!<code>$</code>
|Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
|- valign="top"
!<code>( )</code>
|Defines a marked subexpression, also called a capturing group, which is essential for extracting the desired part of the text (see also the next entry, <code>\''n''</code>). ''BRE mode requires {{nowrap|<code>\( \)</code>}}.''
|- valign="top"
!<code>\''n''</code>
|Matches what the ''n''th marked subexpression matched, where ''n'' is a digit from 1 to 9. This construct is defined in the POSIX standard.<ref>{{cite book |section-url=https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_06 |publisher=The Open Group |title=The Open Group Base Specifications Issue 7, 2018 edition |section=9.3.6 BREs Matching Multiple Characters |year=2017 |access-date=December 10, 2023}}</ref> Some tools allow referencing more than nine capturing groups. Also known as a back-reference, this feature is supported in BRE mode.
|- valign="top"
!<code>*</code>
|Matches the preceding element zero or more times. For example, <code>ab*c</code> matches "ac", "abc", "abbbc", etc. <code>[xyz]*</code> matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. <code>(ab)*</code> matches "", "ab", "abab", "ababab", and so on.
|- valign="top"
!{{nowrap|<code>{''m'',''n''}</code>}}
|Matches the preceding element at least ''m'' and not more than ''n'' times. For example, <code>a{3,5}</code> matches only "aaa", "aaaa", and "aaaaa". This is not found in a few older instances of regexes. BRE mode requires <code>{{nowrap|\{''m'',''n''\}}}</code>.
|}

'''Examples:'''
* <code>.at</code> matches any three-character string ending with "at", including "hat", "cat", "bat", "4at", "#at" and " at" (starting with a space).
* <code>[hc]at</code> matches "hat" and "cat".
* <code>[^b]at</code> matches all strings matched by <code>.at</code> except "bat".
* <code>[^hc]at</code> matches all strings matched by <code>.at</code> other than "hat" and "cat".
* <code>^[hc]at</code> matches "hat" and "cat", but only at the beginning of the string or line.
* <code>[hc]at$</code> matches "hat" and "cat", but only at the end of the string or line.
* <code>\[.\]</code> matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]", "[b]", "[7]", "[@]", "[]]", and "[ ]" (bracket space bracket).
* <code>s.*</code> matches s followed by zero or more characters, for example: "s", "saw", "seed", "s3w96.7", and "s6#h%(>>>m n mQ".

According to Russ Cox, the POSIX specification requires ambiguous subexpressions to be handled in a way different from Perl's. The committee replaced Perl's rules with ones that are simple to explain, but the new "simple" rules are actually more complex to implement: they were incompatible with pre-existing tooling and made it essentially impossible to define a "lazy match" (see below) extension.
As a result, very few programs actually implement the POSIX subexpression rules (even when they implement other parts of the POSIX syntax).<ref>{{cite web |title=Regular Expression Matching: the Virtual Machine Approach |url=https://swtch.com/~rsc/regexp/regexp2.html |author=Russ Cox |year=2009 |website=swtch.com |quote=Digression: POSIX Submatching}}</ref>

====Metacharacters in POSIX extended====
The meaning of metacharacters [[escape character|escaped]] with a backslash is reversed for some characters in the POSIX Extended Regular Expression ('''ERE''') syntax. With this syntax, a backslash causes the metacharacter to be treated as a literal character. So, for example, <code>\( \)</code> is now <code>( )</code> and <code>\{ \}</code> is now <code>{ }</code>. Additionally, support is removed for <code>\''n''</code> backreferences and the following metacharacters are added:

{| class="wikitable"
|-
! Metacharacter
! Description
|- valign="top"
! <code>?</code>
| Matches the preceding element zero times or once. For example, <code>ab?c</code> matches only "ac" or "abc".
|-
! <code>+</code>
| Matches the preceding element one or more times. For example, <code>ab+c</code> matches "abc", "abbc", "abbbc", and so on, but not "ac".
|-
! <code><nowiki>|</nowiki></code>
| The choice (also known as alternation or set union) operator matches either the expression before or the expression after the operator. For example, <code><nowiki>abc|def</nowiki></code> matches "abc" or "def".
|}

'''Examples:'''
* <code>[hc]?at</code> matches "at", "hat", and "cat".
* <code>[hc]*at</code> matches "at", "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on.
* <code>[hc]+at</code> matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on, but not "at".
* <code>cat|dog</code> matches "cat" or "dog".

POSIX Extended Regular Expressions can often be used with modern Unix utilities by including the [[command line]] flag <var>-E</var>.
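
The ERE examples above can be checked mechanically. The following is a minimal sketch in Python, under the assumption that its <code>re</code> module is an acceptable stand-in: <code>re</code> is a Perl-style engine rather than a POSIX ERE implementation, but for these particular patterns (<code>?</code>, <code>*</code>, <code>+</code>, and <code>|</code> with unescaped operators) the two syntaxes coincide. The names <code>CASES</code> and <code>matches_whole</code> are illustrative only.

```python
import re

# Patterns and sample strings copied from the ERE examples above.
# Assumption: Python's `re` (Perl-style) treats these patterns the same
# way a POSIX ERE engine would; this does not hold for every pattern.
CASES = {
    r"[hc]?at": ["at", "hat", "cat"],
    r"[hc]*at": ["at", "hat", "cat", "hhat", "chat", "hcat", "cchchat"],
    r"[hc]+at": ["hat", "cat", "hhat", "chat", "hcat", "cchchat"],
    r"cat|dog": ["cat", "dog"],
}


def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


# For every pattern, record whether each listed string matches in full.
results = {
    pattern: [matches_whole(pattern, s) for s in strings]
    for pattern, strings in CASES.items()
}

# `+` requires at least one occurrence, so bare "at" must be rejected.
plus_rejects_bare_at = not matches_whole(r"[hc]+at", "at")
```

<code>re.fullmatch</code> is used instead of <code>re.search</code> so that each string must be matched in its entirety, which mirrors how the examples list whole strings rather than substrings.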
====Character classes====
The character class is the most basic regex concept after a literal match. It makes one small sequence of characters match a larger set of characters. For example, <syntaxhighlight lang="ragel" inline>[A-Z]</syntaxhighlight> could stand for any uppercase letter in the English alphabet, and <syntaxhighlight lang="ragel" inline>\d</syntaxhighlight> could mean any digit. Character classes apply to both POSIX levels.

When specifying a range of characters, such as <syntaxhighlight lang="ragel" inline>[a-Z]</syntaxhighlight> (i.e. lowercase ''<syntaxhighlight lang="ragel" inline>a</syntaxhighlight>'' to uppercase ''<syntaxhighlight lang="ragel" inline>Z</syntaxhighlight>''), the computer's locale settings determine the contents by the numeric ordering of the character encoding. They could store digits in that sequence, or the ordering could be ''abc...zABC...Z'', or ''aAbBcC...zZ''. So the POSIX standard defines a character class, which will be known by the regex processor installed. Those definitions are in the following table:

{| class="wikitable sortable"
|-
! Description
! POSIX !! Perl/Tcl !! Vim !! Java !! ASCII
|-
| ASCII characters
|
|
|
| <syntaxhighlight lang="ragel" inline>\p{ASCII}</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[\x00-\x7F]</syntaxhighlight>
|-
| Alphanumeric characters
| <syntaxhighlight lang="ragel" inline>[:alnum:]</syntaxhighlight>
|
|
| <syntaxhighlight lang="ragel" inline>\p{Alnum}</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[A-Za-z0-9]</syntaxhighlight>
|-
| Alphanumeric characters plus "_"
|
| <syntaxhighlight lang="ragel" inline>\w</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\w</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\w</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[A-Za-z0-9_]</syntaxhighlight>
|-
| Non-word characters
|
| <syntaxhighlight lang="ragel" inline>\W</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\W</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\W</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[^A-Za-z0-9_]</syntaxhighlight>
|-
| Alphabetic characters
| <syntaxhighlight lang="ragel" inline>[:alpha:]</syntaxhighlight>
|
| <syntaxhighlight lang="ragel" inline>\a</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\p{Alpha}</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[A-Za-z]</syntaxhighlight>
|-
| Space and tab
| <syntaxhighlight lang="ragel" inline>[:blank:]</syntaxhighlight>
|
| <syntaxhighlight lang="ragel" inline>\s</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\p{Blank}</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[ \t]</syntaxhighlight>
|-
| Word boundaries
|
| <syntaxhighlight lang="ragel" inline>\b</syntaxhighlight>
| <code>\< \></code>
| <syntaxhighlight lang="ragel" inline>\b</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>(?<=\W)(?=\w)|(?<=\w)(?=\W)</syntaxhighlight>
|-
| Non-word boundaries
|
|
|
| <syntaxhighlight lang="ragel" inline>\B</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>(?<=\W)(?=\W)|(?<=\w)(?=\w)</syntaxhighlight>
|-
| [[Control character]]s
| <syntaxhighlight lang="ragel" inline>[:cntrl:]</syntaxhighlight>
|
|
| <syntaxhighlight lang="ragel" inline>\p{Cntrl}</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[\x00-\x1F\x7F]</syntaxhighlight>
|-
| Digits
| <syntaxhighlight lang="ragel" inline>[:digit:]</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\d</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\d</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\p{Digit}</syntaxhighlight> or <syntaxhighlight lang="ragel" inline>\d</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[0-9]</syntaxhighlight>
|-
| Non-digits
|
| <syntaxhighlight lang="ragel" inline>\D</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\D</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\D</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[^0-9]</syntaxhighlight>
|-
| Visible characters
| <syntaxhighlight lang="ragel" inline>[:graph:]</syntaxhighlight>
|
|
| <syntaxhighlight lang="ragel" inline>\p{Graph}</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[\x21-\x7E]</syntaxhighlight>
|-
| Lowercase letters
| <syntaxhighlight lang="ragel" inline>[:lower:]</syntaxhighlight>
|
| <syntaxhighlight lang="ragel" inline>\l</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\p{Lower}</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[a-z]</syntaxhighlight>
|-
| Visible characters and the space character
| <syntaxhighlight lang="ragel" inline>[:print:]</syntaxhighlight>
|
| <syntaxhighlight lang="ragel" inline>\p</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\p{Print}</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[\x20-\x7E]</syntaxhighlight>
|-
| Punctuation characters
| <syntaxhighlight lang="ragel" inline>[:punct:]</syntaxhighlight>
|
|
| <syntaxhighlight lang="ragel" inline>\p{Punct}</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]</syntaxhighlight>
|-
| [[Whitespace character]]s
| <syntaxhighlight lang="ragel" inline>[:space:]</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\s</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\_s</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\p{Space}</syntaxhighlight> or <syntaxhighlight lang="ragel" inline>\s</syntaxhighlight>
| <code>[ [[\t]][[\r]][[\n]][[\v]][[\f]]]</code>
|-
| Non-whitespace characters
|
| <syntaxhighlight lang="ragel" inline>\S</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\S</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\S</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[^ \t\r\n\v\f]</syntaxhighlight>
|-
| Uppercase letters
| <syntaxhighlight lang="ragel" inline>[:upper:]</syntaxhighlight>
|
| <syntaxhighlight lang="ragel" inline>\u</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\p{Upper}</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[A-Z]</syntaxhighlight>
|-
| Hexadecimal digits
| <syntaxhighlight lang="ragel" inline>[:xdigit:]</syntaxhighlight>
|
| <syntaxhighlight lang="ragel" inline>\x</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>\p{XDigit}</syntaxhighlight>
| <syntaxhighlight lang="ragel" inline>[A-Fa-f0-9]</syntaxhighlight>
|}

POSIX character classes can only be used within bracket expressions. For example, <syntaxhighlight lang="ragel" inline>[[:upper:]ab]</syntaxhighlight> matches the uppercase letters and lowercase "a" and "b".

An additional non-POSIX class understood by some tools is <syntaxhighlight lang="ragel" inline>[:word:]</syntaxhighlight>, which is usually defined as <syntaxhighlight lang="ragel" inline>[:alnum:]</syntaxhighlight> plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers.
The editor [[Vim (text editor)|Vim]] further distinguishes ''word'' and ''word-head'' classes (using the notation <syntaxhighlight lang="ragel" inline>\w</syntaxhighlight> and <syntaxhighlight lang="ragel" inline>\h</syntaxhighlight>) since in many programming languages the characters that can begin an identifier are not the same as those that can occur in other positions: numbers are generally excluded, so an identifier would look like <syntaxhighlight lang="ragel" inline>\h\w*</syntaxhighlight> or <syntaxhighlight lang="ragel" inline>[[:alpha:]_][[:alnum:]_]*</syntaxhighlight> in POSIX notation.

Note that what the POSIX regex standards call ''character classes'' are commonly referred to as ''POSIX character classes'' in other regex flavors which support them. With most other regex flavors, the term ''character class'' is used to describe what POSIX calls ''bracket expressions''.
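
Engines that lack POSIX class syntax can approximate it by substituting the ASCII column of the table above. The sketch below does this in Python, whose <code>re</code> module does not interpret <code>[:alpha:]</code>-style names. The helper <code>expand_posix_classes</code> and the mapping <code>ASCII_CLASSES</code> are hypothetical names introduced for illustration. The sketch assumes an ASCII-only, locale-independent reading of each class, which is narrower than what a POSIX engine provides, and it omits <code>[:punct:]</code> to avoid bracket-escaping details.

```python
import re

# ASCII-only contents of each class, copied from the ASCII column of the
# table (outer brackets removed). Assumption: locale is ignored entirely.
ASCII_CLASSES = {
    "alnum": r"A-Za-z0-9",
    "alpha": r"A-Za-z",
    "blank": r" \t",
    "cntrl": r"\x00-\x1F\x7F",
    "digit": r"0-9",
    "graph": r"\x21-\x7E",
    "lower": r"a-z",
    "print": r"\x20-\x7E",
    "space": r" \t\r\n\v\f",
    "upper": r"A-Z",
    "xdigit": r"A-Fa-f0-9",
}


def expand_posix_classes(pattern: str) -> str:
    """Replace each [:name:] token with its ASCII range (hypothetical helper).

    This is plain text substitution: it does not check that the token
    actually sits inside a bracket expression, as POSIX requires.
    """
    for name, contents in ASCII_CLASSES.items():
        pattern = pattern.replace(f"[:{name}:]", contents)
    return pattern


# The identifier pattern and the bracket-expression example from the text.
identifier = expand_posix_classes("[[:alpha:]_][[:alnum:]_]*")
upper_or_ab = expand_posix_classes("[[:upper:]ab]")
```

With these definitions, <code>re.fullmatch(identifier, "foo_1")</code> succeeds while <code>re.fullmatch(identifier, "1foo")</code> does not, since a digit cannot begin the match.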