Editing Regular expression (section)

==Formal language theory==
Regular expressions describe [[regular language]]s in [[formal language|formal language theory]]. They have the same expressive power as [[regular grammar]]s. But the language of regular expressions itself, is [[context-free language]].

===Formal definition===
Regular expressions consist of constants, which denote sets of strings, and operator symbols, which denote operations over these sets. The following definition is standard, and found as such in most textbooks on formal language theory.<ref name="HopcroftMotwaniUllman01">{{harvtxt|Hopcroft|Motwani|Ullman|2000}}</ref><ref>{{harvtxt|Sipser|1998}}</ref> Given a finite [[alphabet (computer science)|alphabet]] Σ, the following constants are defined
as regular expressions:
* (''empty set'') ∅ denoting the set ∅.
* (''[[empty string]]'') ε denoting the set containing only the "empty" string, which has no characters at all.
* (''[[string literal|literal character]]'') <code>a</code> in Σ denoting the set containing only the character ''a''.

Given regular expressions R and S, the following operations over them are defined
to produce regular expressions:
* (''[[concatenation]]'') <code>(RS)</code> denotes the set of strings that can be obtained by concatenating a string accepted by R and a string accepted by S (in that order). For example, let R denote {"ab", "c"} and S denote {"d", "ef"}. Then, <code>(RS)</code> denotes {"abd", "abef", "cd", "cef"}.
* (''[[alternation (formal language theory)|alternation]]'') <code>(R|S)</code> denotes the [[set union]] of sets described by R and S. For example, if R describes {"ab", "c"} and S describes {"ab", "d", "ef"}, expression <code>(R|S)</code> describes {"ab", "c", "d", "ef"}.
* (''[[Kleene star]]'') <code>(R*)</code> denotes the smallest [[subset|superset]] of the set described by ''R'' that contains ε and is [[closure (mathematics)|closed]] under string concatenation. This is the set of all strings that can be made by concatenating any finite number (including zero) of strings from the set described by R. For example, if R denotes {"0", "1"}, <code>(R*)</code> denotes the set of all finite [[binary string]]s (including the empty string). If R denotes {"ab", "c"}, <code>(R*)</code> denotes {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", "abcab", ...}.

To avoid parentheses, it is assumed that the Kleene star has the highest priority followed by concatenation, then alternation. If there is no ambiguity, then parentheses may be omitted. For example, <code>(ab)c</code> can be written as <code>abc</code>, and <code>a|(b(c*))</code> can be written as <code>a|bc*</code>. Many textbooks use the symbols ∪, +, or ∨ for alternation instead of the vertical bar.

'''Examples:'''
* <code>a|b*</code> denotes {ε, "a", "b", "bb", "bbb", ...}
* <code>(a|b)*</code> denotes the set of all strings with no symbols other than "a" and "b", including the empty string: {ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", ...}
* <code>ab*(c|ε)</code> denotes the set of strings starting with "a", then zero or more "b"s and finally optionally a "c": {"a", "ac", "ab", "abc", "abb", "abbc", ...}
* <code>(0|(1(01*0)*1))*</code> denotes the set of binary numbers that are multiples of 3: { ε, "0", "00", "11", "000", "011", "110", "0000", "0011", "0110", "1001", "1100", "1111", "00000", ...}

===Expressive power and compactness===
The formal definition of regular expressions is minimal on purpose, and avoids defining <code>?</code> and <code>+</code>—these can be expressed as follows: <code>a+</code>=<code>aa*</code>, and <code>a?</code>=<code>(a|ε)</code>. Sometimes the [[set complement|complement]] operator is added, to give a ''generalized regular expression''; here ''R<sup>c</sup>'' matches all strings over Σ* that do not match ''R''. In principle, the complement operator is redundant, because it does not grant any more expressive power. However, it can make a regular expression much more concise—eliminating a single complement operator can cause a [[double exponential function|double exponential]] blow-up of its length.<ref>{{harvtxt|Gelade|Neven|2008|p=332|loc=Thm.4.1}}</ref><ref>{{harvtxt|Gruber|Holzer|2008}}</ref><ref>Based on {{harvtxt|Gelade|Neven|2008}}, a regular expression of length about 850 such that its complement has a length about 2<sup>32</sup> can be found at [[:File:RegexComplementBlowup.png]].</ref>

Regular expressions in this sense can express the regular languages, exactly the class of languages accepted by [[deterministic finite automata]]. There is, however, a significant difference in compactness. Some classes of regular languages can only be described by deterministic finite automata whose size grows [[exponential growth|exponentially]] in the size of the shortest equivalent regular expressions. The standard example here is the languages
''L<sub>k</sub>'' consisting of all strings over the alphabet {''a'',''b''} whose ''k''th-from-last letter equals&nbsp;''a''. On the one hand, a regular expression describing ''L''<sub>4</sub> is given by
<math>(a\mid b)^*a(a\mid b)(a\mid b)(a\mid b)</math>.

Generalizing this pattern to ''L<sub>k</sub>'' gives the expression:
: <math>(a\mid b)^*a\underbrace{(a\mid b)(a\mid b)\cdots(a\mid b)}_{k-1\text{ times}}. \, </math>

On the other hand, it is known that every deterministic finite automaton accepting the language ''L<sub>k</sub>'' must have at least 2<sup>''k''</sup> states. Luckily, there is a simple mapping from regular expressions to the more general [[nondeterministic finite automata]] (NFAs) that does not lead to such a blowup in size; for this reason NFAs are often used as alternative representations of regular languages. NFAs are a simple variation of the type-3 [[formal grammar|grammars]] of the [[Chomsky hierarchy]].<ref name="HopcroftMotwaniUllman01"/>

In the opposite direction, there are many languages easily described by a DFA that are not easily described by a regular expression. For instance, determining the validity of a given [[ISBN]] requires computing the modulus of the integer base 11, and can be easily implemented with an 11-state DFA. However, converting it to a regular expression results in a 2,14 megabytes file .<ref>{{cite web |title=Regular expressions for deciding divisibility |url=https://s3.boskent.com/divisibility-regex/divisibility-regex.html |access-date=2024-02-21 |website=s3.boskent.com}}</ref>

Given a regular expression, [[Thompson's construction algorithm]] computes an equivalent nondeterministic finite automaton. A conversion in the opposite direction is achieved by [[Kleene's algorithm]].

Finally, many real-world "regular expression" engines implement features that cannot be described by the regular expressions in the sense of formal language theory; rather, they implement ''regexes''. See [[#Patterns for non-regular languages|below]] for more on this.

===Deciding equivalence of regular expressions===
As seen in many of the examples above, there is more than one way to construct a regular expression to achieve the same results.

It is possible to write an [[algorithm]] that, for two given regular expressions, decides whether the described languages are equal; the algorithm reduces each expression to a [[minimal deterministic finite state machine]], and determines whether they are [[isomorphism|isomorphic]] (equivalent).

Algebraic laws for regular expressions can be obtained using a method by Gischer which is best explained along an example: In order to check whether (''X''+''Y'')<sup>&lowast;</sup> and (''X''<sup>&lowast;</sup> ''Y''<sup>&lowast;</sup>)<sup>&lowast;</sup> denote the same regular language, for all regular expressions ''X'', ''Y'', it is necessary and sufficient to check whether the particular regular expressions (''a''+''b'')<sup>&lowast;</sup> and (''a''<sup>&lowast;</sup> ''b''<sup>&lowast;</sup>)<sup>&lowast;</sup> denote the same language over the alphabet Σ={''a'',''b''}. More generally, an equation ''E''=''F'' between regular-expression terms with variables holds if, and only if, its instantiation with different variables replaced by different symbol constants holds.<ref>{{cite report |last=Gischer |first=Jay L. |institution=Stanford Univ., Dept. of Comp. Sc. |title=(Title unknown) |type=Technical Report |number=STAN-CS-TR-84-1033 |date=1984}}{{Title missing|date=February 2023}}</ref><ref>{{cite book |isbn=978-0-201-44124-6 |last1=Hopcroft |first1=John E. |last2=Motwani |first2=Rajeev |last3=Ullman |first3=Jeffrey D. |name-list-style=amp |title=Introduction to Automata Theory, Languages, and Computation |location=Upper Saddle River, New Jersey |publisher=Addison Wesley |date=2003 |pages=117–120 |quote=This property need not hold for extended regular expressions, even if they describe no larger class than regular languages; cf. p.121.}}</ref>

Every regular expression can be written solely in terms of the [[Kleene star]] and [[union (set theory)|set unions]] over finite words. This is a surprisingly difficult problem. As simple as the regular expressions are, there is no method to systematically rewrite them to some normal form. The lack of axiom in the past led to the [[star height problem]]. In 1991, [[Dexter Kozen]] axiomatized regular expressions as a [[Kleene algebra#History|Kleene algebra]], using equational and [[Horn clause]] axioms.<ref>{{harvtxt|Kozen|1991}}{{page needed|date=February 2015}}</ref>
Already in 1964, Redko had proved that no finite set of purely equational axioms can characterize the algebra of regular languages.<ref>{{cite journal |last=Redko |first=V.N. |url=http://umj.imath.kiev.ua/article/?article=10002 |title=On defining relations for the algebra of regular events |journal=Ukrainskii Matematicheskii Zhurnal |date=1964 |volume=16 |number=1 |pages=120–126 |access-date=2018-03-28 |archive-date=2018-03-29 |archive-url=https://web.archive.org/web/20180329121016/http://umj.imath.kiev.ua/article/?article=10002 |url-status=live |language=Russian}}</ref>