==Implementations and running times <span class="anchor" id="Implementations"></span>==

There are at least three different [[algorithm]]s that decide whether and how a given regex matches a string.

The oldest and fastest relies on a result in formal language theory that allows every [[nondeterministic finite automaton]] (NFA) to be transformed into a [[deterministic finite automaton]] (DFA). The DFA can be constructed explicitly and then run on the input string one symbol at a time. Constructing the DFA for a regular expression of size ''m'' has a time and memory cost of [[Big O notation|''O'']](2<sup>''m''</sup>), but the resulting DFA can be run on a string of size ''n'' in time ''O''(''n''). Note that the size of the expression is the size after abbreviations, such as numeric quantifiers, have been expanded.

An alternative approach is to simulate the NFA directly, essentially building each DFA state on demand and then discarding it at the next step. This keeps the DFA implicit and avoids the exponential construction cost, but the running cost rises to ''O''(''mn''). The explicit approach is called the DFA algorithm and the implicit approach the NFA algorithm. Adding caching to the NFA algorithm is often called the "lazy DFA" algorithm, or just the DFA algorithm without making a distinction. These algorithms are fast, but using them for recalling grouped subexpressions, lazy quantification, and similar features is tricky.<ref>{{harvtxt|Cox|2007}}</ref><ref>{{harvtxt|Laurikari|2009}}</ref> Modern implementations include the re1-[[RE2 (software)|re2]]-sregex family based on Cox's code.

The third algorithm is to match the pattern against the input string by [[backtracking]]. This algorithm is commonly called NFA, but this terminology can be confusing. Its running time can be exponential, which simple implementations exhibit when matching against expressions like {{code|lang=perl|code=(a{{!}}aa)*b}} that contain both alternation and unbounded quantification and force the algorithm to consider an exponentially increasing number of sub-cases. This behavior can cause a security problem called [[Regular expression Denial of Service]] (ReDoS). Although backtracking implementations only give an exponential guarantee in the worst case, they provide much greater flexibility and expressive power. For example, any implementation which allows the use of backreferences, or implements the various extensions introduced by Perl, must include some kind of backtracking.
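The exponential worst case of a backtracking matcher can be observed directly. The following minimal sketch uses Python purely for illustration (its <code>re</code> module is a backtracking implementation); it times the pattern <code>(a|aa)*b</code> from the example above against strings of "a"s that cannot match, and the time to reject the input grows roughly exponentially with its length.

<syntaxhighlight lang="python">
# Timing the pattern (a|aa)*b against inputs that cannot match.
# Python's re module is a backtracking implementation, so each extra "a"
# roughly multiplies the number of sub-cases the matcher must reject.
import re
import timeit

pattern = re.compile(r'(a|aa)*b')

for n in range(18, 29, 2):
    subject = 'a' * n  # no final "b", so every decomposition of the "a"s fails
    seconds = timeit.timeit(lambda: pattern.match(subject), number=1)
    print(f'n = {n:2d}: {seconds:8.4f} s')
</syntaxhighlight>

With a backtracking engine, each additional pair of "a"s multiplies the running time by a roughly constant factor, whereas a DFA- or NFA-simulation-based engine rejects the same inputs in time linear in the input length.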
Some implementations try to provide the best of both algorithms by first running a fast DFA algorithm, and reverting to a potentially slower backtracking algorithm only when a backreference is encountered during the match. GNU grep (and the underlying gnulib DFA) uses such a strategy.<ref>{{cite web |title=gnulib/lib/dfa.c |url=https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/dfa.c |quote=If the scanner detects a transition on backref, it returns a kind of "semi-success" indicating that the match will have to be verified with a backtracking matcher. |access-date=2022-02-12 |archive-date=2021-08-18 |archive-url=https://web.archive.org/web/20210818191338/https://git.savannah.gnu.org/gitweb/?p=gnulib.git%3Ba%3Dblob%3Bf%3Dlib%2Fdfa.c}}</ref>

Sublinear runtime algorithms have been achieved using [[Boyer–Moore string-search algorithm|Boyer–Moore (BM) based algorithms]] and related DFA optimization techniques such as the reverse scan.<ref>{{cite arXiv |last=Kearns |first=Steven |eprint=1308.3822 |title=Sublinear Matching With Finite Automata Using Reverse Suffix Scanning |date=August 2013 |class=cs.DS}}</ref> GNU grep, which supports a wide variety of POSIX syntaxes and extensions, uses BM for first-pass prefiltering and then uses an implicit DFA. Wu's [[agrep]], which implements approximate matching, combines the prefiltering into the DFA in BDM (backward DAWG matching). NR-grep's BNDM extends the BDM technique with Shift-Or bit-level parallelism.<ref>{{cite journal |last1=Navarro |first1=Gonzalo |title=NR-grep: a fast and flexible pattern-matching tool |journal=Software: Practice and Experience |date=10 November 2001 |volume=31 |issue=13 |pages=1265–1312 |doi=10.1002/spe.411 |s2cid=3175806 |url=https://users.dcc.uchile.cl/~gnavarro/ps/spe01.pdf |access-date=21 November 2019 |archive-date=7 October 2020 |archive-url=https://web.archive.org/web/20201007183210/https://users.dcc.uchile.cl/~gnavarro/ps/spe01.pdf |url-status=live}}</ref>

A few theoretical alternatives to backtracking for backreferences exist, and their "exponents" are tamer in that they depend only on the number of backreferences, a quantity that is bounded in some regexp languages such as POSIX. One naive method that duplicates a non-backtracking NFA for each backreference has a complexity of {{tmath|{\mathrm O}(n^{2k+2})}} time and {{tmath|{\mathrm O}(n^{2k+1})}} space for a haystack of length ''n'' and ''k'' backreferences in the regexp.<ref>{{cite web |title=travisdowns/polyregex |website=[[GitHub]] |url=https://github.com/travisdowns/polyregex |date=5 July 2019 |access-date=21 November 2019 |archive-date=14 September 2020 |archive-url=https://web.archive.org/web/20200914170309/https://github.com/travisdowns/polyregex |url-status=live}}</ref> More recent theoretical work based on memory automata gives a tighter bound in terms of the number of "active" variable nodes used, and identifies classes of backreferenced regexps that can be matched in polynomial time.<ref>{{cite arXiv |last=Schmid |first=Markus L. |eprint=1903.05896 |title=Regular Expressions with Backreferences: Polynomial-Time Matching Techniques |date=March 2019 |class=cs.FL}}</ref>
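The literal prefiltering used as a first pass by tools such as GNU grep can be illustrated with a minimal sketch. The following Python fragment is not GNU grep's implementation; the function name <code>prefiltered_findall</code> and the hard-coded literal are assumptions made for the example. It locates a substring that every match must contain using a fast substring search and runs the full regex engine only on the lines around each candidate position.

<syntaxhighlight lang="python">
# Sketch of literal prefiltering: find a substring that every match must
# contain with a fast substring search, then verify candidates with the
# full regex engine. (Illustrative only; tools such as GNU grep derive the
# literal from the compiled pattern and verify with an implicit DFA.)
import re

def prefiltered_findall(text, regex, required_literal):
    pattern = re.compile(regex)
    matches = []
    pos = 0
    while True:
        hit = text.find(required_literal, pos)  # fast substring scan
        if hit == -1:
            break
        # Run the full regex only on the line containing the candidate hit.
        start = text.rfind('\n', 0, hit) + 1
        end = text.find('\n', hit)
        if end == -1:
            end = len(text)
        matches.extend(pattern.findall(text, start, end))
        pos = end + 1
    return matches

log = "error: disk full\nall good\nerror: timeout after 30s\n"
print(prefiltered_findall(log, r'error: \w+', 'error'))
# ['error: disk', 'error: timeout']
</syntaxhighlight>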