Which regex works best?

I want to map the URLs to a common prefix (/ files), but under it there are only two specific directories - images and videos. For this, I can introduce two regular expressions:

/files/(images/.*|videos/.*)

and

/files/(image|video)s/.*

I have two questions:

  • Which one is better in terms of performance? My guess is the second, because its DFA will have fewer states.
  • Is there a general-purpose programming language, the built-in regular expression compiler will reduce this regular expression to the minimum DFA?

Performance matters to me, as I will use it to match billions of lines. Thus, any slight improvement matters to me.

+3
source share
3 answers

Python 2.7 :

import timeit
once = 'import re; m="/files/images/test"'
num = 1000000
print timeit.timeit(stmt='re.findall(r"/files/(images/.*|videos/.*)", m)', setup=once, number=num)
-> 1.5884420871734619
print timeit.timeit(stmt='re.findall(r"/files/(image|video)s/.*", m)', setup=once, number=num)
-> 1.5990869998931885

1 .

Python ...

  • ///
  • ///
  • //viddeos/

(/files/(images/.*|videos/.*)) (0,1 )

0

? - , DFA .

DFA, DFA "" ( ).

DFA, , .

- , , . , , - .

, DFA?

Go (re2), Haskell , DFA. , DFA .

POSIX ERE ( ), POSIX ERE, DFA NFA. , ​​, ( ), , , NFA.

, , GNU egrep , POSIX ERE DFA ( dfa.c).


3 :

  • DFA
  • NFA
  • NFA

( )

  • () NFA ( ).
  • (, Java, Python, Perl 2,...)
    2: Perl memoization.
  • , Thompson NFA O (mn), m - , n - .
  • (ir) backreckference - . .
  • ( ) , Thompson NFA.

, re2 ( ) DFA NFA.

+2

( , ) , . :

  • .*?
  • ?
  • ?

, :

^/files/(?:image|video)s/
0

All Articles