Most modern programming languages have regular expression engines. Here I'll compare the speed of each regular expression engines with long text matching which has backtracking. I used Ruby, Perl and Python for this examination.
A time-consuming regular expression example
/(aa)*aa/ =~ 'aa'
needs backtracking. The longer the length is, the longer it would take. Let m
as the length of the characters and n
as the number of repeat. For example the case when n = 2
and m = 3
is
/(aa)*aa(bb)*bb(cc)*cc/ =~ 'aabbcc'
Codes
Ruby:
def time
t = Time.now
yield
Time.now - t
end
def ryutorion(m, n)
chars = (0...n).map {|i| ('a'.ord + i % 26).chr }
regex = Regexp.new chars.map {|c| "(#{c*m})*#{c*m}" }.join
str = chars.map {|c| c*m }.join
time { regex =~ str }
end
Python:
import re
import time
def timeof(f):
t = time.time()
f()
return time.time() - t
def ryutorion(m, n):
chars = [ chr(ord('a') + i % 26) for i in range(n) ]
regex = re.compile(''.join([ '(' + c*m + ')*' + c*m for c in chars ]))
str = ''.join([ c*m for c in chars ])
return timeof(lambda: regex.match(str))
Perl:
use warnings;
use strict;
use Time::HiRes qw(gettimeofday tv_interval);
sub gen_regex {
my ($m, $n) = @_;
join "", map {
my $s = chr(ord('a') + ($_ % 26)) x $m;
"($s)*$s"
} (0 .. $n - 1);
}
sub gen_input {
my ($m, $n) = @_;
join "", map { chr(ord('a') + ($_ % 26)) x $m } (0 .. $n - 1);
}
my $r = gen_regex 1, 1;
my $i = gen_input 1, 1;
$r = qr/$r/;
my $ts = [gettimeofday];
my $te = [gettimeofday] if($i =~ /$r/);
my $elapsed = tv_interval $ts, $te;
print "$elapsed\n";
for my $N (10, 20) {
for my $M (10000, 20000) {
$r = gen_regex $M, $N;
$i = gen_input $M, $N;
$r = qr/$r/;
$ts = [gettimeofday];
$te = [gettimeofday] if($i =~ /$r/);
$elapsed = tv_interval $ts, $te;
print "$elapsed\n";
}
}
This perl example was written by Masahiro Kimoto.
Result
shorter test: n = {10000, 20000}
= m = {10, 20}
Ruby
p ryutorion(10000, 10) #=> 0.000278 p ryutorion(20000, 10) #=> 0.000698 p ryutorion(10000, 20) #=> 0.000538 p ryutorion(20000, 20) #=> 0.001026
Python
print ryutorion(10000, 10) #=> 0.00126099586487 print ryutorion(20000, 10) #=> 0.00168895721436 print ryutorion(10000, 20) #=> 0.00180792808533 print ryutorion(20000, 20) #=> 0.00338697433472
Perl
($M = 10000, $N = 10) #=> 0.000066 ($M = 20000, $N = 10) #=> 0.000185 ($M = 10000, $N = 20) #=> 0.000163 ($M = 20000, $N = 20) #=> 0.000294
longer test: n = {10000, 20000}
= m = {100, 200}
Ruby (ruby 1.9.2)
0.002721 0.00672 0.005441 0.013364
Python (python 2.7)
Traceback (most recent call last): File "a.py", line 16, in <module> print ryutorion(10000, 100) File "a.py", line 11, in ryutorion regex = re.compile(''.join([ '(' + c*m + ')*' + c*m for c in chars ])) File "/usr/local/Cellar/python/2.7/lib/python2.7/re.py", line 190, in compile return _compile(pattern, flags) File "/usr/local/Cellar/python/2.7/lib/python2.7/re.py", line 243, in _compile p = sre_compile.compile(pattern, flags) File "/usr/local/Cellar/python/2.7/lib/python2.7/sre_compile.py", line 511, in compile "sorry, but this version only supports 100 named groups" AssertionError: sorry, but this version only supports 100 named groups
Perl (perl 5.10.0)
0.000981 0.00269 0.004181 0.005484
Summary
Perl is the strongest. Use Perl.