Standards: Regular Expression Benchmarks

Most modern programming languages have regular expression engines. Here I'll compare the speed of each regular expression engines with long text matching which has backtracking. I used Ruby, Perl and Python for this examination.

A time-consuming regular expression example

/(aa)*aa/ =~ 'aa' needs backtracking. The longer the length is, the longer it would take. Let m as the length of the characters and n as the number of repeat. For example the case when n = 2 and m = 3 is

/(aa)*aa(bb)*bb(cc)*cc/ =~ 'aabbcc'

Codes

Ruby:

def time
  t = Time.now
  yield
  Time.now - t
end

def ryutorion(m, n)
  chars = (0...n).map {|i| ('a'.ord + i % 26).chr }
  regex = Regexp.new chars.map {|c| "(#{c*m})*#{c*m}" }.join
  str = chars.map {|c| c*m }.join
  time { regex =~ str }
end

Python:

import re
import time

def timeof(f):
    t = time.time()
    f()
    return time.time() - t

def ryutorion(m, n):
    chars = [ chr(ord('a') + i % 26) for i in range(n) ]
    regex = re.compile(''.join([ '(' + c*m + ')*' + c*m for c in chars ]))
    str = ''.join([ c*m for c in chars ])
    return timeof(lambda: regex.match(str))

Perl:

use warnings;
use strict;
use Time::HiRes qw(gettimeofday tv_interval);

sub gen_regex {
    my ($m, $n) = @_;

    join "", map { 
        my $s = chr(ord('a') + ($_ % 26)) x $m;
        "($s)*$s"
    } (0 .. $n - 1);
}

sub gen_input {
    my ($m, $n) = @_;

    join "", map { chr(ord('a') + ($_ % 26)) x $m } (0 .. $n - 1);
}

my $r = gen_regex 1, 1;
my $i = gen_input 1, 1;
$r = qr/$r/;
my $ts = [gettimeofday];
my $te = [gettimeofday] if($i =~ /$r/);
my $elapsed = tv_interval $ts, $te;
print "$elapsed\n";

for my $N (10, 20) {
    for my $M (10000, 20000) {
        $r = gen_regex $M, $N;
        $i = gen_input $M, $N;
        $r = qr/$r/;
        $ts = [gettimeofday];
        $te = [gettimeofday] if($i =~ /$r/);
        $elapsed = tv_interval $ts, $te;
        print "$elapsed\n";
    }
}

This perl example was written by Masahiro Kimoto.

Result

shorter test: n = {10000, 20000} = m = {10, 20}

Ruby

p ryutorion(10000, 10) #=> 0.000278
p ryutorion(20000, 10) #=> 0.000698
p ryutorion(10000, 20) #=> 0.000538
p ryutorion(20000, 20) #=> 0.001026

Python

print ryutorion(10000, 10) #=> 0.00126099586487
print ryutorion(20000, 10) #=> 0.00168895721436
print ryutorion(10000, 20) #=> 0.00180792808533
print ryutorion(20000, 20) #=> 0.00338697433472

Perl

($M = 10000, $N = 10) #=> 0.000066
($M = 20000, $N = 10) #=> 0.000185
($M = 10000, $N = 20) #=> 0.000163
($M = 20000, $N = 20) #=> 0.000294

longer test: n = {10000, 20000} = m = {100, 200}

Ruby (ruby 1.9.2)
```
0.002721
0.00672
0.005441
0.013364
```

Python (python 2.7)

Traceback (most recent call last):
  File "a.py", line 16, in <module>
    print ryutorion(10000, 100)
  File "a.py", line 11, in ryutorion
    regex = re.compile(''.join([ '(' + c*m + ')*' + c*m for c in chars ]))
  File "/usr/local/Cellar/python/2.7/lib/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/local/Cellar/python/2.7/lib/python2.7/re.py", line 243, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/Cellar/python/2.7/lib/python2.7/sre_compile.py", line 511, in compile
    "sorry, but this version only supports 100 named groups"
AssertionError: sorry, but this version only supports 100 named groups

Perl (perl 5.10.0)
```
0.000981
0.00269
0.004181
0.005484
```

Summary

Perl is the strongest. Use Perl.

Standards

Monday, August 16, 2010

Regular Expression Benchmarks

A time-consuming regular expression example

Codes

Result

Summary

No comments:

Post a Comment

About Me

Blog Archive

Followers