Regular Expression performing poorly in Ruby compared to JavaScript

Question

Recently, I started using the email validation regular expression from the JQuery validation plugin in my Rails models.

EMAIL_REGEXP=/^((([a-z]|\d|[!#\$%&'\*\+\-/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))$/i

"[email protected]".match EMAIL_REGEXP  # returns immidiately
"aaaaa.bbbbbb.ccccc.ddddd@gmail".match EMAIL_REGEXP  # takes a long time

The regular expression takes a long time when an invalid email has many dot separated tokens (E.g: first.middle.last@gmail). Same expression works without any noticable delay in JavaScript.

Why is there such a difference in performance between Ruby and JavaScript regular expression parsers? Is there anything I can do to improve the response time?

I am on Ruby 1.8.7. I don't see the same issue on Ruby 1.9.2.

Note

I know the reg-exp is long. Since it is used by jQuery, I thought of using it. I can always change it back to a simpler regexp as shown here. My question is mostly about finding out the reason why the same regular expression is much faster in JS.

Reference:

JQuery Validation Plugin Source

Sample form with jQuery email validation

KL-7 · Accepted Answer

Don't know why regex parser from 1.8.7 is so much slower than the one from JS or Oniguruma from 1.9.2, but may be this particular regex can benefit from wrapping its prefix including @ symbol with atomic group like that:

EMAIL_REGEXP = /
  ^
  (?>(( # atomic group start
    ([a-z]|\d|[!#\$%&'\*\+\-/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+
    (\.([a-z]|\d|[!#\$%&'\*\+\-/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*
   )
   |
   (
     (\x22)
     (
       (((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
       (
         ([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
         |
         (\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))
       )
     )*
     (((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
     (\x22)
   )
  )
  @) # atomic group end
  (
    (
      ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      |
      (
        ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
        ([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
        ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      )
    )
    \.
  )+
  (
    ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
    |
    (
      ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      ([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
      ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
    )
  )
  $
  /xi

puts "[email protected]".match EMAIL_REGEXP  # returns immediately
puts "aaaaa.bbbbbb.ccccc.ddddd@gmail".match EMAIL_REGEXP  # takes a long time

Atomic group in that case should prevent parser from returning to the first part of the string when matching the part following @ symbol failed. And it gives significant speed-up. Though, I'm not 100% sure that it doesn't break regexp logic, so I'd appreciate any comments.

Another thing is using non-capturing groups that should be faster in general when you don't need to backreference for groups, but they don't give any noticeable improvement in this case.

Regular Expression performing poorly in Ruby compared to JavaScript

Tags:

javascript

regex

ruby

ruby-on-rails

Harish Shetty

1 Answers

KL-7

Recent Activity

Donate For Us

Regular Expression performing poorly in Ruby compared to JavaScript

Tags:

javascript

regex

ruby

ruby-on-rails

Harish Shetty

1 Answers

KL-7

Related questions

Recent Activity

Donate For Us