Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular Expression performing poorly in Ruby compared to JavaScript

Recently, I started using the email validation regular expression from the JQuery validation plugin in my Rails models.

EMAIL_REGEXP=/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))$/i

"[email protected]".match EMAIL_REGEXP  # returns immidiately
"aaaaa.bbbbbb.ccccc.ddddd@gmail".match EMAIL_REGEXP  # takes a long time

The regular expression takes a long time when an invalid email has many dot separated tokens (E.g: first.middle.last@gmail). Same expression works without any noticable delay in JavaScript.

Why is there such a difference in performance between Ruby and JavaScript regular expression parsers? Is there anything I can do to improve the response time?

I am on Ruby 1.8.7. I don't see the same issue on Ruby 1.9.2.

Note

I know the reg-exp is long. Since it is used by jQuery, I thought of using it. I can always change it back to a simpler regexp as shown here. My question is mostly about finding out the reason why the same regular expression is much faster in JS.

Reference:

JQuery Validation Plugin Source

Sample form with jQuery email validation

like image 910
Harish Shetty Avatar asked Apr 08 '26 09:04

Harish Shetty


1 Answers

Don't know why regex parser from 1.8.7 is so much slower than the one from JS or Oniguruma from 1.9.2, but may be this particular regex can benefit from wrapping its prefix including @ symbol with atomic group like that:

EMAIL_REGEXP = /
  ^
  (?>(( # atomic group start
    ([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+
    (\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*
   )
   |
   (
     (\x22)
     (
       (((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
       (
         ([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
         |
         (\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))
       )
     )*
     (((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
     (\x22)
   )
  )
  @) # atomic group end
  (
    (
      ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      |
      (
        ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
        ([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
        ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      )
    )
    \.
  )+
  (
    ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
    |
    (
      ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      ([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
      ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
    )
  )
  $
  /xi

puts "[email protected]".match EMAIL_REGEXP  # returns immediately
puts "aaaaa.bbbbbb.ccccc.ddddd@gmail".match EMAIL_REGEXP  # takes a long time

Atomic group in that case should prevent parser from returning to the first part of the string when matching the part following @ symbol failed. And it gives significant speed-up. Though, I'm not 100% sure that it doesn't break regexp logic, so I'd appreciate any comments.

Another thing is using non-capturing groups that should be faster in general when you don't need to backreference for groups, but they don't give any noticeable improvement in this case.

like image 150
KL-7 Avatar answered Apr 10 '26 23:04

KL-7



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!