Before I start: I know there are better ways to do this than regex (like tokenizers), but that's not what this question is about. I'm already stuck using regex, and it already works the way I need it to except for one special case, which is what I need advice on.
I need to scan through some JavaScript-like code and insert the new keyword in front of every object declaration. I already know the names of all objects that will need this keyword, and I know that none of them will have that keyword in the code before I start (so I don't need to deal with repeated new words or guess whether something is an object or not). For example, a typical line could look like this:
foo = Bar()
Where I would already know that Bar is a 'class' and would need 'new' for object declaration. The following regex does the trick:
for classname in allowed_classes:
    line = re.sub(r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)\b(%s\s*\()' % classname, r'\1new \3', line)
It works like a charm, even making sure not to touch classname when it's inside a string. (The first portion of the regex makes sure there is an even number of quotes beforehand. It's a bit naive in that it will break with nested quotes, but I don't need to handle that case.) The problem is that class names can also have $ in them. So the following line is allowed as well if $Bar exists in allowed_classes:
foo = $Bar()
The above regex will ignore it, due to the dollar sign. I figured escaping it would do the trick, but this logic seems to have no effect on the above line even if $Bar is one of the classes:
for classname in allowed_classes:
    line = re.sub(r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)\b(%s\s*\()' % re.escape(classname), r'\1new \3', line)
I also tried escaping it by hand using \ but it has no effect either. Can someone explain why converting $ to \$ isn't working and what could fix it?
Thanks
The reason your current regex isn't working is that you have a \b just before your class name.  \b will match word boundaries, so only between word characters and non-word characters.  For the string foo = Bar(), the \b will match between the space and the B, but for foo = $Bar(), the \b cannot match between the space and the $ because they are both non-word characters.
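A quick way to see this behaviour in isolation is the sketch below. It uses simplified patterns rather than the full regex from the question, and the variable names are only for illustration:

```python
import re

# \b matches only between a word character ([A-Za-z0-9_]) and a non-word character.
plain = re.search(r'\bBar\(', 'foo = Bar()') is not None      # space|B: boundary
dollar = re.search(r'\b\$Bar\(', 'foo = $Bar()') is not None  # space|$: no boundary
glued = re.search(r'\b\$Bar\(', 'foo$Bar()') is not None      # o|$: boundary

print(plain, dollar, glued)  # True False True
```

The second search fails even though $ is escaped correctly, which is why escaping alone did not help.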
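To fix this, change \b to (?=\b|\B\$), and keep the re.escape() call from your second attempt so that $ in the class name is matched literally instead of as an end-of-line anchor. Here is the resulting regex:
for classname in allowed_classes:
    line = re.sub(r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)(?=\b|\B\$)(%s\s*\()' % re.escape(classname), r'\1new \3', line)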
By using a lookahead, you can handle both of the following cases:
If classname does not start with $, we want a word boundary before trying to match classname; the \b inside the lookahead handles this.
If classname does start with $, we want to match when the next character is a $. I used \B\$ so it will only match if the character before the $ is not a word character, but this is probably unnecessary since I can't think of any valid JS code where that would be the case.
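For reference, here is a minimal self-contained sketch of the lookahead approach. The function name add_new and the PREFIX constant are made up for this example; the quote-handling prefix is copied from the question, and re.escape is applied to each class name as in the question's second attempt:

```python
import re

# Quote-balancing prefix taken from the question (an even number of quotes
# must precede the match, so class names inside strings are left alone).
PREFIX = r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)'

def add_new(line, allowed_classes):
    """Insert 'new ' before a call to any class in allowed_classes (hypothetical helper)."""
    for classname in allowed_classes:
        # Lookahead: either a word boundary, or a '$' not preceded by a word character.
        pattern = PREFIX + r'(?=\b|\B\$)(%s\s*\()' % re.escape(classname)
        line = re.sub(pattern, r'\1new \3', line)
    return line

print(add_new('foo = Bar()', ['Bar']))     # foo = new Bar()
print(add_new('foo = $Bar()', ['$Bar']))   # foo = new $Bar()
print(add_new('x = "Bar()"', ['Bar']))     # x = "Bar()"
```

One caveat worth checking against your data: if allowed_classes contains both Bar and $Bar, the \b branch will also match the Bar inside $Bar( (there is a word boundary between $ and B), so that combination would need extra handling.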