I'm writing a regular expression in perl to match perl code that starts the definition of a perl subroutine. Here's my regular expression:
my $regex = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';
$regex matches code that starts a subroutine. I'm also trying to capture the name of the subroutine in $1 and any white space and comments between the subroutine name and the initial open brace in $2. It's $2 that is giving me a problem.
Consider the following perl code:
my $x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
$x = 2;
return;
}
When I put this perl code into a string and match it against $regex, $2 is "# This is comment 3.\n", not the three lines of comments that I want. I thought the regular expression would greedily put all three lines of comments into $2, but that seems not to be the case.
I would like to understand why $regex isn't working and to design a simple replacement. As the program below shows, I have a more complex replacement ($re3) that works. But I think it's important for me to understand why $regex doesn't work.
use strict;
use English;
my $code_string = <<END_CODE;
my \$x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
\$x = 2;
return;
}
END_CODE
my $re1 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';
my $re2 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{';
my $re3 = '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{';
print "\$code_string is '$code_string'\n";
if ($code_string =~ /$re1/) {print "For '$re1', \$2 is '$2'\n";}
if ($code_string =~ /$re2/) {print "For '$re2', \$2 is '$2'\n";}
if ($code_string =~ /$re3/) {print "For '$re3', \$2 is '$2'\n";}
exit 0;
__END__
The output of the perl script above is the following:
$code_string is 'my $x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
$x = 2;
return;
} # sub zz
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{', $2 is '
# This is comment 1.
# This is comment 2.
# This is comment 3.
'
Look at only the part of your regex that captures $2. It is (\s*#.*\n). By itself, this can only capture a single comment line. You have an asterisk after it in order to capture multiple comment lines, and this works just fine. It captures multiple comment lines and puts each of them into $2, one by one, each time replacing the previous value of $2. So the final value of $2 when the regex is done matching is the last thing that the capturing group matched, which is the final comment line. Only. To fix it, you need to put the asterisk inside the capturing group. But then you need to put another set of parentheses (non-capturing, this time) to make sure the asterisk applies to the whole thing. So instead of (\s*#.*\n)*, you need ((?:\s*#.*\n)*).
Your third regex works because you unwittingly surrounded the whole expression in parentheses so that you could put a question mark after it. This caused $2 to capture all the comments at once, and $3 to capture only the final comment.
When you are debugging your regex, make sure you print out the values of all the match variables you are using: $1, $2, $3, etc. You would have seen that $1 was just the name of the subroutine and $2 was only the third comment. This might have led you to wonder how on earth your regex skipped over the first two comments when there is nothing between the first and second capturing groups, which would eventually lead you in the direction of discovering what happens when a capturing group matches multiple times.
By the way, it looks like you are also capturing any whitespace after the subroutine name into (Oops, I messed up my mnemonics and thought $1. Is this intentional?\w was "w for whitespace".)
If you add repetition to a capturing group, it will only capture the final match of that group. This is why $regex only matches the final comment line.
Here is how I would rewrite you regex:
my $regex = '\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{';
This is very similar to your $re3, except for the following changes:
((...)+)? to ((...)*) which is equivalent.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With