Perl regular expression isn't greedy enough

Question

I'm writing a regular expression in Perl to match Perl code that starts the definition of a subroutine. Here's my regular expression:

my $regex = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';

$regex matches code that starts a subroutine. I'm also trying to capture the name of the subroutine in $1 and any white space and comments between the subroutine name and the initial open brace in $2. It's $2 that is giving me a problem.

Consider the following Perl code:

my $x = 1;

sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
    $x = 2;
    return;
}

When I put this Perl code into a string and match it against $regex, $2 is "# This is comment 3.\n", not the three lines of comments that I want. I thought the regular expression would greedily put all three lines of comments into $2, but that seems not to be the case.

I would like to understand why $regex isn't working and to design a simple replacement. As the program below shows, I have a more complex replacement ($re3) that works. But I think it's important for me to understand why $regex doesn't work.

use strict;
use English;

my $code_string = <<END_CODE;
my \$x = 1;

sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
    \$x = 2;
    return;
}
END_CODE

my $re1 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';
my $re2 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{';
my $re3 = '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{';

print "\$code_string is '$code_string'\n";
if ($code_string =~ /$re1/) {print "For '$re1', \$2 is '$2'\n";}
if ($code_string =~ /$re2/) {print "For '$re2', \$2 is '$2'\n";}
if ($code_string =~ /$re3/) {print "For '$re3', \$2 is '$2'\n";}
exit 0;

__END__

The output of the Perl script above is the following:

$code_string is 'my $x = 1;

sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
    $x = 2;
    return;
}
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{', $2 is '
# This is comment 1.
# This is comment 2.
# This is comment 3.
'

Ryan C. Thompson · Accepted Answer

Look at only the part of your regex that captures $2. It is (\s*#.*\n). By itself, this can only capture a single comment line. You have an asterisk after it in order to match multiple comment lines, and the matching works just fine. The group matches each comment line in turn and puts it into $2, each time replacing the previous value of $2. So the final value of $2 when the regex is done matching is the last thing that the capturing group matched, which is only the final comment line.

To fix it, you need to put the asterisk inside the capturing group. But then you need another set of parentheses (non-capturing, this time) to make sure the asterisk applies to the whole thing. So instead of (\s*#.*\n)*, you need ((?:\s*#.*\n)*).
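The overwrite behavior described above can be seen in isolation with a small sketch. The input string and the variable names here are made up for illustration, and the regexes are cut down to just the comment-matching part:

```perl
use strict;
use warnings;

# Hypothetical sample input: three comment lines, as in the question.
my $text = "# one\n# two\n# three\n";

# Quantifier outside the capturing group: the capture is overwritten on
# every repetition, so only the last comment line survives.
my ($last_only) = $text =~ /^(#.*\n)*/;

# Quantifier inside the capturing group, applied to a non-capturing group:
# the capture spans all repetitions.
my ($all_lines) = $text =~ /^((?:#.*\n)*)/;

print "quantified capture: '$last_only'\n";
print "wrapped capture:    '$all_lines'\n";
```

Both patterns consume the same three lines; they differ only in what ends up in the capture.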

Your third regex works because you unwittingly surrounded the whole expression in parentheses so that you could put a question mark after it. This caused $2 to capture all the comments at once, and $3 to capture only the final comment.
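To check this against $re3 directly, here is a minimal sketch that uses a shortened copy of the code string from the question. The variable names $name, $outer and $inner are only illustrative labels for $1, $2 and $3:

```perl
use strict;
use warnings;

# Trimmed-down version of the code string from the question.
my $code = "sub zz\n# This is comment 1.\n# This is comment 2.\n# This is comment 3.\n{\n";

my $re3 = '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{';

# A list-context match returns the captures in order: $1, $2, $3.
my ($name, $outer, $inner) = $code =~ /$re3/;

print "\$1 is '$name'\n";
print "\$2 is '$outer'\n";
print "\$3 is '$inner'\n";
```

The outer group holds all three comment lines (plus the newline after the subroutine name), while the inner, repeated group holds only the last one.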

When you are debugging your regex, make sure you print out the values of all the match variables you are using: $1, $2, $3, etc. You would have seen that $1 was just the name of the subroutine and $2 was only the third comment. This might have led you to wonder how on earth your regex skipped over the first two comments when there is nothing between the first and second capturing groups, which would eventually lead you in the direction of discovering what happens when a capturing group matches multiple times.
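One way to make that habit cheap is a small helper that reports every capture from a match. This is only a sketch: dump_captures is a hypothetical name, the sample input is made up, and it assumes the regex contains at least one capturing group.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one report line per capture group.
# Assumes the regex has at least one capturing group; returns an empty
# list if the match fails.
sub dump_captures {
    my ($string, $regex) = @_;
    my @captures = $string =~ /$regex/;
    my @report;
    for my $i (0 .. $#captures) {
        my $value = defined $captures[$i] ? "'$captures[$i]'" : 'undef';
        push @report, '$' . ($i + 1) . " is $value";
    }
    return @report;
}

my $code = "sub zz\n# one\n# two\n{\n";
my $re1  = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';

print "$_\n" for dump_captures($code, $re1);
```

Seeing $1 and $2 side by side makes it obvious that the first comment line went missing somewhere between the two groups.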

~~By the way, it looks like you are also capturing any whitespace after the subroutine name into $1. Is this intentional?~~ (Oops, I messed up my mnemonics and thought \w was "w for whitespace".)

Andrew Clark · Answer

If you add repetition to a capturing group, it will only capture the final match of that group. This is why $2 only contains the final comment line when you match against $regex.

Here is how I would rewrite your regex:

my $regex = '\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{';

This is very similar to your $re3, except for the following changes:

The white space and comment matching portion is now in a non-capturing group.
I changed that portion of the regex from ((...)+)? to ((...)*), which matches the same text.
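A quick sketch comparing the rewritten regex with $re3 is below. The two input strings are made up for illustration. One small difference shows up when there are no comments at all: with the rewritten regex $2 is an empty string, while with $re3 it is undef because the optional outer group never participates in the match.

```perl
use strict;
use warnings;

my $regex = '\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{';   # rewritten
my $re3   = '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{';    # from the question

# Hypothetical inputs: one sub with comments before the brace, one without.
my $with_comments    = "sub zz\n# one\n# two\n{\n";
my $without_comments = "sub zz {\n";

# Keep only the second capture from each list-context match.
my (undef, $comments)  = $with_comments    =~ /$regex/;
my (undef, $empty_new) = $without_comments =~ /$regex/;
my (undef, $empty_re3) = $without_comments =~ /$re3/;

print "with comments:    '$comments'\n";
print "no comments, new: ", (defined $empty_new ? "'$empty_new'" : 'undef'), "\n";
print "no comments, re3: ", (defined $empty_re3 ? "'$empty_re3'" : 'undef'), "\n";
```

If the calling code tests $2 with defined, that difference may matter; otherwise the two forms can be swapped freely.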

Tags: regex, perl, regex-greedy

David Levner
