
Merge two regexes with variable number of capture groups

I'm trying to match either

(\S+)(=)([fisuo])

or

(\S+)(!)

And then have the results placed in a list (capture groups). All of my attempts result in extra, unwanted captures.

Here's some code:

#!/usr/bin/perl
#-*- cperl -*-
# $Id: test7,v 1.1 2023/04/10 02:57:12 bennett Exp bennett $
#

use strict;
use warnings;
use Data::Dumper;

foreach my $k ('debugFlags=s', 'verbose!') {
    my @v;

    # Below is the offensive looking code.  I was hoping for a regex
    # which would behave like this:

    if(@v = $k =~ m/^(\S+)(=)([fisuo])$/) {
      printf STDERR ("clownMatch = '$k' => %s\n\n", Dumper(\@v));
    } elsif(@v = $k =~ m/^(\S+)(!)$/) {
      printf STDERR ("clownMatch = '$k' => %s\n\n", Dumper(\@v));
    }

    @v = ();

    # This is one of my failed, aspirational matches.  I think I know
    # WHY it fails, but I don't know how to fix it.
    
    if(@v = $k =~ m/^(?:(\S+)(=)([fisuo]))|(?:(\S+)(!))$/) {
      printf STDERR ("hopefulMatch = '$k' => %s\n\n", Dumper(\@v));
    }
    printf STDERR "===\n";
}

exit(0);
__END__

Output:

clownMatch = 'debugFlags=s' => $VAR1 = [
          'debugFlags',
          '=',
          's'
        ];


hopefulMatch = 'debugFlags=s' => $VAR1 = [
          'debugFlags',
          '=',
          's',
          undef,
          undef
        ];


===
clownMatch = 'verbose!' => $VAR1 = [
          'verbose',
          '!'
        ];


hopefulMatch = 'verbose!' => $VAR1 = [
          undef,
          undef,
          undef,
          'verbose',
          '!'
        ];


===

There are more details in the code comments, and the output is at the bottom of the code section. The '!' character is just that: a literal exclamation mark. I'm not confusing it with a negation operator.

Update Mon Apr 10 23:15:40 PDT 2023:

With the wise input of several readers, it seems that this question decomposes into a few smaller questions.

Can a regex return a variable number of capture groups?

I haven't heard one way or the other.

Should one use a regex in this way, if it could?

Not without a compelling reason.

For my purposes, should I use a regex to create what is really a lexical-analyzer/parser?

No. I was using a regex for syntax checking and got carried away.

I learned a good deal, though. I hope moderators see fit to keep this post as a cautionary tale.

Everyone deserves points on this one, and can claim that they were robbed, citing this paragraph. @Schwern gets the points for being first. Thanks.

Erik Bennett asked Oct 20 '25 23:10


1 Answer

All of my attempts result in extra, unwanted captures.

I'd go for the "branch reset" construct (?| pattern1 | pattern2 | ... ), as already suggested by @bobble_bubble (in a comment only).

It's a generic way to combine alternative patterns that contain groups: the capture count is reset at the start of each branch, so every branch numbers its groups from the same starting point.
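To illustrate what resetting the capture count means, here is a minimal sketch that runs the question's two alternatives against 'verbose!', once as a plain alternation and once inside a branch reset. The helper show and the variable names are mine, introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: render a capture list, making undef slots visible.
sub show { join ', ', map { defined $_ ? "'$_'" : 'undef' } @_ }

my $input = 'verbose!';

# Plain alternation: groups are numbered 1..5 across both branches,
# so the second branch fills slots 4 and 5.
my @plain = $input =~ /^(?:(\S+)(=)([fisuo])|(\S+)(!))$/;

# Branch reset: each branch starts again at group 1,
# so the second branch fills slots 1 and 2.
my @reset = $input =~ /^(?|(\S+)(=)([fisuo])|(\S+)(!))$/;

print "plain: ", show(@plain), "\n";   # undef, undef, undef, 'verbose', '!'
print "reset: ", show(@reset), "\n";   # 'verbose', '!', undef
```

The branch-reset list has three slots because the longest branch has three groups; the shorter branch leaves the last one undefined.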

Alas, contrary to the docs he linked to, you'll still get undef slots at the end of the list returned for branches with fewer groups.

But if this really bothers you (personally, I would keep them), you can safely filter them out with a grep {defined}, as @zdim suggested.

That's safe because undef means the group did not take part in the match, so it can't be confused with a group that matched the empty string "".
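A small sketch of why the filter is safe. It assumes a hypothetical variation of the pattern in which the type group uses * and may therefore match nothing; that quantifier is not in the question and is only there to produce an empty capture.

```perl
use strict;
use warnings;

# Hypothetical variation: '*' lets the third group match the empty string.
my $re = qr/^(\S+)(?|(=)([fisuo]*)|(!))$/;

# Third group takes part in the match but matches nothing: it is "", not undef.
my @empty_type = 'debugFlags=' =~ $re;

# Third group does not take part in the match at all: it is undef.
my @no_type = 'verbose!' =~ $re;

# grep {defined} drops only the undef slot and keeps the empty string.
my @kept_empty = grep { defined } @empty_type;
my @kept_none  = grep { defined } @no_type;

printf "%s: %d captures kept\n", 'debugFlags=', scalar @kept_empty;   # 3
printf "%s: %d captures kept\n", 'verbose!',    scalar @kept_none;    # 2
```

Filtering with grep { length } instead would also throw away the empty match, which is why defined is the right test here.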

Here is code covering your test cases:

use v5.12.0;
use warnings;
use Data::Dump qw/pp ddx/;
use Test::More;

# https://stackoverflow.com/questions/75974097/merge-two-regexes-with-variable-number-of-capture-groups

my %wanted =
  (
   "debugFlags=s" => ["debugFlags", "=", "s"],
   "verbose!"     => ["verbose", "!"],
  );


while ( my ( $str, $expect) = each %wanted ) {
    my @got =
      $str =~ / ^ (\S+)                 # anchored like the patterns in the question
                (?|
                    (=) ([fisuo])       # one type letter, as in the question
                |
                    (!)
                ) $
              /x;

    ddx \@got;                          # with trailing undefs

    @got = grep {defined} @got;         # eliminate undefs

    is_deeply( \@got, $expect, "$str => ". pp(\@got));
}

done_testing();

Output:

# branchreset.pl:25: ["debugFlags", "=", "s"]
ok 1 - debugFlags=s => ["debugFlags", "=", "s"]
# branchreset.pl:25: ["verbose", "!", undef]
ok 2 - verbose! => ["verbose", "!"]
1..2

strategic update

But again, I don't see the point in eliminating the undef slots at the end, since you will need to handle the different cases individually anyway.
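Handling the cases individually could look like the sketch below. parse_spec and the hash keys it returns are hypothetical names I made up for illustration; the point is only that the code branches on the second capture anyway, so a trailing undef costs nothing.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an option spec into a small description hash.
# Returns nothing if the spec does not match either form.
sub parse_spec {
    my ($spec) = @_;

    # $type stays undef for the '!' form; that is expected.
    my ($name, $op, $type) = $spec =~ /^(\S+)(?|(=)([fisuo])|(!))$/
        or return;

    return { name => $name, kind => 'value', type => $type } if $op eq '=';
    return { name => $name, kind => 'negatable' };
}

for my $spec ('debugFlags=s', 'verbose!', 'bogus') {
    my $parsed = parse_spec($spec);
    if    (!$parsed)                   { print "$spec: not a valid spec\n" }
    elsif ($parsed->{kind} eq 'value') { print "$spec: takes a value of type $parsed->{type}\n" }
    else                               { print "$spec: negatable flag\n" }
}
```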

And one day you might want to add patterns after the branch reset too. If branch reset really skipped the missing groups, the numbers of the groups that follow it would depend on which branch matched. So from a design perspective that's well done.
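A sketch of that situation, assuming a hypothetical trailing group ([@%])? added after the branch reset. The trailing group and the input 'verbose!%' are purely illustrative and not something the question asks for.

```perl
use strict;
use warnings;

# Hypothetical extension: an optional trailing marker after the branch reset.
my $re = qr/^ (\S+)
              (?|
                  (=) ([fisuo])
              |
                  (!)
              )
              ([@%])?           # always group 4, whichever branch matched
            $/x;

my @with_type = 'include=s@' =~ $re;   # 'include', '=', 's', '@'
my @with_flag = 'verbose!%'  =~ $re;   # 'verbose', '!', undef, '%'

# The trailing marker sits at index 3 in both lists.
print "marker after '=' form: $with_type[3]\n";
print "marker after '!' form: $with_flag[3]\n";
```

Groups after the construct are numbered as if it contained only its longest branch, so the undef at index 2 in the second list is what keeps the marker in a fixed position.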

LanX answered Oct 24 '25 19:10