I have a string that I'm trying to validate against a few regex patterns and I was hoping since Pattern matching is available in 3.10, I might be able to use that instead of creating an if-else block.
Consider a string 'validateString' with possible values 1021102,1.25.32, string021.
The code I tried would be something like the following.
match validateString:
    case regex1:
        print('Matched regex1')
    case regex2:
        print('Matched regex2')
    case regex3:
        print('Matched regex3')
For regex 1, 2 and 3, I've tried string regex patterns and also re.compile objects but it doesn't seem to work.
I have been trying to find examples of this over the internet but can't seem to find any that cover regex pattern matching with the new python pattern matching.
Any ideas for how I can make it work?
Thanks!
I condensed this answer into a python package to make matching as easy as pip install regex-spm,
import regex_spm
match regex_spm.fullmatch_in("abracadabra"):
  case r"\d+": print("It's all digits")
  case r"\D+": print("There are no digits in the search string")
  case _: print("It's something else")
As Patrick Artner correctly points out in the other answer, there is currently no official way to do this. Hopefully the feature will be introduced in a future Python version and this question can be retired. Until then:
PEP 634 specifies that Structural Pattern Matching uses the == operator for evaluating a match. We can override that.
import re
from dataclasses import dataclass
# noinspection PyPep8Naming
@dataclass
class regex_in:
    string: str
    def __eq__(self, other: str | re.Pattern):
        if isinstance(other, str):
            other = re.compile(other)
        assert isinstance(other, re.Pattern)
        # TODO extend for search and match variants
        return other.fullmatch(self.string) is not None
Now you can do something like:
match regex_in(validated_string):
    case r'\d+':
        print('Digits')
    case r'\s+':
        print('Whitespaces')
    case _:
        print('Something else')
Caveat #1 is that you can't pass re.compile'd patterns to the case directly, because then Python wants to match based on class. You have to save the pattern somewhere first.
Caveat #2 is that you can't actually use local variables either, because Python then interprets it as a name for capturing the match subject. You need to use a dotted name, e.g. putting the pattern into a class or enum:
class MyPatterns:
    DIGITS = re.compile('\d+')
match regex_in(validated_string):
    case MyPatterns.DIGITS:
        print('This works, it\'s all digits')
This could be extended even further to provide an easy way to access the re.Match object and the groups.
# noinspection PyPep8Naming
@dataclass
class regex_in:
    string: str
    match: re.Match = None
    def __eq__(self, other: str | re.Pattern):
        if isinstance(other, str):
            other = re.compile(other)
        assert isinstance(other, re.Pattern)
        # TODO extend for search and match variants
        self.match = other.fullmatch(self.string)
        return self.match is not None
    def __getitem__(self, group):
        return self.match[group]
# Note the `as m` in in the case specification
match regex_in(validated_string):
    case r'\d(\d)' as m:
        print(f'The second digit is {m[1]}')
        print(f'The whole match is {m.match}')
There is a clean solution to this problem. Just hoist the regexes out of the case-clauses where they aren't supported and into the match-clause which supports any Python object.
The combined regex will also give you better efficiency than could be had by having a series of separate regex tests. Also, the regex can be precompiled for maximum efficiency during the match process.
Here's a worked out example for a simple tokenizer:
pattern = re.compile(r'(\d+\.\d+)|(\d+)|(\w+)|(".*)"')
Token = namedtuple('Token', ('kind', 'value', 'position'))
env = {'x': 'hello', 'y': 10}
for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    mo = pattern.fullmatch(s)
    match mo.lastindex:
        case 1:
            tok = Token('NUM', float(s), mo.span())
        case 2:
            tok = Token('NUM', int(s), mo.span())
        case 3:
            tok = Token('VAR', env[s], mo.span())
        case 4:
            tok = Token('TEXT', s[1:-1], mo.span())
        case _:
            raise ValueError(f'Unknown pattern for {s!r}')
    print(tok) 
This outputs:
Token(kind='NUM', value=123, position=(0, 3))
Token(kind='NUM', value=123.45, position=(0, 6))
Token(kind='VAR', value='hello', position=(0, 1))
Token(kind='VAR', value=10, position=(0, 1))
Token(kind='TEXT', value='goodbye', position=(0, 9))
The code can be improved by writing the combined regex in verbose format for intelligibility and ease of adding more cases. It can be further improved by naming the regex sub patterns:
pattern = re.compile(r"""(?x)
    (?P<float>\d+\.\d+) |
    (?P<int>\d+) |
    (?P<variable>\w+) |
    (?P<string>".*")
""")
That can be used in a match/case statement like this:
for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    mo = pattern.fullmatch(s)
    match mo.lastgroup:
        case 'float':
            tok = Token('NUM', float(s), mo.span())
        case 'int':
            tok = Token('NUM', int(s), mo.span())
        case 'variable':
            tok = Token('VAR', env[s], mo.span())
        case 'string':
            tok = Token('TEXT', s[1:-1], mo.span())
        case _:
            raise ValueError(f'Unknown pattern for {s!r}')
    print(tok)
Here is the equivalent code written using an if-elif-else chain:
for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    if (mo := re.fullmatch('\d+\.\d+', s)):
        tok = Token('NUM', float(s), mo.span())
    elif (mo := re.fullmatch('\d+', s)):
        tok = Token('NUM', int(s), mo.span())
    elif (mo := re.fullmatch('\w+', s)):
        tok = Token('VAR', env[s], mo.span())
    elif (mo := re.fullmatch('".*"', s)):
        tok = Token('TEXT', s[1:-1], mo.span())
    else:
        raise ValueError(f'Unknown pattern for {s!r}')
    print(tok)
Compared to the match/case, the if-elif-else chain is slower because it runs multiple regex matches and because there is no precompilation. Also, it is less maintainable without the case names.
Because all the regexes are separate we have to capture all the match objects separately with repeated use of assignment expressions with the walrus operator. This is awkward compared to the match/case example where we only make a single assignment.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With