
Split multiple (nested) python sub-regex definitions

Tags: python, regex

I have a definition file, possibly with statements split across several lines, that follows a pattern similar to this:

group-definition "first-regex" "second-regex"

Both sub-regexes are actual regular expressions, and I need to check the "main" syntax. The Python code should return the following data:

  • the actual group-definition syntax
  • the first regex, which I'll need to process further as a standalone regex
  • the second regex, which I'll need to process further in the same way as the first

Also, the sub-regex definitions might use either single or double quotes, so the following syntax could also be correct:

definition "first-regex.*" 'second-regex[0-9]' #some comment

I also need to find out whether the syntax is correct, so the following string must not be recognized as valid:

something-right "something wrong' 'really-\.wrong" wtf

That's because I need exactly 2 regexes to process afterwards, with no further data added (unless it's a comment starting with either "#" or ";").

Unfortunately, my experience with regex is not that deep, but I know that using something like this won't work as expected:

[\.]* (\".+?\")|(\'.+?\')[\ ](\".+?\")|(\'.+?\')

I suppose that I'd need some deeper knowledge of how regex sub-groups work, but I've not been able to understand how to get them right yet.

I know that there are plenty of questions and answers on this kind of topic, but I wasn't able to find the right search terms for this kind of issue.

musicamante asked Nov 20 '25 02:11



1 Answer

You're on the right track. I'll assume all of the following are valid statements:

definition 'regex1' "regex2"
definition   # Comment
    'regex1' # Comment
    'regex2'

You might want to look into named captures. Your pattern should allow for comments or whitespace between each argument, and you must remember to use the re.S flag, which allows '.' to match '\n'.

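import re

# Raw string, so backslashes reach the regex engine unchanged. Under re.X an
# unescaped '#' starts a pattern comment, so the literal comment marker is
# written as the class [#;] (';' added because the question allows it too).
pattern = r"""(?P<definition>[\w\-]+)        # Definition name: letters, digits, '_' and '-'
              (?P<break1>(\s|[#;].*?\n)*?)   # Optional comments and whitespace
              (?P<reg1>\'.*?\'|\".*?\")      # First regex, in single or double quotes
              (?P<break2>(\s|[#;].*?\n)*?)   # Another optional break
              (?P<reg2>\'.*?\'|\".*?\")      # Second regex"""

with open('your_document', 'r') as f:
    for match in re.finditer(pattern, f.read(), re.X | re.S):
        # do something with each match
        print(match['definition'], match['reg1'], match['reg2'])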

re.X allows the pattern to be written verbosely, with whitespace and comments. re.S, as mentioned before, lets '.' match newlines inside the break sub-groups. finditer is useful for matching many times: it finds all non-overlapping matches and yields them one by one.
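To see finditer and the two flags at work without a file, here is a minimal sketch that runs a variant of the pattern over the example statements from the top of this answer. The `text` and `found` names are only for illustration, and the comment marker is written as `[#;]` because a bare `#` would start a pattern comment under re.X.

```python
import re

# Same five named groups as in the answer. The comment marker sits inside a
# character class so that re.X does not read it as a pattern comment.
pattern = r"""(?P<definition>[\w\-]+)
              (?P<break1>(\s|[#;].*?\n)*?)
              (?P<reg1>\'.*?\'|\".*?\")
              (?P<break2>(\s|[#;].*?\n)*?)
              (?P<reg2>\'.*?\'|\".*?\")"""

# Illustrative input: one single-line statement and one split over
# several lines with comments in between.
text = '''definition 'regex1' "regex2"
definition   # Comment
    'regex1' # Comment
    'regex2'
'''

# Collect (definition, reg1, reg2) for every non-overlapping match.
found = [(m['definition'], m['reg1'], m['reg2'])
         for m in re.finditer(pattern, text, re.X | re.S)]
print(found)
```

Note that the captured sub-regexes still include their surrounding quotes.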

(?P<name>pattern) allows sub-captures to be accessed by name, so you can access them with the following (Python 3.6+; on older versions use match.group('reg1') instead):

match['definition']
match['reg1']
match['reg2']
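Since reg1 and reg2 are captured together with their quotes, they need a small extra step before they can be used as standalone regexes. A minimal sketch, where `compile_sub_regex` is a hypothetical helper name and the input strings stand in for what match['reg1'] and match['reg2'] would hold:

```python
import re

def compile_sub_regex(quoted):
    """Drop the surrounding quote characters and compile what is left."""
    # re.compile raises re.error if the sub-regex itself is invalid.
    return re.compile(quoted[1:-1])

# Stand-ins for match['reg1'] and match['reg2'], taken from the question.
reg1 = compile_sub_regex('"first-regex.*"')
reg2 = compile_sub_regex("'second-regex[0-9]'")

print(reg1.pattern)
print(bool(reg2.fullmatch('second-regex7')))
```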

Read the documentation for more info
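The pattern above finds statements but does not reject malformed ones, which the question also asks for. One possible approach, sketched here under two assumptions that are not in the original answer (one statement per line, and a sub-regex never contains its own delimiting quote), is to validate each line with fullmatch and a backreference to the opening quote. `STATEMENT` and `parse_statement` are hypothetical names.

```python
import re

# Sketch only: assumes one statement per line and that a sub-regex never
# contains the quote character that delimits it.
STATEMENT = re.compile(r"""
    \s*(?P<definition>[\w\-]+)                           # definition name
    \s+(?P<q1>['"])(?P<reg1>(?:(?!(?P=q1)).)+)(?P=q1)    # first quoted regex
    \s+(?P<q2>['"])(?P<reg2>(?:(?!(?P=q2)).)+)(?P=q2)    # second quoted regex
    \s*(?:[#;].*)?                                       # optional trailing comment
""", re.X)

def parse_statement(line):
    """Return (definition, reg1, reg2), or None if the line is not valid."""
    m = STATEMENT.fullmatch(line.strip())
    if m is None:
        return None
    return m['definition'], m['reg1'], m['reg2']

print(parse_statement('''definition "first-regex.*" 'second-regex[0-9]' #some comment'''))
print(parse_statement('''something-right "something wrong' 'really-\\.wrong" wtf'''))
```

Here (?P=q1) forces the closing quote to be the same character as the opening one, and fullmatch rejects any trailing text that is not a comment. Unlike the pattern above, the sub-regexes are captured without their quotes.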

Ranga answered Nov 21 '25 14:11



