I have the following bunch of text:
text = """SECTION 1. CHAPTER 1. Chapter title. Art. 1.- Lorem ipsum, blah, blah. Art 2.- More meaningless text. Art 3.- A little more text. CHAPTER 2. Another chapter. Art 4.- Lorem ipsum blah, blah, blah. Art. 5.- It's getting boring. SECTION 2. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 3. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 4. CHAPTER 1. Another chapter in another section. Art. 6.- The last text."""
I want to split it as follows:
import re

RE = r'(SECTION.*?SECTION)'
m = re.findall(RE, text, re.DOTALL)
sections = []
if m:
    for match in m:
        sections.append(match)
hoping that it would result in a list with 4 elements, but I ended up with only 2:
['SECTION 1. .....', 'SECTION 3. .....'] # only showing the first characters of each element
Afterwards, I would like to do the same for chapters and articles.
Any ideas?
Assuming that the word SECTION only appears when a new "section" starts in your string, you can simply use the built-in .split method, which is much easier than using regexes.
Here's an example:
text = """SECTION 1. CHAPTER 1. Chapter title. Art. 1.- Lorem ipsum, blah, blah. Art 2.- More meaningless text. Art 3.- A little more text. CHAPTER 2. Another chapter. Art 4.- Lorem ipsum blah, blah, blah. Art. 5.- It's getting boring. SECTION 2. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 3. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 4. CHAPTER 1. Another chapter in another section. Art. 6.- The last text."""
delimiter = 'SECTION'
sections = [delimiter + s for s in text.split(delimiter)[1:]]
The result will be:
>>> sections
['SECTION 1. ...', 'SECTION 2. ...', 'SECTION 3. ...', 'SECTION 4. ...']
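For the chapters and articles, you can nest the same split idea. A minimal sketch, assuming CHAPTER and Art never occur in the running text other than as headings (the helper name split_on is mine, just for illustration):

def split_on(s, delimiter):
    # Re-attach the delimiter to each piece and drop the text before the first one.
    return [delimiter + part for part in s.split(delimiter)[1:]]

sections = split_on(text, 'SECTION')
for section in sections:
    for chapter in split_on(section, 'CHAPTER'):
        articles = split_on(chapter, 'Art')
        print(len(articles), 'article(s) in this chapter')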
The problem with your regex is that you consume the second SECTION. Once the first SECTION is found, the lazy dot-matching construct consumes as few characters as possible up to the next SECTION, and the returned match contains both SECTION words and everything in between. Thus, with 4 SECTIONs, you can only get 2 matches.
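You can watch this happen with a quick check against the sample text above:

m = re.findall(r'(SECTION.*?SECTION)', text, re.DOTALL)
print(len(m))                         # 2, not 4
print(m[0][:10], '...', m[0][-10:])   # each match both starts and ends with SECTION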
This can be solved with a regex in two ways (see a demo of all 3 regexes below at IDEONE).
Lazy dot matching with a lookahead (less efficient, not recommended)
print(re.findall(r"SECTION.*?(?=$|SECTION)", text, re.DOTALL))
When the regex engine finds the first SECTION, it starts consuming characters while checking each position for the end of the string ($) or the leftmost occurrence of the next SECTION. Because a lookahead only tests the text without consuming it, the next SECTION remains available to start the following match.
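A quick sanity check with the sample text above:

matches = re.findall(r"SECTION.*?(?=$|SECTION)", text, re.DOTALL)
print(len(matches))     # 4
print(matches[1][:12])  # SECTION 2. C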
Unroll-the-loop method to replace the lazy quantifier (much more efficient, requires no DOTALL modifier to match newline symbols)
print(re.findall(r"SECTION[^S]*(?:S(?!ECTION)[^S]*)*", text))
Here, no lazy quantifier or lookahead with alternations is necessary: SECTION consumes the first SECTION substring, and then [^S]*(?:S(?!ECTION)[^S]*)* matches any substring that is not equal to SECTION (up to the next SECTION if present, or just everything else up to the end of the string).
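The newline claim is easy to verify, since [^S] matches line breaks as well. A small check (the newline-injected variant of the text is an assumption, just for the demo):

multiline = text.replace('. CHAPTER', '.\nCHAPTER')
print(len(re.findall(r"SECTION[^S]*(?:S(?!ECTION)[^S]*)*", multiline)))  # 4, and no re.DOTALL needed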
A safer, similar expression that makes sure SECTION is followed by whitespace and digits with a trailing dot (a short demo of the difference follows the breakdown below):
print(re.findall(r"SECTION\s+\d+\.[^S]*(?:S(?!ECTION\s+\d+\.)[^S]*)*", text))
A regex explanation:
SECTION - matches SECTION literally
\s+ - 1 or more whitespace characters
\d+ - 1 or more digits
\. - a literal dot
[^S]* - any character but S
(?:S(?!ECTION\s+\d+\.)[^S]*)* - 0 or more sequences of:
  S(?!ECTION\s+\d+\.) - S that is not followed by ECTION + 1 or more whitespaces + 1 or more digits + a dot
  [^S]* - any character but S
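To see what the extra \s+\d+\. buys you, here is a check on a made-up string (tricky is an illustrative assumption) where SECTION also appears without a number:

tricky = "SECTION 1. foo bar. SECTION notes here. SECTION 2. baz."
print(re.findall(r"SECTION\s+\d+\.[^S]*(?:S(?!ECTION\s+\d+\.)[^S]*)*", tricky))
# ['SECTION 1. foo bar. SECTION notes here. ', 'SECTION 2. baz.']
# The unnumbered 'SECTION notes' stays inside the first match instead of starting a new one.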
UPDATE

To obtain a dictionary in the form of {'SECTION 1': '...', 'SECTION 2': '...'}, you need to add 2 capturing groups around the key and value patterns, and then use the dict constructor. This works because re.findall returns tuples of captured texts when capturing groups (i.e. parentheses) are present in the pattern ("If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group."):
print(dict(re.findall(r"(SECTION\s+\d+)\.\s*([^S]*(?:S(?!ECTION\s+\d+\.)[^S]*)*)", text)))
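The resulting dict keeps the document order (dicts preserve insertion order in Python 3.7+), so you can look sections up by name:

sections_map = dict(re.findall(r"(SECTION\s+\d+)\.\s*([^S]*(?:S(?!ECTION\s+\d+\.)[^S]*)*)", text))
print(list(sections_map))              # ['SECTION 1', 'SECTION 2', 'SECTION 3', 'SECTION 4']
print(sections_map['SECTION 2'][:20])  # CHAPTER 1. Another c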
See IDEONE demo