
python findall finds only the last occurrence

Tags:

python

regex

I'm trying to extract some data from a webpage using Python 2.7.5.

code:

import re

p = re.compile(r'.*<section\s*id="(.+)">(.+)</section>.*')
# named "text" rather than "str" to avoid shadowing the built-in type
text = 'df  <section id="1">2</section> fdd <section id="3">4</section> fd'
m = p.findall(text)
for eachentry in m:
    print 'id=[{}], text=[{}]'.format(eachentry[0], eachentry[1])

output:

id=[3], text=[4]

Why does it extract only the last occurrence? If I remove the last occurrence, the first one is found.

4ntoine asked Dec 07 '25 17:12


2 Answers

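The .* at the beginning is greedy: it consumes as much of the string as it can and only backtracks far enough for the rest of the pattern to match, so it swallows everything up to the last <section>. Because the pattern also ends in .*, that single match covers the whole string and findall has nothing left to search. In fact, every .* and .+ in the expression is greedy. So we make them non-greedy by appending ?, like this: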

p = re.compile(r'.*?<section\s*id="(.+?)">(.+?)</section>.*?')

The output then becomes:

id=[1], text=[2]
id=[3], text=[4]
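
To see where the greedy version goes wrong, it can help to look at how much of the string each match covers. The following is only a sketch: the names greedy, lazy, greedy_spans and lazy_spans are made up for illustration, and it uses the sample string from the question.

```python
import re

text = 'df  <section id="1">2</section> fdd <section id="3">4</section> fd'

# Original pattern: greedy quantifiers everywhere
greedy = re.compile(r'.*<section\s*id="(.+)">(.+)</section>.*')
# Same pattern with every quantifier made non-greedy
lazy = re.compile(r'.*?<section\s*id="(.+?)">(.+?)</section>.*?')

# span() gives the (start, end) offsets of each whole match
greedy_spans = [m.span() for m in greedy.finditer(text)]
lazy_spans = [m.span() for m in lazy.finditer(text)]

print(greedy_spans)  # a single span covering the entire string
print(lazy_spans)    # two spans, each ending right after a </section>
```

The greedy pattern produces one match that starts at offset 0 and ends at the end of the string, so there is nothing left for a second match. The non-greedy pattern stops at the first </section> it can, and the search resumes from there.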

In fact, findall scans the string for matches on its own, so you can drop the first and last .* patterns and keep it simple, like this:

p = re.compile(r'<section\s*id="(.+?)">(.+?)</section>')
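
Putting the simplified pattern into a complete script might look like the sketch below. The names SECTION_RE and extract_sections are hypothetical helpers, not part of the question, and print is called with a single parenthesized argument so the script should behave the same on Python 2.7 and Python 3.

```python
import re

# Simplified pattern: no leading/trailing .*, non-greedy capture groups
SECTION_RE = re.compile(r'<section\s*id="(.+?)">(.+?)</section>')


def extract_sections(html):
    """Return a list of (id, text) tuples, one per <section> found."""
    return SECTION_RE.findall(html)


html = 'df  <section id="1">2</section> fdd <section id="3">4</section> fd'

for section_id, section_text in extract_sections(html):
    print('id=[{}], text=[{}]'.format(section_id, section_text))
```

Note that . does not match newlines by default, so a section whose text spans several lines would need the re.DOTALL flag added to the compile call.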
thefourtheye answered Dec 09 '25 17:12


Your regular expression needs to be changed as follows:

p = re.compile(r'<section\s*id="(.+?)">(.+?)</section>')
Sunny Nanda answered Dec 09 '25 16:12


