Problem with re.findall (duplicates)

Tags: python, html, regex

I am trying to fetch the source of a 4chan board and extract the links to its threads.

My regular expression isn't working the way I expect. Source:

import urllib2, re

req = urllib2.Request('http://boards.4chan.org/wg/')
resp = urllib2.urlopen(req)
html = resp.read()

print re.findall("res/[0-9]+", html)
#print re.findall("^res/[0-9]+$", html)

The problem is that:

print re.findall("res/[0-9]+", html)

returns duplicates.

I can't use:

print re.findall("^res/[0-9]+$", html)

I have read the Python docs, but they didn't help.

SnZ asked Dec 03 '25 08:12


1 Answer

That's because the page source contains multiple copies of each link.

You can easily make them unique by putting them in a set.

>>> print set(re.findall("res/[0-9]+", html))
set(['res/3833795', 'res/3837945', 'res/3835377', 'res/3837941', 'res/3837942',
'res/3837950', 'res/3100203', 'res/3836997', 'res/3837643', 'res/3835174'])
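
Note that a set does not keep the order in which the links appear on the page. If the order matters to you, here is a minimal sketch of an order-preserving alternative. It is written for Python 3.7+ (where dicts keep insertion order), and the `html` string is a made-up sample standing in for the downloaded page, not the real site's markup:

```python
import re

# Hypothetical sample standing in for the downloaded page source.
html = (
    '<a href="res/3833795">first</a> '
    '<a href="res/3837945">second</a> '
    '<a href="res/3833795#p1">first again</a>'
)

# Same pattern as in the question; this still returns duplicates.
links = re.findall(r"res/[0-9]+", html)

# dict.fromkeys keeps only the first occurrence of each key, in order.
unique_links = list(dict.fromkeys(links))

print(unique_links)  # ['res/3833795', 'res/3837945']
```

On Python 2, as used in the question, plain dicts are unordered, so you would need something like `collections.OrderedDict.fromkeys(links)` instead.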

But if you are going to do anything more complex than this, I'd recommend using a library that can parse HTML, such as BeautifulSoup or lxml.
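
To give a rough idea of what the parser-based approach looks like, here is a sketch that uses the standard library's `html.parser` rather than BeautifulSoup or lxml. That is a substitution on my part so the example runs without installing anything. It is written for Python 3, and the `ThreadLinkParser` class and the sample markup are hypothetical; the real page may be structured differently:

```python
import re
from html.parser import HTMLParser


class ThreadLinkParser(HTMLParser):
    """Collects unique res/<digits> links found in <a href="..."> attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get("href") or ""
        match = re.match(r"res/[0-9]+", href)
        if match and match.group(0) not in self.links:
            self.links.append(match.group(0))


# Hypothetical sample markup, not the real site's HTML.
sample_html = (
    '<a href="res/3833795">Thread</a>'
    '<a href="res/3833795#p3833800">Reply</a>'
    '<a href="res/3837945">Other thread</a>'
    '<span>res/999 mentioned in plain text</span>'
)

parser = ThreadLinkParser()
parser.feed(sample_html)
print(parser.links)  # ['res/3833795', 'res/3837945']
```

Unlike the bare regex, this only looks at real link targets, so text that merely happens to contain `res/` followed by digits is ignored.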

Lennart Regebro answered Dec 05 '25 21:12