I'm trying to fetch the source of a 4chan board and extract the links to its threads.
My regexp isn't working as I expect. Source:
import urllib2, re
req = urllib2.Request('http://boards.4chan.org/wg/')
resp = urllib2.urlopen(req)
html = resp.read()
print re.findall("res/[0-9]+", html)
#print re.findall("^res/[0-9]+$", html)
The problem is that:
print re.findall("res/[0-9]+", html)
returns duplicates.
I can't use:
print re.findall("^res/[0-9]+$", html)
I have read the Python docs, but they didn't help.
That's because each link appears multiple times in the page source.
You can easily make them unique by putting them in a set:
>>> print set(re.findall("res/[0-9]+", html))
set(['res/3833795', 'res/3837945', 'res/3835377', 'res/3837941', 'res/3837942',
'res/3837950', 'res/3100203', 'res/3836997', 'res/3837643', 'res/3835174'])
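Note that a set does not keep the links in the order they appear on the page. If the order matters, a minimal sketch like the following may help. The function name unique_thread_links and the sample string are made up for illustration, and the code is written to run on Python 3 (the loop itself also works on Python 2).

```python
import re

def unique_thread_links(html):
    """Return res/<number> links in first-seen order, without duplicates."""
    seen = set()   # links already emitted
    links = []     # result, in page order
    for link in re.findall(r"res/[0-9]+", html):
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links

# Hypothetical input, standing in for the html fetched from the board.
sample = '<a href="res/1">x</a> <a href="res/2">y</a> <a href="res/1">z</a>'
links = unique_thread_links(sample)
```

The set is still what makes the duplicate check cheap; the list is only there to remember the order.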
But if you are going to do anything more complex than this, I'd recommend using a library that can parse HTML, such as BeautifulSoup or lxml.
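To give a rough idea of what parsing buys you, here is a sketch that only looks at href attributes of a tags instead of scanning the raw text. It deliberately uses the standard library's HTMLParser rather than BeautifulSoup or lxml, so it runs without installing anything; those libraries offer a more convenient API for the same job. The names ThreadLinkParser and thread_links are hypothetical, and the import is the Python 3 one (on Python 2 the module is called HTMLParser).

```python
import re
from html.parser import HTMLParser

class ThreadLinkParser(HTMLParser):
    """Collect unique res/<number> targets from <a href="..."> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or not value:
                continue
            # Keep only the res/<number> prefix, dropping any #fragment.
            match = re.match(r"res/[0-9]+", value)
            if match and match.group(0) not in self.links:
                self.links.append(match.group(0))

def thread_links(html):
    """Parse html and return the thread links in page order."""
    parser = ThreadLinkParser()
    parser.feed(html)
    return parser.links
```

Unlike the plain regexp, this ignores any "res/123" text that happens to appear outside a link, which is the kind of case where a parser starts to pay off.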