In Python BeautifulSoup4, convert string into arguments for find()

I have a simple Python script that uses BeautifulSoup to find a section of the HTML tree. For example, to find everything inside the <div id="doctext"> tags, the script does this:

html_section = str(soup.find("div", id="doctext"))

However, I would like the arguments to find() to vary according to strings given in an input file. For example, a user could feed the script a URL followed by a string like "div", id="doctext", and the script would adjust the find() call accordingly. Imagine that the input file looks like this:

http://www.example.com | "div", id="doctext"

The script splits the line to get the URL, which works fine, but I want it to also grab the arguments. For example:

vars = line.split(' | ')
html = urllib2.urlopen(vars[0]).read()
soup = BeautifulSoup(html)
args = vars[1].split()
html_section = str(soup.find(*args))

This doesn't work, and probably doesn't even make sense; I've been trying multiple approaches. How do I take the string provided by the input file and turn it into the right arguments for the soup.find() call?

1 Answer

You could parse line like this:

line = 'http://www.example.com | div, id=doctext'
url, args = line.split(' | ', 1)
args = args.split(',')
name = args[0]
params = dict([param.strip().split('=') for param in args[1:]])
print(name)
print(params)

yields

div
{'id': 'doctext'}

Then you could call soup.find like this:

html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
html_section = str(soup.find(name, **params))

WARNING: Note that if the value of doctext (or of any other keyword argument) contains a comma, then

args = args.split(',')

will split the parameters in the wrong place. This problem might arise if you are searching for some text content that contains a comma.
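
For example (a made-up input line, just to illustrate the pitfall), suppose the text you search for contains a comma:

line = 'http://www.example.com | div, text=Hello, world'
url, args = line.split(' | ', 1)
args = args.split(',')
print(args)

yields

['div', ' text=Hello', ' world']

The text value has been cut in two, and since ' world' contains no '=', building the params dict from it would raise a ValueError.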


So let's look for a better solution:

To avoid the problem described above, you might consider using JSON for the arguments. If line looks like this:

'http://www.example.com | ["div", {"id": "doctext"}]'

Then you could parse it with

import json
line = 'http://www.example.com | ["div", {"id": "doctext"}]'
url, arguments = line.split('|', 1)
url = url.strip()
arguments = json.loads(arguments)
args = []
params = {}
for item in arguments:
    if isinstance(item, dict):
        params = item
    else:
        args.append(item)

print(args)
print(params)

which yields

[u'div']
{u'id': u'doctext'}

Then you could call soup.find with

html_section = str(soup.find(*args, **params))

An added advantage is that you can supply any number of positional arguments to soup.find (name, attrs, recursive, and text), not just the name.
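
Putting the pieces together, here is a minimal end-to-end sketch in the same Python 2 / urllib2 style as the question. The filename input.txt and the one "url | json-args" pair per line layout are assumptions for illustration, not part of the original question:

import json
import urllib2
from bs4 import BeautifulSoup

with open('input.txt') as f:  # hypothetical input file
    for line in f:
        line = line.strip()
        if not line:
            continue
        url, arguments = line.split('|', 1)
        url = url.strip()
        arguments = json.loads(arguments)

        # separate positional arguments (non-dicts) from keyword arguments (the dict)
        args = []
        params = {}
        for item in arguments:
            if isinstance(item, dict):
                params = item
            else:
                args.append(item)

        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        html_section = str(soup.find(*args, **params))
        print(html_section)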
