 

Beautiful Soup: not able to use get_text() after using extract()

I am working on web scraping and want only the text from a website, so I am using Beautiful Soup. Initially I found that the get_text() method was also returning JavaScript code, and I read that I should use the extract() method to avoid that. Now I have a weird problem: after extracting the script and style tags, Beautiful Soup doesn't recognize the body, even though it is present in the new HTML.

To clarify, at first I was doing this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(rawData, 'html.parser')
print(soup.body)

Here the print statement printed all the HTML data, but when I do:

soup = BeautifulSoup(rawData, 'html.parser')
for script in soup(["script", "style"]):
    script.extract()    # rip it out
print(soup.body)

Now it prints None, as if the element were not present. For debugging I then called soup.prettify(), and it printed the whole HTML, including the body tag and without any script or style tags. I am very confused about why this happens: if the body is present, why does soup.body return None? Please help, thanks.

I am using Python 3 and bs4, and rawData is HTML extracted from a website.

maq asked Nov 18 '25 08:11



1 Answer

Problem: Consider this HTML example:

<html>
<style>just style</style>
<span>Main text.</span>
</html>

After extracting the style tag and calling get_text(), it returns only the text it was supposed to remove. This is due to a double newline left in the HTML after using extract(). Inspect soup.contents before and after calling extract() and you will see the issue.

Before extract():

[<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>]

After extract():

[<html>\n\n<span>Main text.</span>\n</html>]

You can see the double newline between html and span. This breaks get_text() for a reason I have not identified. To validate this, remove the newlines from the example and it will work properly.
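To check this yourself, here is a minimal sketch that takes a snapshot of the children of the html tag before and after extract(). It assumes the example HTML above and the 'html.parser' backend; the names root, before and after are only for illustration. What get_text() returns at the end may differ between bs4 versions, so no expected output is shown.

```python
from bs4 import BeautifulSoup

html = "<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>"

soup = BeautifulSoup(html, "html.parser")
root = soup.html                  # keep a direct reference to the <html> tag
before = list(root.contents)      # children before extraction

for tag in soup(["script", "style"]):
    tag.extract()                 # remove the tag from the tree

after = list(root.contents)       # children after extraction

print(before)
print(after)                      # two newline strings now sit next to each other
print(repr(soup.get_text()))      # result may depend on the installed bs4 version
```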

Solutions:

1.- Parse the soup again after the extract() call.

BeautifulSoup(str(soup), 'html.parser')

2.- Use a different parser.

BeautifulSoup(raw, 'html5lib')

Note: Solution #2 doesn't work if you extract two or more contiguous tags, because you end up with a double newline again.

Note: You will probably have to install this parser. Just do:

pip install html5lib
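
Solution #1 can be wrapped in a small helper. This is only a sketch: the function name visible_text and its parser argument are my own, not part of bs4, and it assumes that re-parsing the serialized tree is enough to get a clean soup.

```python
from bs4 import BeautifulSoup


def visible_text(raw_html, parser="html.parser"):
    """Return the text of raw_html without <script> and <style> contents."""
    soup = BeautifulSoup(raw_html, parser)
    for tag in soup(["script", "style"]):
        tag.extract()  # remove the tag from the tree
    # Serialize and parse again so the new soup is built from scratch
    clean = BeautifulSoup(str(soup), parser)
    # Join the text fragments with a space and drop surrounding whitespace
    return clean.get_text(separator=" ", strip=True)


html = "<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>"
text = visible_text(html)
print(text)
```

Passing parser="html5lib" should work the same way once that parser is installed.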
Miguel A. Sanchez-Perez answered Nov 20 '25 00:11



