I am trying to scrape http://emojipedia.org/emoji/, but I am not sure what the most efficient way to do so is. What I would like to scrape is inside the table with class="emoji_list". I would like to save the contents of each "td" in separate columns. The output should look like the following, where each line represents an emoji:
Col1_Link Col2_emoji Col3_Comment Col4_UTF
"/emoji/%F0%9F%98%80/" 😀 Grinning Face U+1F600
I have written the following code so far, but I am not sure of the best way to proceed.
import requests
from bs4 import BeautifulSoup
import urllib
import re
url = "http://emojipedia.org/emoji/"
html = urllib.urlopen(url)
soup = BeautifulSoup(html)
soup.findAll('tr', limit=2)
Many thanks in advance for your help.
soup.findAll('tr', limit=2) won't do much, since it only returns the first two tr elements on the page. You first need to find all the rows of the table, then extract what you want from the two td elements inside each tr:
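import requests
from bs4 import BeautifulSoup

url = "http://emojipedia.org/emoji/"
html = requests.get(url).content
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.emoji-list")

# Only the first five rows are printed here; drop the slice to get them all.
for row in table.find_all("tr")[:5]:
    td1, td2 = row.find_all("td")
    # Split the cell text once, at the first run of whitespace: emoji, then description.
    em, desc = td1.text.split(None, 1)
    print(td1.a["href"], em, desc, td2.text)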
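To make the split(None, 1) step in the loop above concrete, here is a minimal sketch that runs on a made-up string rather than on the live page. The sample text and the helper name split_cell are assumptions for illustration; the real cell text may contain different whitespace.

```python
# Hypothetical cell text, shaped like the text of the first <td> in a row.
cell_text = "\n😀 Grinning Face\n"


def split_cell(text):
    """Split cell text into (emoji, description) at the first whitespace run."""
    # sep=None ignores leading whitespace; maxsplit=1 keeps the rest of the
    # description (including its internal spaces) in one piece.
    emoji, description = text.split(None, 1)
    # Trailing whitespace survives the split, so strip it from the description.
    return emoji, description.strip()


print(split_cell(cell_text))
```

Note that the two-name unpacking raises a ValueError if a cell holds only one token, so a row without a description would need separate handling.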
Another way, which gets the text without splitting, is to take the text directly from the a tag while excluding the text of its child elements, using find(text=True, recursive=False):
for row in table.find_all("tr"):
    td1, td2 = row.find_all("td")
    print(td1.a["href"], td1.a.span.text, td1.a.find(text=True, recursive=False), td2.text)
Also I would stick to using requests over urllib.
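Since the question asks for the values to be saved in separate columns, one possible way to do that is with the standard csv module. This is only a sketch: the helper name rows_to_csv and the sample row are assumptions (the row is copied from the example output in the question), and in practice the rows would come from the loop above instead of being hard-coded.

```python
import csv
import io

# Column names taken from the example output in the question.
HEADER = ["Col1_Link", "Col2_emoji", "Col3_Comment", "Col4_UTF"]


def rows_to_csv(rows):
    """Return CSV text with a header line followed by one line per emoji row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical data: one (link, emoji, comment, code point) tuple per table row.
sample_rows = [("/emoji/%F0%9F%98%80/", "😀", "Grinning Face", "U+1F600")]
print(rows_to_csv(sample_rows))
```

To write to disk instead, the same writer can be pointed at a file opened with open(path, "w", newline="", encoding="utf-8") so that the emoji characters are stored as UTF-8.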