 

Python append adding same data

I'm trying to extract the stock price and the market cap data from a Korean website.

Here is my code:

import requests
from bs4 import BeautifulSoup
 
response = requests.get('http://finance.naver.com/sise/sise_market_sum.nhn?sosok=0&page=1')
html = response.text
soup = BeautifulSoup(html, 'html.parser')

table = soup.find('table', { 'class': 'type_2' })
data = []
for tr in table.find_all('tr'):
    tds = list(tr.find_all('td')) 

    for td in tds:
        if td.find('a'):
            company_name = td.find('a').text 
            price_now = tds[2].text
            market_cap = tds[5].text 
            data.append([company_name, price_now, market_cap])    

 
print(*data, sep = "\n")

And this is the result I get. (Sorry for the Korean characters)

['삼성전자', '43,650', '100']
['', '43,650', '100']
['SK하이닉스', '69,800', '5,000']
['', '69,800', '5,000']

The second and fourth lines of the output should not be there; I only want the first and third. Where do lines two and four come from, and how do I get rid of them?

K Lee asked Dec 22 '25 22:12



2 Answers

My friend, I think the problem is that you also need to check whether td.find('a').text actually has a value.

So I changed your code to this, and it works:

import requests
from bs4 import BeautifulSoup

response = requests.get(
    'http://finance.naver.com/sise/sise_market_sum.nhn?sosok=0&page=1')
html = response.text
soup = BeautifulSoup(html, 'html.parser')

table = soup.find('table', {'class': 'type_2'})
data = []
for tr in table.find_all('tr'):
    tds = list(tr.find_all('td'))

    for td in tds:
        # the fix: also require the link to have non-empty text
        if td.find('a') and td.find('a').text:
            company_name = td.find('a').text
            price_now = tds[2].text
            market_cap = tds[5].text
            data.append([company_name, price_now, market_cap])

print(*data, sep="\n")

Mark White answered Dec 24 '25 10:12



While I can't test it, it could be because each row you're scraping contains two a tags, and your for loop and if statement are set up to append a row whenever they find an a tag. The first one holds the name of the company, but the second one has no text, hence the blank entry (td.find('a').text returns the text of whichever a tag was found).

For reference, this is the a tag you want:

<a href="/item/main.nhn?code=005930" class="tltle">삼성전자</a>

This is what you're picking up the second time around:

<a href="/item/board.nhn?code=005930"><img src="https://ssl.pstatic.net/imgstock/images5/ico_debatebl2.gif" width="15" height="13" alt="토론실"></a>
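To see why the second match comes out blank, here is a small offline check. It is only a sketch: SAMPLE_ROW is a hand-written row modeled on the two tags above, and the surrounding table/tr/td markup and the shortened img src are assumptions, not copied from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical row: a simplified guess built from the two a tags quoted above.
SAMPLE_ROW = """
<table><tr>
  <td><a href="/item/main.nhn?code=005930" class="tltle">삼성전자</a></td>
  <td><a href="/item/board.nhn?code=005930"><img src="ico.gif" alt="토론실"></a></td>
</tr></table>
"""

soup = BeautifulSoup(SAMPLE_ROW, "html.parser")

# .text only collects text nodes; the img's alt attribute is not text,
# so the second link contributes an empty string.
link_texts = [a.text for a in soup.find_all("a")]
print(link_texts)  # ['삼성전자', '']
```

Both td cells pass the `if td.find('a')` test, so the original loop appends once per link, and the second append carries the empty name.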

Perhaps you can change your if statement to check that the class of the a tag is tltle (as in the first tag above), so that you only enter the if statement when you're looking at the a tag with the company name in it.
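A rough sketch of that idea, run against inline sample HTML rather than the live site so it works offline. SAMPLE_TABLE and the extract_rows helper are hypothetical; the column layout is an assumption chosen so that tds[2] and tds[5] line up with the indices used in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real table; only the two a tags per row
# follow the markup quoted above, the rest is assumed.
SAMPLE_TABLE = """
<table class="type_2">
  <tr><th>N</th><th>Name</th><th>Price</th></tr>
  <tr>
    <td>1</td>
    <td><a href="/item/main.nhn?code=005930" class="tltle">삼성전자</a></td>
    <td>43,650</td><td>-</td><td>-</td><td>100</td>
    <td><a href="/item/board.nhn?code=005930"><img src="ico.gif" alt="토론실"></a></td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href="/item/main.nhn?code=000660" class="tltle">SK하이닉스</a></td>
    <td>69,800</td><td>-</td><td>-</td><td>5,000</td>
    <td><a href="/item/board.nhn?code=000660"><img src="ico.gif" alt="토론실"></a></td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return [name, price, market_cap] once per row that has a company link."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "type_2"})
    rows = []
    for tr in table.find_all("tr"):
        # Match only the company-name link; the image-only link has no class.
        name_link = tr.find("a", class_="tltle")
        if name_link is None:
            continue  # header or spacer row
        tds = tr.find_all("td")
        rows.append([name_link.text, tds[2].text, tds[5].text])
    return rows


data = extract_rows(SAMPLE_TABLE)
print(*data, sep="\n")
```

Looking the link up once per tr, instead of looping over every td, also means a row can never be appended twice, whatever other links it contains.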

I'm at work so I can't really test anything, but let me know if you have any questions later!

jwoff answered Dec 24 '25 11:12



