
Beautiful Soup Web Scraper on Binance Announcement Page Lags behind by 5 minutes

I have built a web scraper using bs4 whose purpose is to send a notification when a new announcement is posted. At the moment I am testing it with the word 'list' instead of the full set of announcement keywords. For some reason, when I compare the time at which it detects a new announcement against the time the announcement actually appears on the website, it is roughly 5 minutes behind.

This is the actual time it was announced (screenshot of the announcement page). Below is my code:

from bs4 import BeautifulSoup
from requests import get
import time
import sys

while True:
    time.sleep(30)

    # Fetch and parse the announcements page
    url = "https://www.binance.com/en/support/announcement"
    response = get(url)
    html_page = response.content
    soup = BeautifulSoup(html_page, 'html.parser')
    news_list = soup.find_all(class_='css-qinc3w')

    # Bag of keywords for matching announcements
    key_words = ['list', 'token sale', 'open trading', 'opens trading', 'perpetual', 'defi', 'uniswap', 'airdrop', 'adds', 'updates', 'enabled', 'trade', 'support']

    # Articles matched on this pass
    updated_list = []

    for news in news_list:
        article_text = news.text

        # Testing with the single keyword 'list' for now
        if "list" in article_text.lower():
            updated_list.append([article_text])

        # A fifth match means a new 'list' announcement has appeared
        if len(updated_list) > 4:
            print(time.asctime(time.localtime(time.time())))
            print(article_text)
            sys.exit()

When the length of the list increased by 1, from 4 to 5, the script printed the following time and new announcement: Fri May 28 04:17:39 2021, Binance Will List Livepeer (LPT).

I am unsure why this is. At first I thought I was being throttled, but looking again at robots.txt I didn't see any reason why I should be. Moreover, I included a sleep time of 30 seconds, which should be more than enough to scrape without any issues. Any help or an alternative solution would be much appreciated.

My question is:

Why is it 5 minutes behind? Why am I not notified as soon as the website posts an announcement? The program takes about 5 minutes longer to recognize a new post than the time at which it appears on the website.
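For reference, when moving beyond testing with just the word 'list', the full key_words list could be checked with any(). This is only a minimal sketch; matches_keywords is a hypothetical helper, not part of the code above:

# A minimal sketch: match a title against the whole keyword list,
# case-insensitively, instead of only the word 'list'.
key_words = ['list', 'token sale', 'open trading', 'opens trading', 'perpetual', 'defi',
             'uniswap', 'airdrop', 'adds', 'updates', 'enabled', 'trade', 'support']

def matches_keywords(title, keywords=key_words):
    """Return True if any keyword appears anywhere in the title."""
    title = title.lower()
    return any(word in title for word in keywords)

print(matches_keywords("Binance Will List Livepeer (LPT)"))        # True
print(matches_keywords("Notice Regarding Scheduled Maintenance"))  # False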

Asked by Vic Rokx

2 Answers

from xrzz import http  # give it a try using my simple from-scratch module
import json

url = "https://www.binance.com/bapi/composite/v1/public/cms/article/list/query?type=1&pageNo=1&pageSize=30"

req = http("GET", url=url, tls=True).body().decode()

key_words = ['list', 'token sale', 'open trading', 'opens trading', 'perpetual', 'defi', 'uniswap', 'airdrop', 'adds', 'updates', 'enabled', 'trade', 'support']

for i in json.loads(req)['data']['catalogs']:
    for o in i['articles']:
        if key_words[0] in o['title'].lower():  # case-insensitive match on 'list'
            print(o['title'])

Output: the titles of the matching articles are printed.
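If the xrzz module is not available, roughly the same request can be made with the requests library. This is only a sketch, assuming the endpoint and JSON layout (data -> catalogs -> articles -> title) shown above:

import requests

# Sketch: the same Binance CMS query made with requests instead of xrzz.
# Assumes the JSON layout shown above: data -> catalogs -> articles -> title.
url = ("https://www.binance.com/bapi/composite/v1/public/cms/article/list/query"
       "?type=1&pageNo=1&pageSize=30")

key_words = ['list', 'token sale', 'open trading', 'opens trading', 'perpetual', 'defi',
             'uniswap', 'airdrop', 'adds', 'updates', 'enabled', 'trade', 'support']

data = requests.get(url, timeout=10).json()

for catalog in data['data']['catalogs']:
    for article in catalog['articles']:
        if key_words[0] in article['title'].lower():
            print(article['title'])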

Answered by dedshit


I think the problem is that the Cloudflare server is caching the documents, or the caching was set up deliberately by the Binance developers so that a narrow circle of people can react to the news faster than everyone else.

This is a big problem if you want fresh data. If you look at the HTTP response headers, you will notice that the Date: header is cached by the server, which means the entire content of the document is cached. I managed to get two different Date: values by adding or removing the gzip header, Accept-Encoding: gzip, deflate.
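To illustrate, here is a sketch of how the Date: header can be compared with and without gzip, using requests and the catalog endpoint mentioned below:

import requests

# Sketch: fetch the same endpoint twice, once without compression and once with
# gzip, and compare the Date: header of the two (cached) responses.
url = ("https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query"
       "?catalogId=48&pageNo=1&pageSize=15")

plain = requests.get(url, headers={"Accept-Encoding": "identity"}, timeout=10)
gzipped = requests.get(url, headers={"Accept-Encoding": "gzip, deflate"}, timeout=10)

# Different Date: values suggest two separately cached copies are being served.
print("identity:", plain.headers.get("Date"))
print("gzip    :", gzipped.headers.get("Date"))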

I am using this page:

https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query?catalogId=48&pageNo=1&pageSize=15

If you change the pageSize parameter, you can get fresh cached responses from the server. But that still doesn't solve the 5-minute delay issue, and I still see the old page.
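A sketch of varying pageSize to hit differently cached copies (this only changes which cached document is returned; it does not remove the underlying delay):

import requests

# Sketch: vary pageSize so each request maps to a different cache key.
# This may return a differently cached copy, but it does not remove the delay.
base = "https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query"

for page_size in (15, 16, 17, 18):
    resp = requests.get(base,
                        params={"catalogId": 48, "pageNo": 1, "pageSize": page_size},
                        timeout=10)
    print(page_size, resp.headers.get("Date"))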

Your link,

https://www.binance.com/bapi/composite/v1/public/cms/article/list/query?type=1&pageNo=1&pageSize=30

is cached just like mine,

https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query?catalogId=48&pageNo=1&pageSize=15

for 5 seconds at a time. My guess is that it will show the same 5-minute delay as well. I have not found a solution to this problem.

Answered by user3210461


