Scraping data from Reddit - No API Sept 2023

I know Reddit recently changed the way it handles API access and it is very restrictive now. I am working on a school project and need Reddit data on stocks (subreddits: WallStreetBets, StockMarket). I am currently trying to scrape the pages from Old Reddit, but I only get a few records out. I was expecting a lot more data.

I have the following code, and even though I have num_pages_to_scrape set to 5000, I only get 138 records out. I thought that maybe the next_button lookup is not working correctly, or that I should change the time.sleep(2), but I still get the same results. Please help!

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

url = "https://old.reddit.com/r/wallstreetbets"
headers = {'User-Agent': 'Mozilla/5.0'}

data = []  # List to store post data

# Set the desired number of pages
num_pages_to_scrape = 5000

for counter in range(1, num_pages_to_scrape + 1):
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    posts = soup.find_all('div', class_='thing', attrs={'data-domain': 'self.wallstreetbets'})

    for post in posts:
        title = post.find('a', class_='title').text
        author = post.find('a', class_='author').text
        comments = post.find('a', class_='comments').text.split()[0]

        if comments == "comment":
            comments = 0

        likes = post.find("div", class_="score likes").text

        if likes == "•":
            likes = "None"
        
        # Extract the date information from the HTML
        date_element = post.find('time', class_='live-timestamp')
        date = date_element['datetime'] if date_element else "N/A"
        formatted_date = pd.to_datetime(date, utc=True).strftime('%Y-%m-%d %H:%M:%S')

        data.append([formatted_date, title, author, comments, likes])

    next_button = soup.find("span", class_="next-button")
    if next_button:
        next_page_link = next_button.find("a").attrs['href']
        url = next_page_link
    else:
        break

    time.sleep(2)

# Create the DataFrame
columns = ['Date', 'Title', 'Author', 'Comments', 'Likes']
df = pd.DataFrame(data, columns=columns)

# Print the DataFrame
print(df)

1 Answer

Here is an example skeleton showing how you can use their JSON API to download multiple pages of data (note: to get the data in JSON form, add .json at the end of the URL):

import json
import requests

# api doc: https://old.reddit.com/dev/api/
url = "https://reddit.com/r/wallstreetbets.json"
headers = {"User-Agent": "Mozilla/5.0"}

data = requests.get(url, headers=headers).json()

while True:
    # uncomment this to print the full JSON response:
    # print(json.dumps(data, indent=4))
    for c in data["data"]["children"]:
        print(c["data"]["title"])

    # "after" is the cursor for the next page; it is None on the last page,
    # so check it before building the next URL (concatenating None raises TypeError)
    after = data["data"]["after"]
    if not after:
        break

    url = "https://reddit.com/r/wallstreetbets.json?after=" + after
    data = requests.get(url, headers=headers).json()

Prints:

...

Remember kids, bankruptcies are good for stonks
Bullish news - Tesla mcflurry cyber spoon
King of dilution strikes again
Unleashing the Power of YouTube for Alpha Gains
When is PayPal going 🚀🚀🚀
He's been right so far
FuelCell teaming up with Toyota. First ever Tri-gen plant up and running
Bad day to be an Ape

...
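
If you want the same columns the question builds (Date, Title, Author, Comments, Likes), the listing JSON already carries them as created_utc, title, author, num_comments and score, so no HTML parsing is needed. Below is a minimal sketch of that mapping; the 40-page cap is an assumption based on Reddit listings typically stopping after roughly 1000 posts (40 pages of 25), and the two-second delay just mirrors the one in the question:

import time

import pandas as pd
import requests

url = "https://reddit.com/r/wallstreetbets.json"
headers = {"User-Agent": "Mozilla/5.0"}

rows = []
for _ in range(40):  # listings usually run out around ~1000 posts, so 5000 pages is moot
    data = requests.get(url, headers=headers).json()
    for c in data["data"]["children"]:
        d = c["data"]
        rows.append({
            # created_utc is a Unix timestamp in seconds
            "Date": pd.to_datetime(d["created_utc"], unit="s", utc=True).strftime("%Y-%m-%d %H:%M:%S"),
            "Title": d["title"],
            "Author": d["author"],
            "Comments": d["num_comments"],
            "Likes": d["score"],
        })

    after = data["data"]["after"]  # None once the last page is reached
    if not after:
        break
    url = "https://reddit.com/r/wallstreetbets.json?after=" + after
    time.sleep(2)  # polite delay between requests (an assumption, not an API requirement)

df = pd.DataFrame(rows)
print(df)

The page cap and the delay are tunable; the loop stops early on its own once "after" comes back as None.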