I know Reddit recently changed the way it handles APIs and it is very restrictive now. I am working on a school project and need Reddit data on stocks (subreddits: wallstreetbets, StockMarket). I am currently trying to scrape the pages from Old Reddit, but I only get a few records out. I was expecting a lot more data.
I have the following code, and even though I have num_pages_to_scrape set to 5000, I only get 138 records out.
I thought that maybe next_button is not working correctly, or that I should change the time.sleep(2), but I still get the same results.
Please help!
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

url = "https://old.reddit.com/r/wallstreetbets"
headers = {'User-Agent': 'Mozilla/5.0'}

data = []  # List to store post data

# Set the desired number of pages
num_pages_to_scrape = 5000

for counter in range(1, num_pages_to_scrape + 1):
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    posts = soup.find_all('div', class_='thing', attrs={'data-domain': 'self.wallstreetbets'})
    for post in posts:
        title = post.find('a', class_='title').text
        author = post.find('a', class_='author').text

        comments = post.find('a', class_='comments').text.split()[0]
        if comments == "comment":
            comments = 0

        likes = post.find("div", class_="score likes").text
        if likes == "•":
            likes = "None"

        # Extract the date information from the HTML
        date_element = post.find('time', class_='live-timestamp')
        date = date_element['datetime'] if date_element else "N/A"
        formatted_date = pd.to_datetime(date, utc=True).strftime('%Y-%m-%d %H:%M:%S')

        data.append([formatted_date, title, author, comments, likes])

    next_button = soup.find("span", class_="next-button")
    if next_button:
        next_page_link = next_button.find("a").attrs['href']
        url = next_page_link
    else:
        break

    time.sleep(2)

# Create df
columns = ['Date', 'Title', 'Author', 'Comments', 'Likes']
df = pd.DataFrame(data, columns=columns)

# Print the DataFrame
df
Here is an example skeleton of how you can use their JSON API to download multiple pages of data (note: to get the data in JSON form, add .json to the end of the URL):
import json
import requests

# API doc: https://old.reddit.com/dev/api/
url = "https://reddit.com/r/wallstreetbets.json"
headers = {"User-Agent": "Mozilla/5.0"}

data = requests.get(url, headers=headers).json()

while True:
    # Uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for c in data["data"]["children"]:
        print(c["data"]["title"])

    # "after" is the pagination cursor; it is None once the listing runs out
    after = data["data"].get("after")
    if not after:
        break

    url = "https://reddit.com/r/wallstreetbets.json?after=" + after
    data = requests.get(url, headers=headers).json()
    if not data:
        break
Prints:
...
Remember kids, bankruptcies are good for stonks
Bullish news - Tesla mcflurry cyber spoon
King of dilution strikes again
Unleashing the Power of YouTube for Alpha Gains
When is PayPal going 🚀🚀🚀
He's been right so far
FuelCell teaming up with Toyota. First ever Tri-gen plant up and running
Bad day to be an Ape
...
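If you want the same fields your scraper collects (date, title, author, comments, score), you can read them straight from the JSON instead of parsing HTML. Below is a minimal sketch, assuming the standard listing fields (title, author, num_comments, score, created_utc) and the limit=100 query parameter to fetch up to 100 posts per request; note that Reddit listings only go back a limited number of posts, so you still won't get unlimited history:

import time
import pandas as pd
import requests

headers = {"User-Agent": "Mozilla/5.0"}
base = "https://reddit.com/r/wallstreetbets.json?limit=100"  # limit=100 -> up to 100 posts per page

rows = []
url = base
while url:
    data = requests.get(url, headers=headers).json()
    for c in data["data"]["children"]:
        d = c["data"]
        rows.append({
            "Date": pd.to_datetime(d["created_utc"], unit="s", utc=True),
            "Title": d["title"],
            "Author": d["author"],
            "Comments": d["num_comments"],
            "Score": d["score"],
        })
    after = data["data"].get("after")  # None once the listing is exhausted
    url = base + "&after=" + after if after else None
    time.sleep(2)  # be polite to avoid rate limiting

df = pd.DataFrame(rows)
print(df)

This builds the same kind of DataFrame as your BeautifulSoup version, just with the score and comment counts taken directly from the JSON fields instead of scraped text.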