How to scrape data once a day and write it to csv

I'm a total newbie, just starting with web scraping as a hobby.

I want to scrape data from a forum (the total number of posts, the total number of topics, and the number of all users) at https://www.fly4free.pl/forum/

(screenshot showing the data I want to scrape)

Watching some tutorials, I've come up with this code:

from bs4 import BeautifulSoup
import requests
import datetime
import csv

source = requests.get('https://www.fly4free.pl/forum/').text
soup = BeautifulSoup(source, 'lxml')

csv_file = open('4fly_forum.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Data i godzina', 'Wszytskich postów', 'Wszytskich tematów', 'Wszytskich użytkowników'])

czas = datetime.datetime.now()
czas = czas.strftime("%Y-%m-%d %H:%M:%S")
print(czas)

dane = soup.find('p', class_='genmed')

posty = dane.find_all('strong')[0].text
print(posty)

tematy = dane.find_all('strong')[1].text
print(tematy)

user = dane.find_all('strong')[2].text
print(user)

print()

csv_writer.writerow([czas, posty, tematy, user])    
csv_file.close()

I don't know how to make it run once a day and how to append data to the file once a day. Sorry if my questions are trivial for you pros ;), it's my first training assignment.

Also, the resulting csv file doesn't look nice; I would like the data to be nicely formatted into columns.

Any help and insight will be much appreciated. Thanks in advance, Dejvciu

Asked Dec 22 '25 18:12 by Dejvciu

1 Answer

You can use the schedule library in Python to do this. First, install it with

pip install schedule

Then you can modify your code to run at an interval of your choice:

import schedule
import time

def scrape():
    # your web scraping code here
    print('web scraping')

schedule.every().day.at("10:30").do(scrape) # change 10:30 to time of your choice

while True:
    schedule.run_pending()
    time.sleep(1)

This will run the web scraping code every day at 10:30, as long as the script itself keeps running; you can host it for free on an always-on machine or service so it runs continuously.
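If you want to check that the scheduling works without waiting a whole day, the schedule library also accepts shorter intervals, for example (assuming the same scrape function as above):

schedule.every(10).minutes.do(scrape)  # for testing: run every 10 minutes
schedule.every().hour.do(scrape)       # or once per hour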

Here's how you would save the results to a csv in a nicely formatted way, with the fieldnames (czas, posty, tematy and user) as column names.

import csv
from os import path

# avoid re-writing the headers (fieldnames / column names) every time the script runs;
# the header row is written only once, when the file is first created
file_status = path.isfile('filename.csv')

with open('filename.csv', 'a+', newline='') as csvfile:
    fieldnames = ['czas', 'posty', 'tematy', 'user']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    if not file_status:
        writer.writeheader() 
    writer.writerow({'czas': czas, 'posty': posty, 'tematy': tematy, 'user': user})
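Putting the two pieces together, here's a minimal sketch of what the full daily scraper could look like. It reuses the selectors from your original code (the p tag with class genmed and its strong tags) and assumes the forum front page still exposes the three numbers in that order; the file name and field names are simply the ones used above.

import csv
import datetime
import time
from os import path

import requests
import schedule
from bs4 import BeautifulSoup

CSV_FILE = '4fly_forum.csv'
FIELDNAMES = ['czas', 'posty', 'tematy', 'user']

def scrape():
    # fetch and parse the forum front page (same selectors as in the question)
    source = requests.get('https://www.fly4free.pl/forum/', timeout=30).text
    soup = BeautifulSoup(source, 'lxml')
    dane = soup.find('p', class_='genmed')
    if dane is None:
        # page layout changed or the request returned something unexpected; skip this run
        return
    liczby = dane.find_all('strong')  # [posts, topics, users]

    row = {
        'czas': datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        'posty': liczby[0].text,
        'tematy': liczby[1].text,
        'user': liczby[2].text,
    }

    # append one row per run; write the header only when the file is first created
    file_exists = path.isfile(CSV_FILE)
    with open(CSV_FILE, 'a', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
        if not file_exists:
            writer.writeheader()
        writer.writerow(row)

schedule.every().day.at("10:30").do(scrape)

while True:
    schedule.run_pending()
    time.sleep(1)

Each run appends exactly one row, so the csv grows by one line per day, with the column names written only on the first run.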


Answered Dec 24 '25 07:12 by stuckoverflow


