Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python - Web Scraping - BeautifulSoup & CSV

I am hoping to extract the change in cost of living from one city against many cities. I plan to list the cities I would like to compare in a CSV file and using this list to create the web link that would take me to the website with the information I am looking for.

Here is the link to an example: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city

Unfortunately I am running into several challenges. Any assistance to the following challenges is greatly appreciated!

  1. The output only shows the percentage, but no indication whether it is more expensive or cheaper. For the example listed above, my output based on the current code shows 48%, 129%, 63%, 43%, 42%, and 42%. I tried to correct for this by adding an 'if-statement' to add '+' sign if it is more expensive, or a '-' sign if it is cheaper. However, this 'if-statement' is not functioning correctly.
  2. When I write the data to a CSV file, each of the percentages is written to a new row. I can't seem to figure out how to write it as a list on one line.
  3. (related to item 2) When I write the data to a CSV file for the example listed above, the data is written in the format listed below. How can I correct the format and have the data written in the preferred format listed below (also without the percentage sign)?

CURRENT CSV FORMAT (Note: 'if-statement' not functioning correctly):

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,1,2,9,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,6,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%

PREFERRED CSV FORMAT:

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city, 48,129,63,43,42,42

Here is my current code:

import requests
import csv
from bs4 import BeautifulSoup

#Read text file
Textfile = open("City.txt")
Textfilelist = Textfile.read()
Textfilelistsplit = Textfilelist.split("\n")
HomeCity = 'Phoenix'

i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page  = requests.get(url).text
    soup_expatistan = BeautifulSoup(page)

    #Prepare CSV writer.
    WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
    WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])

    expatistan_table = soup_expatistan.find("table",class_="comparison")
    expatistan_titles = expatistan_table.find_all("tr",class_="expandable")

    for expatistan_title in expatistan_titles:
            percent_difference = expatistan_title.find("th",class_="percent")
            percent_difference_title = percent_difference.span['class']
            if percent_difference_title == "expensiver":
                WriteResultsFile.writerow(Textfilelistsplit[i] + '+' + percent_difference.span.string)
            else:
                WriteResultsFile.writerow(Textfilelistsplit[i] + '-' + percent_difference.span.string)
    i+=1
like image 888
user3599514 Avatar asked Mar 12 '26 09:03

user3599514


2 Answers

Answers:

  • Question 1: the class of the span is a list, you need to check if expensiver is inside this list. In other words, replace:

    if percent_difference_title == "expensiver" 
    

    with:

    if "expensiver" in percent_difference.span['class']
    
  • Questions 2 and 3: you need to pass a list of column values to writerow(), not string. And, since you want only one record per city, call writerow() outside of the loop (over the trs).

Other issues:

  • open csv file for writing before the loop
  • use with context managers while working with files
  • try to follow PEP8 style guide

Here's the code with modifications:

import requests
import csv
from bs4 import BeautifulSoup

BASE_URL = 'http://www.expatistan.com/cost-of-living/comparison/{home_city}/{city}'
home_city = 'Phoenix'

with open('City.txt') as input_file:
    with open("Expatistan.csv", "w") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])
        for line in input_file:
            city = line.strip()
            url = BASE_URL.format(home_city=home_city, city=city)
            soup = BeautifulSoup(requests.get(url).text)

            table = soup.find("table", class_="comparison")
            differences = []
            for title in table.find_all("tr", class_="expandable"):
                percent_difference = title.find("th", class_="percent")
                if "expensiver" in percent_difference.span['class']:
                    differences.append('+' + percent_difference.span.string)
                else:
                    differences.append('-' + percent_difference.span.string)
            writer.writerow([city] + differences)

For the City.txt containing just one new-york-city line, it produces Expatistan.csv with the following content:

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,+48%,+129%,+63%,+43%,+42%,+42%

Make sure you understand what changes have I made. Let me know if you need further help.

like image 91
alecxe Avatar answered Mar 13 '26 21:03

alecxe


csv.writer.writerow() takes a sequence and makes each element a column; normally you'd give it a list with columns, but you are passing in strings instead; that'll add individual characters as columns instead.

Just build a list, then write it to the CSV file.

First, open the CSV file once, not for every separate city; you are clearing out the file every time you open it.

import requests
import csv
from bs4 import BeautifulSoup

HomeCity = 'Phoenix'

with open("City.txt") as cities, open("Expatistan.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["City", "Food", "Housing", "Clothes",
                     "Transportation", "Personal Care", "Entertainment"])

    for line in cities:
        city = line.strip()
        url = "http://www.expatistan.com/cost-of-living/comparison/{}/{}".format(
            HomeCity, city)
        resp = requests.get(url)
        soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)

        titles = soup.select("table.comparison tr.expandable")

        row = [city]
        for title in titles:
            percent_difference = title.find("th", class_="percent")
            changeclass = percent_difference.span['class']
            change = percent_difference.span.string
            if "expensiver" in changeclass:
                change = '+' + change
            else:
                change = '-' + change
            row.append(change)
         writer.writerow(row)
like image 35
Martijn Pieters Avatar answered Mar 13 '26 23:03

Martijn Pieters