 

Scrapy csv output "randomly" missing fields

My scrapy crawler correctly reads all fields as the debug output shows:

2017-01-29 02:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.willhaben.at/iad/immobilien/mietwohnungen/niederoesterreich/krems-an-der-donau/altbauwohnung-wg-geeignet-donaublick-189058451/>
{'Heizung': 'Gasheizung', 'whCode': '189058451', 'Teilmöbliert / Möbliert': True, 'Wohnfläche': '105', 'Objekttyp': 'Zimmer/WG', 'Preis': 1050.0, 'Miete (inkl. MWSt)': 890.0, 'Stockwerk(e)': '2', 'Böden': 'Laminat', 'Bautyp': 'Altbau', 'Zustand': 'Sehr gut/gut', 'Einbauküche': True, 'Zimmer': 3.0, 'Miete (exkl. MWSt)': 810.0, 'Befristung': 'nein', 'Verfügbar': 'ab sofort', 'zipcode': 3500, 'Gesamtbelastung': 1150.0}

but when I output the csv using the command line option

scrapy crawl mietwohnungen -o mietwohnungen.csv --logfile=mietwohnungen.log

some of the fields are missing, as the corresponding line from the output file shows:

Keller,whCode,Garten,Zimmer,Terrasse,Wohnfläche,Parkplatz,Objekttyp,Befristung,zipcode,Preis
,189058451,,3.0,,105,,Zimmer/WG,nein,3500,1050.0

The fields missing in the example are: Heizung, Teilmöbliert / Möbliert, Miete (inkl. MWSt), Stockwerk(e), Böden, Bautyp, Zustand, Einbauküche, Miete (exkl. MWSt), Verfügbar, Gesamtbelastung

This happens with a few of the values that I scrape. One thing to note is that not every page contains the same fields, so I generate the field names per page: I build a dict containing all the fields that are present and yield it at the end. This works, as the DEBUG output shows. However, some CSV columns don't seem to be printed.

As you can see, some columns are blank because those fields only exist on other pages ('Keller' in the example).

The scraper works if I use a smaller list to scrape (e.g. refine my initial search selection while keeping some of the problematic pages in the results):

Heizung,Zimmer,Bautyp,Gesamtbelastung,Einbauküche,Miete (exkl. MWSt),Zustand,Miete (inkl. MWSt),zipcode,Teilmöbliert / Möbliert,Objekttyp,Stockwerk(e),Böden,Befristung,Wohnfläche,whCode,Preis,Verfügbar
Gasheizung,3.0,Altbau,1150.0,True,810.0,Sehr gut/gut,890.0,3500,True,Zimmer/WG,2,Laminat,nein,105,189058451,1050.0,ab sofort

I have already switched to Python 3 to rule out any unicode string problems.

Is this a bug? It also seems to affect only the CSV output; if I export to XML, all fields are printed.

I don't understand why it does not work with the full list. Is the only solution really to write a csv exporter manually?

asked Oct 19 '25 by MoRe

1 Answer

If you are yielding scraped results as dicts, the CSV columns are populated from the keys of the first yielded dict, as Scrapy's CsvItemExporter shows:

def _write_headers_and_set_fields_to_export(self, item):
    if self.include_headers_line:
        if not self.fields_to_export:
            if isinstance(item, dict):
                # for dicts try using fields of the first item
                self.fields_to_export = list(item.keys())
            else:
                # use fields declared in Item
                self.fields_to_export = list(item.fields.keys())
        row = list(self._build_row(self.fields_to_export))
        self.csv_writer.writerow(row)
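The same behaviour can be reproduced with the stdlib csv module alone, which is essentially what happens here: the header is fixed from the first item, and keys that appear only in later items are silently dropped (a stand-alone sketch, not Scrapy code):

```python
import csv
import io

# Two "scraped" items; the second has a field the first lacks.
items = [
    {"whCode": "189058451", "Preis": 1050.0},
    {"whCode": "189058452", "Preis": 890.0, "Heizung": "Gasheizung"},
]

buf = io.StringIO()
# Header is fixed from the FIRST item's keys only.
writer = csv.DictWriter(
    buf,
    fieldnames=list(items[0].keys()),
    extrasaction="ignore",  # keys not in the header are silently dropped
)
writer.writeheader()
writer.writerows(items)

print(buf.getvalue())
# "Heizung" never makes it into the CSV, although the second item has it.
```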

So you should either define an Item with all the fields declared explicitly and populate that, or write a custom CSVItemExporter.
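For the custom-exporter route, the key change is to compute the header from the union of all items' keys instead of just the first item's. A minimal stand-alone sketch with the plain csv module (`export_full_csv` is an illustrative name, not Scrapy API; a real exporter would have to buffer the items first):

```python
import csv
import io

def export_full_csv(f, items):
    """Write dicts with varying key sets to CSV without dropping columns."""
    # Header = union of the keys of ALL items, in first-seen order.
    fieldnames = []
    for item in items:
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval="" leaves a blank cell where an item lacks a field.
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(items)

items = [
    {"whCode": "189058451", "Preis": 1050.0},
    {"whCode": "189058452", "Preis": 890.0, "Heizung": "Gasheizung"},
]
buf = io.StringIO()
export_full_csv(buf, items)
print(buf.getvalue())  # header now contains whCode, Preis AND Heizung
```

Inside Scrapy this logic would live in a CsvItemExporter subclass; the simpler route is usually to declare every possible field on a scrapy.Item, since the exporter then takes the header from `item.fields.keys()`, as the source above shows.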

answered Oct 20 '25 by mizhgun


