Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Spark (PySpark) handling null values when reading in CSV

I'm trying to read in flight data from the Department of Transportation. It is stored in a CSV, and keep getting java.lang.NumberFormatException: null

I have tried setting the nanValue to the empty string, as it's default value is NaN, but this hasn't worked.

My current code is:

spark = SparkSession.builder \
    .master('local') \
    .appName('Flight Delay') \
    .getOrCreate()

schema = StructType([
    StructField('Year', IntegerType(), nullable=True),
    StructField('Month', IntegerType(), nullable=True),
    StructField('Day', IntegerType(), nullable=True),
    StructField('Dow', IntegerType(), nullable=True),
    StructField('CarrierId', StringType(), nullable=True),
    StructField('Carrier', StringType(), nullable=True),
    StructField('TailNum', StringType(), nullable=True),
    StructField('Origin', StringType(), nullable=True),
    StructField('Dest', StringType(), nullable=True),
    StructField('CRSDepTime', IntegerType(), nullable=True),
    StructField('DepTime', IntegerType(), nullable=True),
    StructField('DepDelay', DoubleType(), nullable=True),
    StructField('TaxiOut', DoubleType(), nullable=True),
    StructField('TaxiIn', DoubleType(), nullable=True),
    StructField('CRSArrTime', IntegerType(), nullable=True),
    StructField('ArrTime', IntegerType(), nullable=True),
    StructField('ArrDelay', DoubleType(), nullable=True),
    StructField('Cancelled', DoubleType(), nullable=True),
    StructField('CancellationCode', StringType(), nullable=True),
    StructField('Diverted', DoubleType(), nullable=True),
    StructField('CRSElapsedTime', DoubleType(), nullable=True),
    StructField('ActualElapsedTime', DoubleType(), nullable=True),
    StructField('AirTime', DoubleType(), nullable=True),
    StructField('Distance', DoubleType(), nullable=True),
    StructField('CarrierDelay', DoubleType(), nullable=True),
    StructField('WeatherDelay', DoubleType(), nullable=True),
    StructField('NASDelay', DoubleType(), nullable=True),
    StructField('SecurityDelay', DoubleType(), nullable=True),
    StructField('LateAircraftDelay', DoubleType(), nullable=True)
])

flts = spark.read \
    .format('com.databricks.spark.csv') \
    .csv('/home/william/Projects/flight-delay/data/201601.csv',
         schema=schema, nanValue='', header='true')

Here is the CSV I'm working with: http://pastebin.com/waahrgqB

The last row there is where it breaks and raises the java.lang.NumberFormatException: null

It seems that some numeric columns are empty strings, while others are just empty. Can someone please help me with this?

like image 765
wcwagner Avatar asked Sep 01 '25 02:09

wcwagner


2 Answers

Thanks to KiranM's suggestion, I was able to find a solution. I let Spark infer the schema(everything is set as a String), and then manually set the columns I want to be numeric.

Here is the code:

from pyspark.sql import (SQLContext,
                     SparkSession)

from pyspark.sql.types import (StructType,
                           StructField,
                           DoubleType,
                           IntegerType,
                           StringType)

spark = SparkSession.builder \
    .master('local') \
    .appName('Flight Delay') \
    .getOrCreate()


flts = spark.read \
    .format('com.databricks.spark.csv') \
    .csv('/home/william/Projects/flight-delay/data/merged/2016.csv',
         inferSchema='true', nanValue="", header='true', mode='PERMISSIVE')


flts = flts \
    .withColumn('Year', flts['Year'].cast('int')) \
    .withColumn('Month', flts['Month'].cast('int')) \
    .withColumn('Day', flts['Day'].cast('int')) \
    .withColumn('Dow', flts['Dow'].cast('int')) \
    .withColumn('CRSDepTime', flts['CRSDepTime'].cast('int')) \
    .withColumn('DepTime', flts['DepTime'].cast('int')) \
    .withColumn('DepDelay', flts['DepDelay'].cast('int')) \
    .withColumn('TaxiOut', flts['TaxiOut'].cast('int')) \
    .withColumn('TaxiIn', flts['TaxiIn'].cast('int')) \
    .withColumn('CRSArrTime', flts['CRSArrTime'].cast('int')) \
    .withColumn('ArrTime', flts['ArrTime'].cast('int')) \
    .withColumn('ArrDelay', flts['ArrDelay'].cast('int')) \
    .withColumn('Cancelled', flts['Cancelled'].cast('int')) \
    .withColumn('Diverted', flts['Diverted'].cast('int')) \
    .withColumn('CRSElapsedTime', flts['CRSElapsedTime'].cast('int')) \
    .withColumn('ActualElapsedTime', flts['ActualElapsedTime'].cast('int')) \
    .withColumn('AirTime', flts['AirTime'].cast('int')) \
    .withColumn('Distance', flts['Distance'].cast('int')) \
    .withColumn('CarrierDelay', flts['CarrierDelay'].cast('int')) \
    .withColumn('WeatherDelay', flts['WeatherDelay'].cast('int')) \
    .withColumn('NASDelay', flts['NASDelay'].cast('int')) \
    .withColumn('SecurityDelay', flts['SecurityDelay'].cast('int')) \
    .withColumn('LateAircraftDelay ', flts['LateAircraftDelay '].cast('int'))

Maybe I could put that into a loop, but I'm going to run with this for now.

like image 132
wcwagner Avatar answered Sep 02 '25 15:09

wcwagner


The issue is with the numeric type column having an empty string (with "" instead of blank data).

Then one option is to read the data as StringType column, then convert that column type to your relevant type (ex: int). So that it wouldn't impact other column data.

StructField('CRSDepTime', StringType(), nullable=True),


flts.withColumn('CRSDepTime', flts['CRSDepTime'].cast("int")) \
    .printSchema()

This should solve your problem.

like image 39
KiranM Avatar answered Sep 02 '25 15:09

KiranM