 

Read parquet file having mixed data type in a column

I want to read a Parquet file using Spark SQL in which one column has a mixed data type (string and integer).

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sparkContext)
val df = sqlContext.read.parquet("/tmp/data")

This throws an exception: Failed to merge incompatible data types IntegerType and StringType

Is there a way to explicitly cast the column during the read?

asked Dec 07 '25 by Phagun Baya

1 Answer

The only way that I have found is to manually cast one of the fields so that the schemas match. You can do this by reading the individual Parquet files into a sequence of DataFrames and reconciling them iteratively, like so:

import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def unionReduce(dfs: Seq[DataFrame]): DataFrame = {
  dfs.reduce { (x, y) =>
    // Pair each column name with its data type so the schemas can be compared as sets.
    def schemaTruncate(df: DataFrame) = df.schema.map(field => field.name -> field.dataType)
    // Columns (by name and type) present in y but not in x.
    val diff = schemaTruncate(y).toSet.diff(schemaTruncate(x).toSet)
    val fixedX = diff.foldLeft(x) { case (df, (name, dataType)) =>
      // Try to cast an existing column to the expected type;
      // if the column is missing, add it as a typed null instead.
      Try(df.withColumn(name, col(name).cast(dataType))) match {
        case Success(newDf) => newDf
        case Failure(_)     => df.withColumn(name, lit(null).cast(dataType))
      }
    }
    // Keep only y's columns (in y's order) so the union lines up.
    fixedX.select(y.columns.map(col): _*).unionAll(y)
  }
}

The function above first finds the columns that are in Y but not in X (differing by name or type). It then adds those columns to X, attempting to cast any existing column to the expected type and falling back to a typed null literal when the cast fails. Finally, it selects only Y's columns from the fixed X, in case X has columns that Y lacks, and returns the result of the union.
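
As a rough usage sketch, assuming sqlContext is the SQLContext from the question and that the mixed-type part files sit directly under /tmp/data (the part-file names below are hypothetical), you would read each file separately so Spark never tries to merge the schemas itself, then reconcile them with unionReduce:

// Hypothetical part-file paths under the directory from the question.
val paths = Seq("/tmp/data/part-00000.parquet", "/tmp/data/part-00001.parquet")

// Read each part file on its own to avoid the schema-merge failure.
val dfs = paths.map(path => sqlContext.read.parquet(path))

// Reconcile the conflicting column types and union everything together.
val merged = unionReduce(dfs)
merged.printSchema()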

answered Dec 09 '25 by dkmonet


