
Spark fails to merge parquet files (INTEGER -> DECIMAL)

I've got two Parquet files.

The first one contains the following column: DECIMAL: decimal(38,18) (nullable = true)

The second one has the same column, but with a different type: DECIMAL: integer (nullable = true)

I want to merge them, but I can't simply read them separately and cast the specific column, because this is part of an app that receives many distinct Parquet schemas. I need something that covers every scenario.

I am reading both like this:

df = spark.read.format("parquet").load(['path_to_file_one', 'path_to_file_2'])

It fails with the error below when I try to display the data:

Parquet column cannot be converted. Column: [DECIMAL], Expected: DecimalType(38,18), Found: INT32
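
For reference, reading each file on its own shows the two conflicting schemas. A minimal sketch, assuming an active SparkSession named spark and the same placeholder paths as above:

# Inspect each file's schema individually to confirm the type conflict.
df1 = spark.read.parquet("path_to_file_one")
df2 = spark.read.parquet("path_to_file_2")

df1.printSchema()  # DECIMAL: decimal(38,18) (nullable = true)
df2.printSchema()  # DECIMAL: integer (nullable = true)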

I am using Azure Databricks with the following configs:

  • DBR: 7.1
  • Spark 3.0.0

I have uploaded the parquet files here: https://easyupload.io/m/su37e8

Is there any way I can force Spark to automatically cast null columns to the type of the same column in the other dataframe?

It should be easy; all the columns are nullable...

asked Feb 04 '26 by Flavio Pegas


1 Answer

This is expected: the schema used for the read defines the column's datatype as decimal(38,18), while one of the files physically stores the column as an integer, and Spark's Parquet reader will not up-cast it during the scan.


We found that this is a limitation in Spark itself for columns with the decimal(38,18) datatype.

Try df.show() to display the results.

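As a generic workaround, one option is to read the files one at a time, pick a target type per column (preferring decimal over other types when files disagree), cast each dataframe to those types, and union the results. This is only a sketch under a few assumptions: all files expose the same column names, the widening rule below is a hypothetical policy rather than anything Spark provides, and spark is an active SparkSession (as on Databricks):

from functools import reduce
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

paths = ["path_to_file_one", "path_to_file_2"]  # placeholders from the question

# Read each file with its own schema; done separately, no conversion error occurs.
dfs = [spark.read.parquet(p) for p in paths]

# Pick a target type per column name: prefer DecimalType over any other
# type when the files disagree (hypothetical widening rule).
target = {}
for df in dfs:
    for field in df.schema.fields:
        if field.name not in target or isinstance(field.dataType, DecimalType):
            target[field.name] = field.dataType

# Cast every dataframe to the target types, then union by column name
# (assumes all files share the same set of columns).
def align(df):
    return df.select([F.col(c).cast(target[c]).alias(c) for c in df.columns])

merged = reduce(lambda a, b: a.unionByName(b), [align(df) for df in dfs])
merged.show()

Because every dataframe is cast to an explicit common schema before the union, the integer column from the second file arrives as decimal(38,18) and the Parquet conversion error never surfaces.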

answered Feb 06 '26 by CHEEKATLAPRADEEP-MSFT