
Spark fails to merge parquet files (INTEGER -> DECIMAL)

I've got two Parquet files.

The first one contains the following column: DECIMAL: decimal(38,18) (nullable = true)

The second one has the same column, but with a different type: DECIMAL: integer (nullable = true)

I want to merge them, but I can't simply read them separately and cast the specific column, because this is part of an app that receives many distinct Parquet schemas. I need something that covers every scenario.

I am reading both like this:

df = spark.read.format("parquet").load(['path_to_file_one', 'path_to_file_2'])

It fails with the error below when I try to display the data:

Parquet column cannot be converted. Column: [DECIMAL], Expected: DecimalType(38,18), Found: INT32
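
For reference, reading each file on its own shows the two conflicting schemas. A minimal sketch, assuming an active SparkSession named spark and the same placeholder paths as above:

# Inspect each file's schema individually to confirm the type conflict.
df1 = spark.read.parquet("path_to_file_one")
df2 = spark.read.parquet("path_to_file_2")

df1.printSchema()  # DECIMAL: decimal(38,18) (nullable = true)
df2.printSchema()  # DECIMAL: integer (nullable = true)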

I am using Azure Databricks with the following configs:

  • DBR: 7.1
  • Spark 3.0.0

I have uploaded the parquet files here: https://easyupload.io/m/su37e8

Is there any way I can force Spark to automatically cast null columns to the type of the same column in the other dataframe?

It should be easy; all the columns are nullable...

asked Feb 04 '26 by Flavio Pegas


1 Answer

This is expected: the schema used for the read defines the column's datatype as decimal(38,18), while one of the files physically stores the column as an integer, and Spark's Parquet reader will not up-cast it during the scan.


We found that this is a limitation in Spark itself for columns with the decimal(38,18) datatype.

Try df.show() to display the results.

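As a generic workaround, one option is to read the files one at a time, pick a target type per column (preferring decimal over other types when files disagree), cast each dataframe to those types, and union the results. This is only a sketch under a few assumptions: all files expose the same column names, the widening rule below is a hypothetical policy rather than anything Spark provides, and spark is an active SparkSession (as on Databricks):

from functools import reduce
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

paths = ["path_to_file_one", "path_to_file_2"]  # placeholders from the question

# Read each file with its own schema; done separately, no conversion error occurs.
dfs = [spark.read.parquet(p) for p in paths]

# Pick a target type per column name: prefer DecimalType over any other
# type when the files disagree (hypothetical widening rule).
target = {}
for df in dfs:
    for field in df.schema.fields:
        if field.name not in target or isinstance(field.dataType, DecimalType):
            target[field.name] = field.dataType

# Cast every dataframe to the target types, then union by column name
# (assumes all files share the same set of columns).
def align(df):
    return df.select([F.col(c).cast(target[c]).alias(c) for c in df.columns])

merged = reduce(lambda a, b: a.unionByName(b), [align(df) for df in dfs])
merged.show()

Because every dataframe is cast to an explicit common schema before the union, the integer column from the second file arrives as decimal(38,18) and the Parquet conversion error never surfaces.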

answered Feb 06 '26 by CHEEKATLAPRADEEP-MSFT