 

Sparklyr - How to change the parquet data types

Is there a way to change the data types of columns when reading parquet files? I'm using the spark_read_parquet function from sparklyr, but it doesn't have the columns option (from spark_read_csv) for doing so.

In csv files, I would do something like:

data_tbl <- spark_read_csv(sc, "data", path, infer_schema = FALSE, columns = list_with_data_types)

How could I do something similar with parquet files?
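For reference, a spec like list_with_data_types in the csv call above is just a named list mapping column names to Spark type strings. The names and types below are hypothetical placeholders, not from an actual dataset:

```r
library(sparklyr)

# Hypothetical column specification for spark_read_csv;
# column names and types are illustrative only
list_with_data_types <- list(
  id    = "integer",
  name  = "character",
  value = "double"
)

data_tbl <- spark_read_csv(sc, "data", path,
                           infer_schema = FALSE,
                           columns = list_with_data_types)
```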

Asked by Igor on Jan 27 '26 01:01


1 Answer

Specifying data types only makes sense when reading a data format that does not carry built-in metadata about column types. This is the case for csv or fwf files, which at most have column names in the first row. That is why the read functions for those formats offer that option.

This sort of functionality does not make sense for data formats that store column types in their own metadata, such as Parquet (or .Rds files in R).

Thus, in this case, you should:

a) read the Parquet file into Spark
b) make the necessary data transformations
c) save the transformed data into a Parquet file, overwriting the previous file
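The steps above can be sketched with sparklyr and dplyr as follows. The paths and column names are hypothetical, and the casts shown (as.integer, as.numeric) stand in for whatever transformations your data needs:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# a) read the Parquet file into Spark
data_tbl <- spark_read_parquet(sc, "data", "path/to/data.parquet")

# b) make the necessary data transformations
#    (hypothetical column names; casts are translated to Spark SQL)
data_tbl <- data_tbl %>%
  mutate(id    = as.integer(id),
         value = as.numeric(value))

# c) save the transformed data back to Parquet
spark_write_parquet(data_tbl, "path/to/data_casted.parquet",
                    mode = "overwrite")
```

Note that because Spark reads lazily, writing over the exact path you are still reading from can fail; a safer pattern is to write to a new path (as above) and then replace the original file.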

Answered by LucasMation on Jan 29 '26 18:01


