
How to save a DataFrame as compressed (gzipped) CSV?

I use Spark 1.6.0 and Scala.

I want to save a DataFrame as compressed CSV format.

Here is what I have so far (assume I already have df and sc as a SparkContext):

// set the conf to the codec I want
sc.getConf.set("spark.hadoop.mapred.output.compress", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
sc.getConf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")

df.write
  .format("com.databricks.spark.csv")
  .save(my_directory)

The output is not in gz format.

asked Sep 02 '25 by user2628641

2 Answers

This code works for Spark 2.1, where .codec is not available.

df.write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(my_directory)

For Spark 2.2+, you can use the compression parameter of the PySpark csv() writer instead, i.e. df.write.csv(..., compression="gzip"), described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec

answered Sep 05 '25 by Ravi Kant Saini


With Spark 2.0+, this has become a bit simpler:

df.write.csv("path", compression="gzip")  # Python-only
df.write.option("compression", "gzip").csv("path") // Scala or Python

You don't need the external Databricks CSV package anymore.

The csv() writer supports a number of handy options. For example:

  • sep: Sets the separator (delimiter) character; the default is a comma.
  • quote: Sets the character used to quote values; the default is the double quote.
  • header: Whether to write a header line with the column names.
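
As a minimal sketch of combining these options with gzip compression (assuming an existing DataFrame df; the output path is illustrative):

```scala
// Write df as gzip-compressed CSV with a header and a custom separator.
// Spark produces a directory of part files, e.g. part-00000-*.csv.gz.
df.write
  .option("header", "true")        // emit a header line
  .option("sep", "|")              // use | instead of the default comma
  .option("compression", "gzip")   // gzip each part file
  .csv("/tmp/df_out")
```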

There are also a number of other compression codecs you can use, in addition to gzip:

  • bzip2
  • lz4
  • snappy
  • deflate
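
Whichever codec you pick, Spark writes a directory of part files (e.g. part-00000-*.csv.gz), and each gzip part file is an ordinary gzip stream that standard tools can read. A plain-Python sketch (no Spark required; the data is illustrative) showing that such a file is just gzip-wrapped CSV:

```python
import csv
import gzip
import io

# Write a small gzip-compressed CSV -- the same on-disk format as a
# gzip part file produced by Spark's csv() writer.
rows = [["id", "name"], ["1", "alice"], ["2", "bob"]]
buf = io.BytesIO()
with gzip.open(buf, "wt", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back with the standard gzip module.
buf.seek(0)
with gzip.open(buf, "rt", newline="") as f:
    readback = list(csv.reader(f))

assert readback == rows
```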

The full Spark docs for the csv() writer are here: Python / Scala

answered Sep 05 '25 by Nick Chammas