Spark saveAsTextFile to Azure Blob creates a blob instead of a text file

Question

I am trying to save an RDD to a text file. My Spark instance runs on Linux and is connected to Azure Blob storage:

val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")

// Find the rows that have only one digit in the 7th column of the CSV
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)

rdd1.saveAsTextFile("wasb:///HVACOut")

When I look at the output, it is not a single text file but a series of application/octet-stream files in a folder called HVACOut.

How can I output it as a single text file instead?

lockwobr · Accepted Answer

Well, I am not sure you can get just one file without a directory. If you do

rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")

you will get one file inside a directory called "HVACOut"; the file should look something like part-00000. This is because your RDD is distributed across your cluster in what are called partitions. When you call save (any of the save functions), Spark writes one file per partition. So by calling coalesce(1), you are telling Spark you want one partition.
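If you really need a single file at a path of your choosing, one possible approach is to write the one-partition output and then move the part file with the Hadoop FileSystem API. This is only a sketch: it assumes a spark-shell session where sc and rdd1 from the question exist, that the part file is named part-00000, and the target path wasb:///HVACOut.csv is a made-up name for illustration.

```scala
import org.apache.hadoop.fs.Path

// Assumes a spark-shell session where `sc` and `rdd1` from the question exist.
// Shows how many part files a plain saveAsTextFile would produce.
println(s"partitions before coalesce: ${rdd1.getNumPartitions}")

// Hypothetical paths, chosen only for this illustration.
val outDir = new Path("wasb:///HVACOut")
val singleFile = new Path("wasb:///HVACOut.csv")

// One partition means one part file inside outDir.
rdd1.coalesce(1).saveAsTextFile(outDir.toString)

// Resolve the file system behind the wasb:// URI from Spark's Hadoop configuration.
val fs = outDir.getFileSystem(sc.hadoopConfiguration)

// Move the lone part file to the target name, then drop the leftover directory.
// Assumption: the part file is called part-00000.
if (fs.rename(new Path(outDir, "part-00000"), singleFile)) {
  fs.delete(outDir, true) // true = delete recursively
}
```

Keep in mind that coalesce(1) funnels all the data through a single task, so this is probably only reasonable when the filtered output is small.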

Hope this helps.

Tags:

scala

apache-spark

azure

azure-blob-storage

azure-hdinsight

Mark

1 Answers

lockwobr
