Spark saveAsTextFile to Azure Blob creates a blob instead of a text file

I am trying to save an RDD to a text file. My instance of Spark is running on Linux and connected to Azure Blob

val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")

// find the rows whose 7th column in the CSV has only one digit
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)

rdd1.saveAsTextFile("wasb:///HVACOut")

When I look at the output, it is not a single text file but a series of application/octet-stream files in a folder called HVACOut.

How can I output it as a single text file instead?

asked by Mark
1 Answer

Well I am not sure you can get just one file without a directory. If you do

rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")

you will get one file inside a directory called "HVACOut"; the file will be named something like part-00000. This happens because your RDD is distributed across your cluster in what Spark calls partitions, and a call to save (any of the save functions) writes one file per partition. By calling coalesce(1) you are telling Spark you want a single partition.
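If you really need a single standalone file rather than a directory of part files, one untested sketch is to coalesce, save to a temporary path, and then move the lone part file with the Hadoop FileSystem API. This assumes the cluster's default file system is your WASB container, that the part file ends up named part-00000, and the output paths below are just placeholders:

import org.apache.hadoop.fs.{FileSystem, Path}

// write the single part file into a temporary directory first
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOutTmp")

// move the part file to a standalone blob name, then drop the temp directory
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("wasb:///HVACOutTmp/part-00000"), new Path("wasb:///HVAC.txt"))
fs.delete(new Path("wasb:///HVACOutTmp"), true)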

Hope this helps.

answered by lockwobr


