How to strip headers from all files in RDD, where RDD = sc.textFile("s3n://bucket/*.csv")?

I am trying to find the best way to do this, but every approach I can think of involves reading the headers from all files into an array and then filtering those headers out of the RDD.

Is there a simpler way?

NOTE: I am reading all CSV files from an S3 bucket, and each of those files has a different header.

3xCh1_23 asked Dec 05 '25 11:12

1 Answer

One option is to use Spark SQL with the spark-csv package, which can load CSV files with an option to keep the header line out of the data. Take a look: https://github.com/databricks/spark-csv

header: when set to true the first line of files will be used to name columns and will not be included in data. All types will be assumed string. Default value is false.
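A minimal sketch of two ways this could look in PySpark, not a definitive implementation. The first function uses the spark-csv reader with the `header` option described above; it assumes a `SQLContext` and that the spark-csv package is available to the job. Since the files in the question each have a different header, a single read may not suit all of them, so the second function is a swapped-in alternative that stays with plain RDDs: `wholeTextFiles` yields one `(path, content)` record per file, which makes "the first line of each file" easy to drop. The function names and the `s3n://bucket/a.csv` path are hypothetical.

```python
def load_csv_skipping_header(sql_context, path):
    # spark-csv reader: with header=true the first line names the columns
    # and is not included in the data. Assumes spark-csv is on the classpath.
    return (sql_context.read
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .load(path))


def drop_header(path_and_content):
    # Takes one (path, content) pair, the record shape produced by
    # SparkContext.wholeTextFiles, and returns every line after the first.
    _path, content = path_and_content
    return content.splitlines()[1:]


def strip_headers(sc, pattern):
    # One record per file, so each file's own header is removed regardless
    # of what the other files' headers look like.
    return sc.wholeTextFiles(pattern).flatMap(drop_header)


# Local check of the per-file logic, no Spark needed:
sample = ("s3n://bucket/a.csv", "id,name\n1,foo\n2,bar\n")
print(drop_header(sample))  # ['1,foo', '2,bar']
```

Note that `wholeTextFiles` loads each file's full content as a single record, so the second approach is better suited to many small files than to a few very large ones.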

Dan Osipov answered Dec 08 '25 12:12
