
How to read a CSV with the second line as header into a PySpark DataFrame

I am trying to load a CSV file and use its second line as the header. How can I achieve this? Thanks.

file_location = "/mnt/test/raw/data.csv"
file_type = "csv"    

infer_schema = "true"
delimiter = ","

data = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", "false") \
  .option("sep", delimiter) \
  .load(file_location)
Lilly asked Dec 12 '25 19:12



1 Answer

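First read the file as an RDD of lines, drop the first line, and then pass the RDD to spark.read.csv() with header=True, so that the original second line is used as the header: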

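# Read the file as an RDD of lines (the method is textFile, with a lowercase t)
data = sc.textFile('/mnt/test/raw/data.csv')
firstRow = data.first()
# Drop the first line (this also drops any later line identical to it)
data = data.filter(lambda row: row != firstRow)
# The original second line now comes first, so header=True picks it up
df = spark.read.csv(data, header=True)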
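
Because the filter compares every line with the first one, any later line with identical text is dropped as well. A possible variant, shown here only as a sketch, skips the first line by position instead. It assumes the input is a single file, so that its first line lands in partition 0, and a Spark version in which spark.read.csv() accepts an RDD of strings. The function names skip_first_line and read_with_second_line_header are hypothetical and used only for illustration.

```python
from itertools import islice


def skip_first_line(partition_index, lines):
    """Drop the first line of partition 0; pass other partitions through.

    Assumption: the input is a single file, so the file's first line
    is the first element of partition 0.
    """
    if partition_index == 0:
        return islice(lines, 1, None)
    return lines


def read_with_second_line_header(spark, path):
    """Hypothetical helper: load a CSV whose header is on the second line.

    `spark` is an existing SparkSession and `path` points to a single file.
    """
    # Read the raw lines and remove only the very first one, by position.
    rdd = spark.sparkContext.textFile(path).mapPartitionsWithIndex(skip_first_line)
    # The original second line is now first, so header=True uses it.
    return spark.read.csv(rdd, header=True, inferSchema=True)
```

With an active session this could be called as df = read_with_second_line_header(spark, "/mnt/test/raw/data.csv"). Rows that happen to repeat the text of the first line are kept, since only the line's position decides what is skipped.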

For a reference of the DataFrame functions, use the link below. It serves as the go-to reference for the DataFrame operations you are likely to need. For a specific version of Spark, replace "latest" in the URL with the version you want:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

Girish Iyer answered Dec 15 '25 06:12
