
Spark CSV package not able to handle \n within fields

I have a CSV file that I am trying to load with the Spark CSV package, but the data does not load properly because a few of the fields contain \n. For example, the following two rows:

"XYZ", "Test Data", "TestNew\nline", "OtherData" 
"XYZ", "Test Data", "blablablabla
\nblablablablablalbal", "OtherData" 

I am using the following code, which is straightforward. I set parserLib to univocity because I read online that it solves the multi-line field problem, but that does not seem to be the case for me.

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("parserLib", "univocity")
    .load("data.csv");

How do I replace newlines inside quoted fields? Is there an easier way?

Umesh K asked Oct 24 '25 21:10
1 Answer

Spark 2.2 and later have an option for the built-in CSV reader that handles line breaks inside quoted fields. It was originally discussed under the name wholeFile, but it was renamed to multiLine before release.

Here is an example of loading a CSV file into a DataFrame with that option:

val webtrends_data = (sparkSession.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", true)
  .option("delimiter", ",")
  .load("hdfs://hadoop-master:9000/datasource/myfile.csv"))
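
If you are stuck on an older Spark version with the com.databricks.spark.csv package from the question, one possible workaround is to clean the file before loading it. The following is only a minimal sketch and is not part of Spark or spark-csv: the class name QuotedNewlineCleaner and the method flattenQuotedNewlines are hypothetical. It assumes that fields are quoted with double quotes, that embedded quotes are escaped by doubling them (""), and that the file content fits in memory as a single string.

```java
public class QuotedNewlineCleaner {

    // Hypothetical helper: replaces every line break that occurs inside a
    // double-quoted field with `replacement`. Line breaks outside quotes
    // (the real record separators) are left untouched.
    public static String flattenQuotedNewlines(String csv, String replacement) {
        StringBuilder out = new StringBuilder(csv.length());
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles the flag twice, so the state
                // after it is unchanged. Backslash-escaped quotes are NOT handled.
                inQuotes = !inQuotes;
                out.append(c);
            } else if (inQuotes && (c == '\n' || c == '\r')) {
                // Treat \r\n as a single line break.
                if (c == '\r' && i + 1 < csv.length() && csv.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Second sample row from the question: a real line break followed by
        // the literal characters \n inside a quoted field.
        String raw = "\"XYZ\", \"Test Data\", \"blablablabla\n\\nblablablablablalbal\", \"OtherData\"\n";
        System.out.print(flattenQuotedNewlines(raw, " "));
    }
}
```

The cleaned text can then be written to a new file and loaded with the original spark-csv code. For very large files, the same quote-tracking loop would need to be applied to a stream rather than to one in-memory string.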
Tegan Snyder answered Oct 27 '25 10:10