Spark CSV package not able to handle \n within fields

Question

I have a CSV file that I am trying to load using the Spark CSV package, but the data does not load properly because a few of the fields contain \n within them, e.g. the following two rows:

"XYZ", "Test Data", "TestNew
line", "OtherData" 
"XYZ", "Test Data", "blablablabla

blablablablablalbal", "OtherData"

I am using the following code, which is straightforward. I set parserLib to univocity because I read online that it solves the multiline problem, but that does not seem to be the case for me.

    SQLContext sqlContext = new SQLContext(sc);
    DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .option("parserLib", "univocity")
        .load("data.csv");

How do I replace newlines within fields that start with quotes? Is there an easier way?

Tegan Snyder · Accepted Answer

Spark 2.2 and later provide an option to account for line breaks in CSV files. It was originally discussed under the name wholeFile, but it was renamed multiLine prior to release.

Here is an example of loading a CSV file into a DataFrame with that option:

    val webtrends_data = (sparkSession.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("multiLine", true)
        .option("delimiter", ",")
        .format("csv")
        .load("hdfs://hadoop-master:9000/datasource/myfile.csv"))
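The question is tagged apache-spark-1.6, where multiLine is not available. One possible workaround there is to clean the file before loading it, replacing line breaks that fall inside quoted fields. The following is only a minimal sketch in plain Java, not part of Spark or spark-csv: the class name QuotedNewlineFlattener and the method flatten are hypothetical, and it assumes fields are delimited by double quotes, that embedded quotes are escaped by doubling them (""), and that the text fits in memory.

```java
final class QuotedNewlineFlattener {

    // Replaces every line break inside a double-quoted field with `replacement`.
    // Line breaks outside quotes are record separators and are kept unchanged.
    static String flatten(String csv, String replacement) {
        StringBuilder out = new StringBuilder(csv.length());
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles the flag twice, so the state is unchanged.
                inQuotes = !inQuotes;
                out.append(c);
            } else if (inQuotes && (c == '\n' || c == '\r')) {
                // Treat \r\n as a single line break.
                if (c == '\r' && i + 1 < csv.length() && csv.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The two sample rows from the question.
        String raw = "\"XYZ\", \"Test Data\", \"TestNew\nline\", \"OtherData\"\n"
                   + "\"XYZ\", \"Test Data\", \"blablablabla\n\nblablablablablalbal\", \"OtherData\"\n";
        System.out.print(flatten(raw, " "));
    }
}
```

After this step each record occupies a single physical line, so the cleaned text could be written back out and loaded with the reader from the question. Backslash-escaped quotes are not handled by this sketch.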


Tags:

scala

apache-spark

apache-spark-sql

spark-csv

apache-spark-1.6

Umesh K
