I have two tab-separated data files like the ones below:
file 1:
number type data_present
1 a yes
2 b no
file 2:
type group number recorded
d aa 10 true
c cc 20 false
I want to merge these two files so that the output file looks like this:
number type data_present group recorded
1 a yes NULL NULL
2 b no NULL NULL
10 d NULL aa true
20 c NULL cc false
As you can see, for columns that are not present in the other file, I'm filling those places with NULL.
Any ideas on how to do this in Scala/Spark?
Create two files for your data set:
$ cat file1.csv
number type data_present
1 a yes
2 b no
$ cat file2.csv
type group number recorded
d aa 10 true
c cc 20 false
Convert them to CSV:
$ sed -e 's/^[ \t]*//' file1.csv | tr -s ' ' | tr ' ' ',' > f1.csv
$ sed -e 's/^[ \t]*//' file2.csv | tr -s ' ' | tr ' ' ',' > f2.csv
Use the spark-csv module to load the CSV files as DataFrames:
$ spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df1 = sqlContext.load("com.databricks.spark.csv", Map("path" -> "f1.csv", "header" -> "true"))
val df2 = sqlContext.load("com.databricks.spark.csv", Map("path" -> "f2.csv", "header" -> "true"))
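If you would rather skip the space-to-comma conversion step, spark-csv also accepts a delimiter option, so (as far as I recall) the original tab-separated files can be loaded directly. A minimal sketch, assuming the raw files are saved as file1.tsv and file2.tsv (hypothetical names):
// Sketch: load the tab-separated originals directly via spark-csv's "delimiter" option
val df1 = sqlContext.load("com.databricks.spark.csv",
  Map("path" -> "file1.tsv", "header" -> "true", "delimiter" -> "\t"))
val df2 = sqlContext.load("com.databricks.spark.csv",
  Map("path" -> "file2.tsv", "header" -> "true", "delimiter" -> "\t"))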
Now perform an outer join on number and type:
scala> df1.join(df2, df1("number") <=> df2("number") && df1("type") <=> df2("type"), "outer").show()
+------+----+------------+----+-----+------+--------+
|number|type|data_present|type|group|number|recorded|
+------+----+------------+----+-----+------+--------+
| 1| a| yes|null| null| null| null|
| 2| b| no|null| null| null| null|
| null|null| null| d| aa| 10| true|
| null|null| null| c| cc| 20| false|
+------+----+------------+----+-----+------+--------+
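Note that this outer join keeps the duplicated number and type columns from both sides rather than merging them as in the desired output. One way to collapse them (a sketch, not part of the original answer) is to coalesce the two sides after the join:
import org.apache.spark.sql.functions.coalesce

// Sketch: merge the duplicated join columns, then keep the remaining columns
val joined = df1.join(df2, df1("number") <=> df2("number") && df1("type") <=> df2("type"), "outer")
val merged = joined.select(
  coalesce(df1("number"), df2("number")).as("number"),
  coalesce(df1("type"), df2("type")).as("type"),
  df1("data_present"),
  df2("group"),
  df2("recorded"))
merged.show()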
This will give you the desired output; joining on Seq("number", "type") merges the join columns into single number and type columns:
val output = file1.join(file2, Seq("number","type"), "outer")
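Applied to the DataFrames loaded above (df1 and df2), a minimal sketch of the same idea:
// Sketch: outer join using the column names directly; Spark merges the join
// columns, so the result has a single number and a single type column, and
// the columns missing on one side are filled with null.
val output = df1.join(df2, Seq("number", "type"), "outer")
output.show()
// Resulting columns: number, type, data_present, group, recorded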