 

Merge two DataFrames with a few different columns

I want to merge several DataFrames that have a few different columns. Suppose:

  • DataFrame A has 3 columns: Column_1, Column_2, Column_3

  • DataFrame B has 3 columns: Column_1, Column_2, Column_4

  • DataFrame C has 3 columns: Column_1, Column_2, Column_5

I want to merge these DataFrames into a single DataFrame like:

Column_1, Column_2, Column_3, Column_4, Column_5

The number of DataFrames may increase. Is there any way to get this merge, such that for a particular Column_1/Column_2 combination I get the values of the other three columns in the same row, and if a particular Column_1/Column_2 combination has no data in some columns, it shows null there?

DataFrame A:

Column_1 Column_2 Column_3
   1        x        abc
   2        y        def

DataFrame B:

Column_1 Column_2 Column_4
   1        x        xyz
   2        y        www
   3        z        sdf

The merge of A and B :

Column_1 Column_2 Column_3 Column_4
   1        x        abc     xyz
   2        y        def     www
   3        z        null    sdf
Abhishek Tripathi asked Dec 22 '25 16:12


1 Answer

If I understand your question correctly, you'll need to perform an outer join using a sequence of columns as keys.

I have used the data provided in your question to illustrate how it is done with an example:

scala> val df1 = Seq((1,"x","abc"),(2,"y","def")).toDF("Column_1","Column_2","Column_3")
// df1: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_3: string]

scala> val df2 = Seq((1,"x","xyz"),(2,"y","www"),(3,"z","sdf")).toDF("Column_1","Column_2","Column_4")
// df2: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_4: string]

scala> val df3 = df1.join(df2, Seq("Column_1","Column_2"), "outer")
// df3: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_3: string, Column_4: string]

scala> df3.show
// +--------+--------+--------+--------+                                           
// |Column_1|Column_2|Column_3|Column_4|
// +--------+--------+--------+--------+
// |       1|       x|     abc|     xyz|
// |       2|       y|     def|     www|
// |       3|       z|    null|     sdf|
// +--------+--------+--------+--------+

This is called an equi-join with another DataFrame using the given columns.

It differs from other join functions in that the join columns appear only once in the output, i.e. similar to SQL's JOIN USING syntax.
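For comparison, here is a sketch of the same full outer join written with Spark SQL's USING syntax, assuming the `df1`, `df2`, and `spark` values from the example above (the view names `a` and `b` are arbitrary):

```scala
// Register the DataFrames as temporary views so they can be queried with SQL
df1.createOrReplaceTempView("a")
df2.createOrReplaceTempView("b")

// FULL OUTER JOIN ... USING collapses the key columns into a single set,
// just like the DataFrame API call above
val joined = spark.sql(
  "SELECT * FROM a FULL OUTER JOIN b USING (Column_1, Column_2)")
```

`joined` has the same four columns as `df3` above, with null for missing combinations.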

Note

Outer equi-joins are available since Spark 1.6.
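Since you mention the number of DataFrames may grow, the pairwise join generalizes by folding over a list of DataFrames; a minimal sketch (the helper name `mergeAll` is my own):

```scala
import org.apache.spark.sql.DataFrame

// Outer-join an arbitrary number of DataFrames on the shared key columns.
// Each fold step behaves exactly like the two-DataFrame join above:
// key combinations missing from one side come out as null.
def mergeAll(dfs: Seq[DataFrame], keys: Seq[String]): DataFrame =
  dfs.reduce((left, right) => left.join(right, keys, "outer"))

// Usage: mergeAll(Seq(dfA, dfB, dfC), Seq("Column_1", "Column_2"))
```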

eliasah answered Dec 24 '25 10:12

