 

Merge two DataFrames with a few different columns

I want to merge several DataFrames that have a few different columns. Suppose:

  • DataFrame A has 3 columns: Column_1, Column_2, Column_3

  • DataFrame B has 3 columns: Column_1, Column_2, Column_4

  • DataFrame C has 3 columns: Column_1, Column_2, Column_5

I want to merge these DataFrames into a single DataFrame like:

Column_1, Column_2, Column_3, Column_4, Column_5

The number of DataFrames may increase. Is there any way to get this merge, such that for a particular Column_1/Column_2 combination I get the values of the other three columns in the same row, and if a particular Column_1/Column_2 combination has no data in some columns, it shows null there?

DataFrame A:

Column_1 Column_2 Column_3
   1        x        abc
   2        y        def

DataFrame B:

Column_1 Column_2 Column_4
   1        x        xyz
   2        y        www
   3        z        sdf

The merge of A and B :

Column_1 Column_2 Column_3 Column_4
   1        x        abc     xyz
   2        y        def     www
   3        z        null    sdf
Abhishek Tripathi asked Dec 22 '25 16:12


1 Answer

If I understand your question correctly, you'll need to perform an outer join using a sequence of columns as keys.

I have used the data provided in your question to illustrate how it is done with an example:

scala> val df1 = Seq((1,"x","abc"),(2,"y","def")).toDF("Column_1","Column_2","Column_3")
// df1: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_3: string]

scala> val df2 = Seq((1,"x","xyz"),(2,"y","www"),(3,"z","sdf")).toDF("Column_1","Column_2","Column_4")
// df2: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_4: string]

scala> val df3 = df1.join(df2, Seq("Column_1","Column_2"), "outer")
// df3: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_3: string, Column_4: string]

scala> df3.show
// +--------+--------+--------+--------+                                           
// |Column_1|Column_2|Column_3|Column_4|
// +--------+--------+--------+--------+
// |       1|       x|     abc|     xyz|
// |       2|       y|     def|     www|
// |       3|       z|    null|     sdf|
// +--------+--------+--------+--------+

This is called an equi-join with another DataFrame using the given columns.

It differs from other join functions in that the join columns appear only once in the output, i.e. similar to SQL's JOIN USING syntax.
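For comparison, here is a sketch of the same full outer join written with Spark SQL's USING syntax, assuming the `df1`, `df2`, and `spark` values from the example above (the view names `a` and `b` are arbitrary):

```scala
// Register the DataFrames as temporary views so they can be queried with SQL
df1.createOrReplaceTempView("a")
df2.createOrReplaceTempView("b")

// FULL OUTER JOIN ... USING collapses the key columns into a single set,
// just like the DataFrame API call above
val joined = spark.sql(
  "SELECT * FROM a FULL OUTER JOIN b USING (Column_1, Column_2)")
```

`joined` has the same four columns as `df3` above, with null for missing combinations.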

Note

Outer equi-joins are available since Spark 1.6.
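Since you mention the number of DataFrames may grow, the pairwise join generalizes by folding over a list of DataFrames; a minimal sketch (the helper name `mergeAll` is my own):

```scala
import org.apache.spark.sql.DataFrame

// Outer-join an arbitrary number of DataFrames on the shared key columns.
// Each fold step behaves exactly like the two-DataFrame join above:
// key combinations missing from one side come out as null.
def mergeAll(dfs: Seq[DataFrame], keys: Seq[String]): DataFrame =
  dfs.reduce((left, right) => left.join(right, keys, "outer"))

// Usage: mergeAll(Seq(dfA, dfB, dfC), Seq("Column_1", "Column_2"))
```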

eliasah answered Dec 24 '25 10:12

