
How to join/merge a list of dataframes with common keys in PySpark?

df1
     uid1  var1
0    John     3
1    Paul     4
2  George     5

df2
     uid1  var2
0    John    23
1    Paul    44
2  George    52

df3
     uid1  var3
0    John    31
1    Paul    45
2  George    53

df_lst = [df1, df2, df3]

How do I merge/join the 3 dataframes in the list based on common key uid1 ?

Edit: Expected output

df1
     uid1  var1  var2  var3
0    John     3    23    31
1    Paul     4    44    45
2  George     5    52    53
asked Jun 13 '17 by GeorgeOfTheRF

People also ask

How do I join two DataFrames in PySpark with the same column names?

Join syntax: the join function takes up to three parameters; the first is mandatory and the other two are optional. The first parameter specifies the other dataframe, i.e. the right side of the join. The second parameter can be a plain string naming the join column when both dataframes share a column with that name.
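
A minimal sketch of that call shape, assuming two toy dataframes that share a uid1 column (the frames and values here are illustrative, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_left = spark.createDataFrame([("John", 3), ("Paul", 4)], ["uid1", "var1"])
df_right = spark.createDataFrame([("John", 23), ("Paul", 44)], ["uid1", "var2"])

# Passing a string column name keeps a single uid1 column in the result;
# the third parameter (join type) defaults to an inner join when omitted.
df_left.join(df_right, "uid1", "inner").show()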

How do I merge two DataFrames in pandas based on common column?

To merge two pandas DataFrames on a common column, use the merge() function and set the on parameter to the column name.
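
For example, a small sketch with made-up frames (the data is only for illustration):

import pandas as pd

df_a = pd.DataFrame({"uid1": ["John", "Paul"], "var1": [3, 4]})
df_b = pd.DataFrame({"uid1": ["John", "Paul"], "var2": [23, 44]})

# on= names the shared column; the default how="inner" keeps matching rows.
print(df_a.merge(df_b, on="uid1"))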

How do I join PySpark DataFrames on multiple columns?

PySpark's join() takes the right dataset as its first argument, with joinExprs and joinType as the second and third arguments; joinExprs supplies the join condition across multiple columns. Note that both joinExprs and joinType are optional.
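
A sketch of a multi-column condition, assuming both frames carry uid1 plus a second, hypothetical key column dept (invented here for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([("John", "sales", 3)], ["uid1", "dept", "var1"])
right = spark.createDataFrame([("John", "sales", 23)], ["uid1", "dept", "var2"])

# joinExprs combines per-column equalities with &; joinType is passed explicitly.
cond = (left["uid1"] == right["uid1"]) & (left["dept"] == right["dept"])
left.join(right, cond, "inner").show()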


2 Answers

You can join a list of dataframes by folding over it. Below is a simple example in Scala:

import spark.implicits._

val df1 = spark.sparkContext.parallelize(Seq(
  (0, "John", 3),
  (1, "Paul", 4),
  (2, "George", 5)
)).toDF("id", "uid1", "var1")

val df2 = spark.sparkContext.parallelize(Seq(
  (0, "John", 23),
  (1, "Paul", 44),
  (2, "George", 52)
)).toDF("id", "uid1", "var2")

val df3 = spark.sparkContext.parallelize(Seq(
  (0, "John", 31),
  (1, "Paul", 45),
  (2, "George", 53)
)).toDF("id", "uid1", "var3")

val dfList = List(df1, df2, df3)

// Fold the list, joining each dataframe on the shared keys.
dfList.reduce((a, b) => a.join(b, Seq("id", "uid1")))
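
Here reduce folds the list from left to right: the first two dataframes are joined, then each remaining dataframe is joined onto the accumulated result on the shared id and uid1 keys.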

Output:

+---+------+----+----+----+
| id|  uid1|var1|var2|var3|
+---+------+----+----+----+
|  1|  Paul|   4|  44|  45|
|  2|George|   5|  52|  53|
|  0|  John|   3|  23|  31|
+---+------+----+----+----+

Hope this helps!

answered Sep 18 '22 by koiralo


Let me suggest a Python answer:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Stop any already-running context before creating a fresh one.
if SparkContext._active_spark_context is not None:
    SparkContext._active_spark_context.stop()
sc = SparkContext()
sqlcontext = SQLContext(sc)

import pyspark.sql.types as t

# Build three single-partition RDDs of (name, value) pairs, one per variable column.
rdd_list = [sc.parallelize([('John', i + 1), ('Paul', i + 2), ('George', i + 3)], 1)
            for i in [100, 200, 300]]

df_list = []
for i, r in enumerate(rdd_list):
    schema = t.StructType().add('uid1', t.StringType()) \
                           .add('var{}'.format(i + 1), t.IntegerType())
    df_list.append(sqlcontext.createDataFrame(r, schema))
    df_list[-1].show()
+------+----+
|  uid1|var1|
+------+----+
|  John| 101|
|  Paul| 102|
|George| 103|
+------+----+

+------+----+
|  uid1|var2|
+------+----+
|  John| 201|
|  Paul| 202|
|George| 203|
+------+----+

+------+----+
|  uid1|var3|
+------+----+
|  John| 301|
|  Paul| 302|
|George| 303|
+------+----+
# Chain inner joins on uid1 across the list.
df_res = df_list[0]
for df_next in df_list[1:]:
    df_res = df_res.join(df_next, on='uid1', how='inner')
df_res.show()
+------+----+----+----+
|  uid1|var1|var2|var3|
+------+----+----+----+
|  John| 101| 201| 301|
|  Paul| 102| 202| 302|
|George| 103| 203| 303|
+------+----+----+----+

One more option:

from functools import reduce  # required on Python 3

def join_red(left, right):
    return left.join(right, on='uid1', how='inner')

res = reduce(join_red, df_list)
res.show()
+------+----+----+----+
|  uid1|var1|var2|var3|
+------+----+----+----+
|  John| 101| 201| 301|
|  Paul| 102| 202| 302|
|George| 103| 203| 303|
+------+----+----+----+
answered Sep 19 '22 by Fedo