Let's say I have a list of PySpark DataFrames: [df1, df2, ...]. I want to union them (so effectively do df1.union(df2).union(df3)...). What's the best practice to achieve that?
You can use functools.reduce to fold a union function over the list of DataFrames. Note that the example below uses unionByName, which matches columns by name rather than by position:
from functools import reduce
from pyspark.sql import DataFrame

list_of_sdf = [df1, df2, ...]
# Fold unionByName across the list; all frames must share the same column names
final_sdf = reduce(DataFrame.unionByName, list_of_sdf)
final_sdf now contains the rows of every DataFrame in the list.
When some DataFrames are missing columns, partially apply unionByName with allowMissingColumns=True (available since Spark 3.1):
from functools import reduce, partial
from pyspark.sql import DataFrame
# Union dataframes by name (missing columns filled with null)
union_by_name = partial(DataFrame.unionByName, allowMissingColumns=True)
df_output = reduce(union_by_name, [df1, df2, ...])
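For illustration, here is a minimal self-contained run of the same pattern; the toy frames df_a and df_b and their columns are made up for this example:

from functools import reduce, partial
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy frames with overlapping but unequal column sets
df_a = spark.createDataFrame([(1, "x")], ["id", "a"])
df_b = spark.createDataFrame([(2, "y")], ["id", "b"])

union_by_name = partial(DataFrame.unionByName, allowMissingColumns=True)
# Rows from both frames appear; the columns missing from either side are null-filled
reduce(union_by_name, [df_a, df_b]).show()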