Let's say I have a list of PySpark DataFrames: [df1, df2, ...]. I want to union them (so effectively do df1.union(df2).union(df3)...). What's the best practice to achieve that?
You can use functools.reduce to fold a union function over the list of DataFrames. Note that the example below uses unionByName, which matches columns by name rather than by position:
from functools import reduce
from pyspark.sql import DataFrame

list_of_sdf = [df1, df2, ...]
# Fold unionByName across the list; all frames must share the same column names
final_sdf = reduce(DataFrame.unionByName, list_of_sdf)
final_sdf now contains the rows of every DataFrame in the list.
When some DataFrames are missing columns, partially apply unionByName with allowMissingColumns=True (available since Spark 3.1):
from functools import reduce, partial
from pyspark.sql import DataFrame
# Union dataframes by name (missing columns filled with null)
union_by_name = partial(DataFrame.unionByName, allowMissingColumns=True)
df_output = reduce(union_by_name, [df1, df2, ...])
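For illustration, here is a minimal self-contained run of the same pattern; the toy frames df_a and df_b and their columns are made up for this example:

from functools import reduce, partial
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy frames with overlapping but unequal column sets
df_a = spark.createDataFrame([(1, "x")], ["id", "a"])
df_b = spark.createDataFrame([(2, "y")], ["id", "b"])

union_by_name = partial(DataFrame.unionByName, allowMissingColumns=True)
# Rows from both frames appear; the columns missing from either side are null-filled
reduce(union_by_name, [df_a, df_b]).show()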