From the pyspark docs, I can do:
gdf = df.groupBy(df.name)
sorted(gdf.agg({"*": "first"}).collect())
In my actual use case I have many variables, so I like that I can simply create a dictionary. That's why @lemon's suggestion of calling first per column,

gdf = df.groupBy(df.name)
sorted(gdf.agg(F.first(col, ignorenulls=True)).collect())

won't work for me: it handles one column at a time, so I'd have to spell out every variable.
How can I pass a parameter to first (i.e. ignorenulls=True)? See here.
You can use a list comprehension to build one aggregate expression per column:
gdf.agg(*[F.first(x, ignorenulls=True).alias(x) for x in df.columns]).collect()
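In case it helps, here's a minimal self-contained sketch of the same idea; the SparkSession setup, the toy data, and the column names x and y are made up for illustration. It also skips the grouping column in the comprehension so the result doesn't end up with a duplicate "name" column:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data: "name" is the grouping key; x and y contain nulls.
df = spark.createDataFrame(
    [("a", None, 4), ("a", 1, None), ("b", 3, 6)],
    "name string, x int, y int",
)

gdf = df.groupBy(df.name)

# One first(..., ignorenulls=True) per column, keeping the original
# column names via alias; the grouping column is excluded.
exprs = [F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != "name"]

sorted(gdf.agg(*exprs).collect())
# [Row(name='a', x=1, y=4), Row(name='b', x=3, y=6)]

One caveat: first is order-sensitive in general (after a shuffle, which row is "first" in a group is not guaranteed), so on real data the picked value can vary unless you sort beforehand. In this toy example each group has only one non-null value per column, so the output is deterministic.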