I'm using PySpark and have a Spark DataFrame that looks like this:
a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7
Desired output:
a | b_list
5 | 2,1,4,3
2 | 4,2,3,7
It's important to keep the values in the same sequence as shown in the output.
Instead of a UDF, we can also join the list with the built-in concat_ws function, as suggested in the comments above:
import pyspark.sql.functions as F
df = (df
      .withColumn('lst', F.concat(df['b'], F.lit(','), df['c']))
      .groupBy('a')
      .agg(F.concat_ws(',', F.collect_list('lst')).alias('lst')))
df.show()
+---+-------+
|  a|    lst|
+---+-------+
|  5|2,1,4,3|
|  2|4,2,3,7|
+---+-------+
The following concatenates the last two columns and aggregates them into an array column:
import pyspark.sql.functions as f
df1 = df.withColumn('lst', f.concat(df['b'], f.lit(','), df['c']))\
  .groupBy('a')\
  .agg( f.collect_list('lst').alias('b_list'))
Now join the array elements:
# Simplistic UDF to join the array elements into one comma-separated string:
def join_array(col):
    return ','.join(col)
join = f.udf(join_array)
df1.select('a', join(df1['b_list']).alias('b_list'))\
  .show()
This prints:
+---+-------+
|  a| b_list|
+---+-------+
|  5|2,1,4,3|
|  2|4,2,3,7|
+---+-------+
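To check the expected result without a Spark session, the same group-and-join logic can be sketched in plain Python. This is only an illustrative reference and not part of either answer. The helper name `group_and_join` and the tuple layout `(a, b, c)` are assumptions made for this sketch.

```python
def group_and_join(rows, sep=","):
    """Group (a, b, c) tuples by a and join b, c values in input order.

    Hypothetical helper for illustration; it mirrors the concat +
    collect_list + concat_ws steps on an in-memory list.
    """
    grouped = {}  # plain dicts keep insertion order in Python 3.7+
    for a, b, c in rows:
        # Append b then c for this row, like concat(b, ',', c)
        grouped.setdefault(a, []).extend([str(b), str(c)])
    # Join each group's values, like concat_ws(',', collect_list(...))
    return [(a, sep.join(values)) for a, values in grouped.items()]


rows = [(5, 2, 1), (5, 4, 3), (2, 4, 2), (2, 3, 7)]
result = group_and_join(rows)
for a, b_list in result:
    print(a, "|", b_list)
```

Unlike this in-memory version, Spark documents collect_list as non-deterministic, because the order of collected values depends on row order, which may change after a shuffle. If the sequence matters on a large or repartitioned DataFrame, it is worth comparing the Spark output against a reference like this one.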