Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PySpark Mapping Elements in Array within a Dataframe to another Dataframe

I have two dataframes. The first dataframe has an array as its value for column2 and I want to join it with the second dataframe so the numerical values are mapped to their string values. The order of the elements should stay the same since they correspond to the array elements in column3 by index.

df_one:

 column1|  column2|        column3
----------------------------------
"thing1"|[1,2,3..]|[0.1,0.2,0.3..]
"thing2"|[1,2,3..]|[0.1,0.2,0.3..]
"thing3"|[1,2,3..]|[0.1,0.2,0.3..]
...

df_two:

columnA|columnB
---------------
      1|"item1"
      2|"item2"
      3|"item3"
...

Is there a way to join these dataframes and select the columns like so:

column1 |                  newColumn|        column3
----------------------------------------------------
"thing1"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing2"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing3"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
...
like image 832
bingitybang Avatar asked Oct 30 '25 09:10

bingitybang


1 Answers

As mentioned in the comments, explode on column2 and then join on columnA is the good way to go. Yet, when you group back the data, I am not sure that the order would always be preserved.

To be sure, and avoid costly UDF in python, you could use posexplode to keep track of the position of each element, and then an ordered window function to build the list back:

df_one = spark.createDataFrame([("thing1", [1, 2, 3], "X"), ("thing2", [1, 2, 3], "Y"), ("thing3", [1, 2, 3], "Z")],
                               ["column1", "column2", "column3"])
df_two = spark.createDataFrame([(1, "item1"), (2, "item2"), (3, "item3")],
                               ["columnA", "columnB"])

w = Window.partitionBy("column1").orderBy("pos")

df_one\
    .select("*", f.posexplode("column2").alias("pos", "columnA"))\
    .join(df_two, ['columnA'])\
    .withColumn("newColumn", f.collect_list("columnB").over(w))\
    .where(f.col("pos")+1 == f.size(f.col("column2")))\
    .select("column1", "newColumn", "column3")\
    .show(truncate=False)
+-------+---------------------+-------+
|column1|newColumn            |column3|
+-------+---------------------+-------+
|thing1 |[item1, item2, item3]|X      |
|thing2 |[item1, item2, item3]|Y      |
|thing3 |[item1, item2, item3]|Z      |
+-------+---------------------+-------+
like image 190
Oli Avatar answered Nov 01 '25 00:11

Oli