I have a DataFrame like the one below.
+---+-------+-------+
|ayy| artist|numbers|
+---+-------+-------+
| a| Monet| 10000|
| a| Dali| 10000|
| a|Gauguin| 10000|
| b| Monet| 10000|
| b| Dali| 10000|
| b|Gauguin| 10000|
+---+-------+-------+
I want to pick one 'artist' per 'ayy' group based on their 'numbers', but I want to make sure the pick is random when there is a tie in 'numbers'.
I implemented the following:
from pyspark.sql import functions as F, Window

# rank artists within each 'ayy' group by 'numbers', highest first
w = Window.partitionBy('ayy').orderBy(F.col('numbers').desc())
df_test = df_test.withColumn('rank', F.rank().over(w))\
    .withColumn('rank2', F.row_number().over(w))
df_test.show()
which gave me
+---+-------+-------+----+-----+
|ayy| artist|numbers|rank|rank2|
+---+-------+-------+----+-----+
| a| Monet| 10000| 1| 1|
| a|Gauguin| 10000| 1| 2|
| a| Dali| 10000| 1| 3|
| b| Monet| 10000| 1| 1|
| b| Dali| 10000| 1| 2|
| b|Gauguin| 10000| 1| 3|
+---+-------+-------+----+-----+
As you can see, row_number gives me a single artist per group, but its tie-break is deterministic; I want to make sure a random 'artist' is picked when the 'numbers' are tied.
My first thought was to use a UDF, but I am still not quite sure how to go about it.
You can add a random number in the range [0.0, 1.0) to the rank, and then assign a row number based on that perturbed rank. Since rank() increases by at least 1 between distinct values of 'numbers', adding a fraction below 1 can never reorder rows with different 'numbers'; it only shuffles the tied rows:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'rank',
    # tied rows share the same rank(); rand() in [0.0, 1.0) shuffles them
    # without crossing rows that have a different 'numbers' value
    F.rank().over(Window.partitionBy('ayy').orderBy(F.col('numbers').desc()))
    + F.rand(seed=1)
).withColumn(
    'rank',  # overwrite with a clean 1..n row number per group
    F.row_number().over(Window.partitionBy('ayy').orderBy('rank'))
)
df2.show()
+---+-------+-------+----+
|ayy| artist|numbers|rank|
+---+-------+-------+----+
| b|Gauguin| 10000| 1|
| b| Dali| 10000| 2|
| b| Monet| 10000| 3|
| a| Monet| 10000| 1|
| a|Gauguin| 10000| 2|
| a| Dali| 10000| 3|
+---+-------+-------+----+
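If you then want just one (randomly chosen) artist per group, you can filter on the final row number. A minimal follow-up sketch; df_top is just an illustrative name:

# keep the single randomly picked artist for each 'ayy' group
df_top = df2.filter(F.col('rank') == 1).drop('rank')
df_top.show()

As an aside, an equivalent and slightly shorter variant is to skip the rank-plus-noise step and use F.rand() directly as a secondary sort key:

w = Window.partitionBy('ayy').orderBy(F.col('numbers').desc(), F.rand(seed=1))
df_alt = df.withColumn('rank', F.row_number().over(w))

Either way, ties in 'numbers' are broken at random while the descending order of 'numbers' is preserved.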