I have a DataFrame like the one below.
+---+-------+-------+
|ayy| artist|numbers|
+---+-------+-------+
| a| Monet| 10000|
| a| Dali| 10000|
| a|Gauguin| 10000|
| b| Monet| 10000|
| b| Dali| 10000|
| b|Gauguin| 10000|
+---+-------+-------+
I want to pick one 'artist' per 'ayy' group based on their 'numbers', but I want to make sure the pick is random when there is a tie in 'numbers'.
I implemented the following:
from pyspark.sql import functions as F, Window

# rank artists within each 'ayy' group by 'numbers', highest first
w = Window.partitionBy('ayy').orderBy(F.col('numbers').desc())
df_test = df_test.withColumn('rank', F.rank().over(w))\
    .withColumn('rank2', F.row_number().over(w))
df_test.show()
which gave me
+---+-------+-------+----+-----+
|ayy| artist|numbers|rank|rank2|
+---+-------+-------+----+-----+
| a| Monet| 10000| 1| 1|
| a|Gauguin| 10000| 1| 2|
| a| Dali| 10000| 1| 3|
| b| Monet| 10000| 1| 1|
| b| Dali| 10000| 1| 2|
| b|Gauguin| 10000| 1| 3|
+---+-------+-------+----+-----+
As you can see, row_number gives me a single artist per group, but its tie-break is deterministic; I want to make sure a random 'artist' is picked when the 'numbers' are tied.
My first thought was to use a UDF, but I am still not quite sure how to go about it.
You can add a random number in the range [0.0, 1.0) to the rank, and then assign a row number based on that perturbed rank. Since rank() increases by at least 1 between distinct values of 'numbers', adding a fraction below 1 can never reorder rows with different 'numbers'; it only shuffles the tied rows:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'rank',
    # tied rows share the same rank(); rand() in [0.0, 1.0) shuffles them
    # without crossing rows that have a different 'numbers' value
    F.rank().over(Window.partitionBy('ayy').orderBy(F.col('numbers').desc()))
    + F.rand(seed=1)
).withColumn(
    'rank',  # overwrite with a clean 1..n row number per group
    F.row_number().over(Window.partitionBy('ayy').orderBy('rank'))
)
df2.show()
+---+-------+-------+----+
|ayy| artist|numbers|rank|
+---+-------+-------+----+
| b|Gauguin| 10000| 1|
| b| Dali| 10000| 2|
| b| Monet| 10000| 3|
| a| Monet| 10000| 1|
| a|Gauguin| 10000| 2|
| a| Dali| 10000| 3|
+---+-------+-------+----+
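If you then want just one (randomly chosen) artist per group, you can filter on the final row number. A minimal follow-up sketch; df_top is just an illustrative name:

# keep the single randomly picked artist for each 'ayy' group
df_top = df2.filter(F.col('rank') == 1).drop('rank')
df_top.show()

As an aside, an equivalent and slightly shorter variant is to skip the rank-plus-noise step and use F.rand() directly as a secondary sort key:

w = Window.partitionBy('ayy').orderBy(F.col('numbers').desc(), F.rand(seed=1))
df_alt = df.withColumn('rank', F.row_number().over(w))

Either way, ties in 'numbers' are broken at random while the descending order of 'numbers' is preserved.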