I have a quick question I hope someone can answer.
Is it possible, within a pyspark dataframe, to limit the number of rows returned for each unique value in a certain column?
For example, I have a dataframe with 4 columns:
+-------+------------------+--------------+-------------+
|partner|productId | mediumtype| prediction|
+-------+------------------+--------------+-------------+
|7005410| 544| 1| 0.0026476993|
|7005410| 549| 2|-2.6975607E-4|
|7005410| 626| 3| 2.0409889E-4|
|7005410| 840| 2| 3.6301462E-5|
|7005410| 1192| 3| 2.2148499E-5|
+-------+------------------+--------------+-------------+
The partner column has a numeric value (in this case 7005410). The example shows 5 rows for this partner id, but in reality there are 7,000+ rows per partner. And this applies not only to this partner but to all of the other 4,000+ partners, meaning there are roughly 7,000 * 4,000 = 28,000,000 rows in total.
What I want is to fetch only the top 5 predictions (limit(5)) for every unique partner id. So far I have put the dataframe in a loop with a filter per partner, but due to Spark's lazy evaluation this takes a tremendous amount of time to complete. I was wondering if there is simply a way to apply a limit(5) to every unique partner id.
Try this. Take row_number() over a window partitioned by partner and filter out the rows where rowNum is greater than 5. Since the data is already ordered, we can just order by a literal constant in the window.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows within each partner partition; ordering by a literal
# constant keeps the rows in their existing order.
w = Window.partitionBy("partner").orderBy(F.lit(1))

df.withColumn("rowNum", F.row_number().over(w)) \
  .filter("rowNum <= 5").drop("rowNum").show()
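If by "top 5 predictions" you mean the 5 highest prediction values per partner (rather than the first 5 rows as stored), a small variation orders the window by the prediction column descending instead of by a constant. This is just a sketch assuming your dataframe is named df with the columns shown above; w_pred and top5 are illustrative names:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each partner by prediction, highest first,
# then keep only the top 5 rows per partner.
w_pred = Window.partitionBy("partner").orderBy(F.col("prediction").desc())

top5 = (df.withColumn("rowNum", F.row_number().over(w_pred))
          .filter("rowNum <= 5")
          .drop("rowNum"))
top5.show()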