I have a quick question I hope someone can answer.
Is it possible, within a pyspark dataframe, to limit the number of rows returned for each unique value in a certain column?
For example, I have a dataframe with 4 columns:
+-------+------------------+--------------+-------------+
|partner|productId | mediumtype| prediction|
+-------+------------------+--------------+-------------+
|7005410| 544| 1| 0.0026476993|
|7005410| 549| 2|-2.6975607E-4|
|7005410| 626| 3| 2.0409889E-4|
|7005410| 840| 2| 3.6301462E-5|
|7005410| 1192| 3| 2.2148499E-5|
+-------+------------------+--------------+-------------+
The partner column has a numeric value (in this case 7005410). The example shows 5 rows for this partner id, but in reality there are 7,000+ rows per partner. And this applies not only to this partner but to all of the other 4,000+ partners, meaning there are roughly 7,000 * 4,000 = 28,000,000 rows in total.
What I want is to fetch only the top 5 predictions (limit(5)) for every unique partner id. So far I have put the dataframe in a loop with a filter per partner, but due to Spark's lazy evaluation this takes a tremendous amount of time to complete. I was wondering if there is simply a way to apply a limit(5) to every unique partner id.
Try this. Take row_number() over a window partitioned by partner and filter out the rows where rowNum is greater than 5. Since the data is already ordered, we can just order by a literal constant in the window.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows within each partner partition; ordering by a literal
# constant keeps the rows in their existing order.
w = Window.partitionBy("partner").orderBy(F.lit(1))

df.withColumn("rowNum", F.row_number().over(w)) \
  .filter("rowNum <= 5").drop("rowNum").show()
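If by "top 5 predictions" you mean the 5 highest prediction values per partner (rather than the first 5 rows as stored), a small variation orders the window by the prediction column descending instead of by a constant. This is just a sketch assuming your dataframe is named df with the columns shown above; w_pred and top5 are illustrative names:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each partner by prediction, highest first,
# then keep only the top 5 rows per partner.
w_pred = Window.partitionBy("partner").orderBy(F.col("prediction").desc())

top5 = (df.withColumn("rowNum", F.row_number().over(w_pred))
          .filter("rowNum <= 5")
          .drop("rowNum"))
top5.show()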