Is there a way to get the 30 most recent days' worth of records for each grouping of data in PySpark? In this example, get the 2 records with the most recent dates within each (Grouping, Bucket) group. So a table like this
| Grouping | Bucket | Date |
|----------|--------|-----------|
| 1 | 1 | 9-20-2020 |
| 1 | 1 | 9-16-2020 |
| 1 | 1 | 8-14-2020 |
| 1 | 2 | 9-20-2020 |
| 1 | 2 | 9-16-2020 |
| 1 | 2 | 8-14-2020 |
| 2 | 1 | 9-20-2020 |
| 2 | 1 | 9-16-2020 |
| 2 | 1 | 8-14-2020 |
Would turn into this:
| Grouping | Bucket | Date |
|----------|--------|-----------|
| 1 | 1 | 9-20-2020 |
| 1 | 1 | 9-16-2020 |
| 1 | 2 | 9-20-2020 |
| 1 | 2 | 9-16-2020 |
| 2 | 1 | 9-20-2020 |
| 2 | 1 | 9-16-2020 |
Edit: I reviewed my question after editing it and realized that leaving it unedited was the right choice.
Use a window and take the top two ranks within each window:
df.withColumn("r", F.rank().over(Window.partitionBy("Grouping", "Bucket") \
.orderBy(F.col("Date").desc()))) \
.filter("r <=2") \
.drop("r") \
.show()
Output:
+--------+------+----------+
|Grouping|Bucket| Date|
+--------+------+----------+
| 1| 1|2020-09-20|
| 1| 1|2020-09-16|
| 1| 2|2020-09-20|
| 1| 2|2020-09-16|
| 2| 1|2020-09-20|
| 2| 1|2020-09-16|
+--------+------+----------+
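One caveat: rank() keeps every row that ties for a top-two date, so a group can yield more than two rows when dates repeat. If exactly two rows per group are required regardless of ties, row_number() is a drop-in replacement; a minimal sketch:

from pyspark.sql import functions as F
from pyspark.sql import Window

# row_number() assigns a unique sequence within each partition, so ties on
# Date do not produce extra rows.
w = Window.partitionBy("Grouping", "Bucket").orderBy(F.col("Date").desc())
df.withColumn("r", F.row_number().over(w)) \
  .filter("r <= 2") \
  .drop("r") \
  .show()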
Edit: this answer applies to this revision (get the most recent N records for each group).
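If what is wanted is instead the literal "30 most recent days' worth of records" from the question's first sentence (every row within 30 days of each group's latest date, however many rows that is), a similar window works. This is only a sketch, with the 30-day cutoff and a date-typed Date column as assumptions:

from pyspark.sql import functions as F
from pyspark.sql import Window

# Assumes Date is a DateType column; keep rows no more than 30 days older
# than the latest Date in the same (Grouping, Bucket) partition.
w = Window.partitionBy("Grouping", "Bucket")
df.withColumn("max_date", F.max("Date").over(w)) \
  .filter(F.datediff(F.col("max_date"), F.col("Date")) <= 30) \
  .drop("max_date") \
  .show()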