 

Apache Spark: Get the first and last row of each partition

I would like to get the first and last row of each partition in Spark (I'm using PySpark). How do I go about this? In my code I repartition my dataset based on a key column using:

mydf.repartition(keyColumn).sortWithinPartitions(sortKey)

Is there a way to get the first row and last row for each partition? Thanks
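For context, a minimal sketch of this setup might look like the following (the DataFrame, column names "id"/"ts", and the app name are placeholders, not taken from the question):

# Hypothetical reproduction of the setup described above
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("first-last-per-partition").getOrCreate()

# Toy DataFrame with a key column and a sort column (names are assumptions)
mydf = spark.createDataFrame(
    [(1, 10), (1, 20), (1, 30), (2, 5), (2, 15)],
    ["id", "ts"],
)

keyColumn, sortKey = "id", "ts"
mydf = mydf.repartition(keyColumn).sortWithinPartitions(sortKey)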

asked by Java Developr


1 Answer

I would highly advise against working with partitions directly. Spark does a lot of DAG optimisation, so when you try executing specific functionality on each partition, all your assumptions about the partitions and their distribution might be completely false.
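If you want to see this for yourself, one quick check (a sketch, not part of the original answer) is to tag each row with its physical partition id and look at how keys are actually distributed; the mapping often differs from what you'd expect and can change between runs or Spark versions:

import pyspark.sql.functions as f

# Tag every row with the id of the physical partition it currently lives in.
# "partitioned" is a hypothetical name; mydf, keyColumn and sortKey come from the question.
partitioned = mydf.repartition(keyColumn).sortWithinPartitions(sortKey)

(partitioned
    .withColumn("pid", f.spark_partition_id())
    .groupBy("pid")
    .agg(
        f.countDistinct(keyColumn).alias("distinct_keys"),
        f.count("*").alias("rows"),
    )
    .show())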

However, you seem to have a keyColumn and a sortKey, so I'd suggest simply doing the following:

from pyspark.sql import Window
import pyspark.sql.functions as f

# One window per key, ordered ascending (to find the first row)
# and descending (to find the last row)
w_asc = Window.partitionBy(keyColumn).orderBy(f.asc(sortKey))
w_desc = Window.partitionBy(keyColumn).orderBy(f.desc(sortKey))

res_df = (
    mydf
    .withColumn("rn_asc", f.row_number().over(w_asc))    # 1 = first row per key
    .withColumn("rn_desc", f.row_number().over(w_desc))  # 1 = last row per key
    .where("rn_asc = 1 or rn_desc = 1")
)

The resulting DataFrame will have two additional columns, where rn_asc = 1 marks the first row and rn_desc = 1 marks the last row of each key group.
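If you'd rather have the first and last rows without the helper columns, you can split and clean up the result afterwards, e.g. (sketch):

# First and last row per key, as two separate DataFrames
first_rows = res_df.where("rn_asc = 1").drop("rn_asc", "rn_desc")
last_rows = res_df.where("rn_desc = 1").drop("rn_asc", "rn_desc")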

answered by Richard Nemeth


