Apache Spark: Get the first and last row of each partition

Question

I would like to get the first and last row of each partition in spark (I'm using pyspark). How do I go about this? In my code I repartition my dataset based on a key column using:

mydf.repartition(keyColumn).sortWithinPartitions(sortKey)

Is there a way to get the first row and last row for each partition? Thanks

Richard Nemeth · Accepted Answer

I would highly advise against working with partitions directly. Spark does a lot of DAG optimisation, so when you try executing specific functionality on each partition, all your assumptions about the partitions and their distribution might be completely false.

You seem to however have a keyColumn and sortKey, so then I'd just suggest to do the following:

import pyspark
import pyspark.sql.functions as f

w_asc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.asc(sortKey))
w_desc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.desc(sortKey))
res_df = mydf. \
 withColumn("rn_asc", f.row_number().over(w_asc)). \
 withColumn("rn_desc", f.row_number().over(w_desc)). \
 where("rn_asc = 1 or rn_desc = 1")

The resulting dataframe will have 2 additional columns, where rn_asc=1 indicates the first row and rn_desc=1 indicates the last row.

Apache Spark: Get the first and last row of each partition

Tags:

apache-spark

pyspark

Java Developr

1 Answers

Richard Nemeth

Recent Activity

Donate For Us

Apache Spark: Get the first and last row of each partition

Tags:

apache-spark

pyspark

Java Developr

1 Answers

Richard Nemeth

Related questions

Recent Activity

Donate For Us