Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Skip/Take with Spark SQL

How would one go about implementing a skip/take query (typical server side grid paging) using Spark SQL. I have scoured the net and can only find very basic examples such as these here: https://databricks-training.s3.amazonaws.com/data-exploration-using-spark-sql.html

I don't see any concept of ROW_NUMBER() or OFFSET/FETCH like with T-SQL. Does anyone know how to accomplish this?

Something like:

scala > csc.sql("select * from users skip 10 limit 10").collect()
like image 717
KingOfHypocrites Avatar asked May 07 '26 08:05

KingOfHypocrites


2 Answers

Try something like this:

val rdd = csc.sql("select * from <keyspace>.<table>")
val rdd2 = rdd.view.zipWithIndex()
rdd2.filter(x => { x._2 > 5 && x._2 < 10;}).collect()
rdd2.filter(x => { x._2 > 9 && x._2 < 12;}).collect()
like image 192
phact Avatar answered May 09 '26 21:05

phact


I found that both sparksql and dataframe don't have limit with offset. May be in distributed data is random distributed, so limit with offset only have meanings in order by limit. we can use window function to implement it:

1. Consider we want to get product, of which revenue rank from 2 to 5

2. implementation

windowSpec = Window.partitionBy().orderBy(df.revenue.asc())
result = df.select(
    "product",
    "category",
    "revenue",
    row_number().over(windowSpec).alias("row_number"),
    dense_rank().over(windowSpec).alias("rank"))
    result.show()
    result = result.filter((col("rank") >= start) & (col("rank") <= end))
    result.show()

please refer to https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

like image 40
Cyanny Avatar answered May 09 '26 23:05

Cyanny



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!