How would one go about implementing a skip/take query (typical server-side grid paging) using Spark SQL? I have scoured the net and can only find very basic examples such as this one: https://databricks-training.s3.amazonaws.com/data-exploration-using-spark-sql.html
I don't see any concept of ROW_NUMBER() or OFFSET/FETCH like with T-SQL. Does anyone know how to accomplish this?
Something like:
scala > csc.sql("select * from users skip 10 limit 10").collect()
Try something like this:
val rdd = csc.sql("select * from <keyspace>.<table>")
// Pair each row with its 0-based position (drop the extra .rdd if csc.sql already returns an RDD/SchemaRDD)
val rdd2 = rdd.rdd.zipWithIndex()
// Keep only the rows whose index falls inside the requested window
rdd2.filter { case (_, idx) => idx > 5 && idx < 10 }.collect()
rdd2.filter { case (_, idx) => idx > 9 && idx < 12 }.collect()
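If you want this as a reusable skip/take, here is a minimal sketch of the same idea, assuming csc.sql returns a DataFrame; the skipTake helper, the users table, and the order-by column are hypothetical and not part of the original answer. Note that without an ORDER BY the row order, and therefore the page contents, is not deterministic across runs.

import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper: zip each row with its index and keep only the requested page
def skipTake(df: DataFrame, skip: Long, take: Long): Array[Row] =
  df.rdd
    .zipWithIndex()                                            // (row, 0-based index)
    .filter { case (_, idx) => idx >= skip && idx < skip + take }
    .map { case (row, _) => row }
    .collect()

// e.g. the "skip 10 limit 10" from the question:
// val page = skipTake(csc.sql("select * from users order by user_id"), 10, 10)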
I found that neither Spark SQL nor the DataFrame API has LIMIT with an OFFSET. This is probably because the data is distributed more or less randomly across partitions, so an offset is only meaningful together with an ORDER BY. We can use a window function to implement it:
1. Suppose we want the products whose revenue ranks from 2 to 5.
2. Implementation:
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank, row_number

start, end = 2, 5  # the rank range we want to keep

windowSpec = Window.partitionBy().orderBy(df.revenue.asc())
result = df.select(
    "product",
    "category",
    "revenue",
    row_number().over(windowSpec).alias("row_number"),
    dense_rank().over(windowSpec).alias("rank"))
result.show()

result = result.filter((col("rank") >= start) & (col("rank") <= end))
result.show()
Please refer to https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
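A plain-SQL variant of the same idea, as a rough sketch: number the rows with ROW_NUMBER() in a subquery and filter in the outer query. The table and column names below are placeholders, and depending on your Spark version window functions may require a HiveContext (see the blog post above).

val page = csc.sql("""
  SELECT product, category, revenue
  FROM (
    SELECT product, category, revenue,
           ROW_NUMBER() OVER (ORDER BY revenue) AS rn
    FROM products
  ) ranked
  WHERE rn BETWEEN 2 AND 5
""")
page.collect()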