How would one go about implementing a skip/take query (typical server-side grid paging) using Spark SQL? I have scoured the net and can only find very basic examples such as this one: https://databricks-training.s3.amazonaws.com/data-exploration-using-spark-sql.html
I don't see any concept of ROW_NUMBER() or OFFSET/FETCH like with T-SQL. Does anyone know how to accomplish this?
Something like:
scala > csc.sql("select * from users skip 10 limit 10").collect()
Try something like this:
val rdd = csc.sql("select * from <keyspace>.<table>")
// Pair each row with its 0-based position (drop the extra .rdd if csc.sql already returns an RDD/SchemaRDD)
val rdd2 = rdd.rdd.zipWithIndex()
// Keep only the rows whose index falls inside the requested window
rdd2.filter { case (_, idx) => idx > 5 && idx < 10 }.collect()
rdd2.filter { case (_, idx) => idx > 9 && idx < 12 }.collect()
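If you want this as a reusable skip/take, here is a minimal sketch of the same idea, assuming csc.sql returns a DataFrame; the skipTake helper, the users table, and the order-by column are hypothetical and not part of the original answer. Note that without an ORDER BY the row order, and therefore the page contents, is not deterministic across runs.

import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper: zip each row with its index and keep only the requested page
def skipTake(df: DataFrame, skip: Long, take: Long): Array[Row] =
  df.rdd
    .zipWithIndex()                                            // (row, 0-based index)
    .filter { case (_, idx) => idx >= skip && idx < skip + take }
    .map { case (row, _) => row }
    .collect()

// e.g. the "skip 10 limit 10" from the question:
// val page = skipTake(csc.sql("select * from users order by user_id"), 10, 10)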
I found that neither Spark SQL nor the DataFrame API has LIMIT with an OFFSET. This is probably because the data is distributed more or less randomly across partitions, so an offset is only meaningful together with an ORDER BY. We can use a window function to implement it:
1. Suppose we want the products whose revenue ranks from 2 to 5.
2. Implementation:
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank, row_number

start, end = 2, 5  # the rank range we want to keep

windowSpec = Window.partitionBy().orderBy(df.revenue.asc())
result = df.select(
    "product",
    "category",
    "revenue",
    row_number().over(windowSpec).alias("row_number"),
    dense_rank().over(windowSpec).alias("rank"))
result.show()

result = result.filter((col("rank") >= start) & (col("rank") <= end))
result.show()
Please refer to https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
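A plain-SQL variant of the same idea, as a rough sketch: number the rows with ROW_NUMBER() in a subquery and filter in the outer query. The table and column names below are placeholders, and depending on your Spark version window functions may require a HiveContext (see the blog post above).

val page = csc.sql("""
  SELECT product, category, revenue
  FROM (
    SELECT product, category, revenue,
           ROW_NUMBER() OVER (ORDER BY revenue) AS rn
    FROM products
  ) ranked
  WHERE rn BETWEEN 2 AND 5
""")
page.collect()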