(Tested on Spark 2.2 and 2.3)
I am using Spark to aggregate stock trading ticks into daily OHLC (open-high-low-close) records.
The input data looks like this:
val data = Seq(
  ("2018-07-11 09:01:00", 34.0), ("2018-07-11 09:04:00", 32.0),
  ("2018-07-11 09:02:00", 35.0), ("2018-07-11 09:03:00", 30.0),
  ("2018-07-11 09:00:00", 33.0), ("2018-07-12 09:01:00", 56.0),
  ("2018-07-12 09:04:00", 54.0), ("2018-07-12 09:02:00", 51.0),
  ("2018-07-12 09:03:00", 50.0), ("2018-07-12 09:00:00", 51.0)
).toDF("time", "price")
data.createOrReplaceTempView("ticks")
data.show
shown as
+-------------------+-----+
| time|price|
+-------------------+-----+
|2018-07-11 09:01:00| 34.0|
|2018-07-11 09:04:00| 32.0|
|2018-07-11 09:02:00| 35.0|
|2018-07-11 09:03:00| 30.0|
|2018-07-11 09:00:00| 33.0|
|2018-07-12 09:01:00| 56.0|
|2018-07-12 09:04:00| 54.0|
|2018-07-12 09:02:00| 51.0|
|2018-07-12 09:03:00| 50.0|
|2018-07-12 09:00:00| 51.0|
+-------------------+-----+
Desired output is
+----------+----+----+----+-----+
| date|open|high| low|close|
+----------+----+----+----+-----+
|2018-07-11|33.0|35.0|30.0| 32.0|
|2018-07-12|51.0|56.0|50.0| 54.0|
+----------+----+----+----+-----+
There are many SQL solutions based on window functions, such as this and this:
SELECT DISTINCT
    TO_DATE(time) AS date,
    FIRST_VALUE(price) OVER (PARTITION BY TO_DATE(time) ORDER BY time) AS open,
    MAX(price) OVER (PARTITION BY TO_DATE(time)) AS high,
    MIN(price) OVER (PARTITION BY TO_DATE(time)) AS low,
    LAST_VALUE(price) OVER (PARTITION BY TO_DATE(time) ORDER BY time
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS close
FROM ticks
Due to the limitations of standard SQL, these solutions are cumbersome.
Today, I found that Spark SQL allows FIRST_VALUE and LAST_VALUE in a GROUP BY context, which standard SQL does not allow.
This relaxation in Spark SQL leads to a neat and tidy solution like this:
SELECT
TO_DATE(time) AS date,
FIRST_VALUE(price) AS open,
MAX(price) AS high,
MIN(price) AS low,
LAST_VALUE(price) AS close
FROM ticks
GROUP BY TO_DATE(time)
You can try it
spark.sql("SELECT TO_DATE(time) AS date, FIRST(price) AS open, MAX(price) AS high, MIN(price) AS low, LAST(price) AS close FROM ticks GROUP BY TO_DATE(time)").show
shown as
+----------+----+----+----+-----+
| date|open|high| low|close|
+----------+----+----+----+-----+
|2018-07-11|34.0|35.0|30.0| 33.0|
|2018-07-12|56.0|56.0|50.0| 51.0|
+----------+----+----+----+-----+
However, the above result is incorrect (compare it with the desired output above).
FIRST_VALUE and LAST_VALUE need a deterministic ordering to get deterministic results.
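Here is a minimal sketch that makes the non-determinism visible (reusing the data DataFrame above; the repartition(8) is just an arbitrary shuffle added for illustration): the open and close columns may change between runs, because FIRST and LAST simply return whatever row each task happens to see first and last.
import org.apache.spark.sql.functions._

// Shuffle the rows into arbitrary partitions, then aggregate without any ordering.
// The open/close values may differ from run to run and from the desired result.
data.repartition(8)
  .groupBy(expr("TO_DATE(time)").as("date"))
  .agg(first("price").as("open"), last("price").as("close"))
  .show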
I can correct it by adding an orderBy before the groupBy:
import org.apache.spark.sql.functions._
data.orderBy("time").groupBy(expr("TO_DATE(time)").as("date")).agg(
  first("price").as("open"),
  max("price").as("high"),
  min("price").as("low"),
  last("price").as("close")
).show
shown as
+----------+----+----+----+-----+
| date|open|high| low|close|
+----------+----+----+----+-----+
|2018-07-11|33.0|35.0|30.0| 32.0|
|2018-07-12|51.0|56.0|50.0| 54.0|
+----------+----+----+----+-----+
which is exactly the desired result!
My question is: is the above 'orderBy then groupBy' code valid? Is the ordering guaranteed? Can we rely on this non-standard behavior in serious production code?
The point of this question is that, in standard SQL, we can only do a GROUP BY then ORDER BY to sort the aggregation, but not ORDER BY then GROUP BY.
The GROUP BY will ignore the ordering of ORDER BY.
I also wonder: if Spark SQL can perform such a GROUP BY under a desired ordering, could standard SQL introduce a syntax for this as well?
P.S.
I can think of some aggregation functions that depend on deterministic ordering.
WITH ORDER BY time SELECT COLLECT_LIST(price) GROUP BY stockID
WITH ORDER BY time SELECT SUM(SQUARE(price - LAG(price, 1, 0))) GROUP BY stockID
Without the WITH ORDER BY time, how can we sort the collected list in standard SQL? (One Spark workaround is sketched below.)
These examples show that "GROUP BY under desired ordering" is still useful.
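For the COLLECT_LIST example, one common workaround in Spark (a sketch, assuming a hypothetical ticksWithId DataFrame that carries the stockID column used above) is to collect (time, price) structs and sort the resulting array afterwards, since collect_list itself gives no ordering guarantee:
import org.apache.spark.sql.functions._

// sort_array orders the structs by their first field (time), so the prices
// inside each list end up in chronological order.
ticksWithId
  .groupBy(col("stockID"))
  .agg(sort_array(collect_list(struct(col("time"), col("price")))).as("ordered_ticks"))
  .show(false)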
Ordering in groupBy/agg is not guaranteed; you can use a window function instead, partitioned by the key and ordered by time, as sketched below.
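A sketch of that window-based approach with the DataFrame API (reusing the data DataFrame from the question; the column names match the desired output): first and last over an explicitly ordered, full-partition window give the day's open and close deterministically, and a final groupBy collapses the per-tick rows to one row per day.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One window per trading day, ordered by tick time and spanning the whole day,
// so that last() really returns the day's final price.
val w = Window.partitionBy(to_date(col("time")))
  .orderBy(col("time"))
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

data.withColumn("date", to_date(col("time")))
  .withColumn("open", first(col("price")).over(w))
  .withColumn("close", last(col("price")).over(w))
  .groupBy(col("date"), col("open"), col("close"))
  .agg(max(col("price")).as("high"), min(col("price")).as("low"))
  .select("date", "open", "high", "low", "close")
  .show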