In MySQL, I can have a query like this:
select  
    cast(from_unixtime(t.time, '%Y-%m-%d %H:00') as datetime) as timeHour
    , ... 
from
    some_table t 
group by
    timeHour, ...
order by
    timeHour, ...
where timeHour in the GROUP BY is the result of a select expression.
But I just tried a similar query in Spark SQL, and I got this error:
Error: org.apache.spark.sql.AnalysisException: 
cannot resolve '`timeHour`' given input columns: ...
My query for Spark SQL was this:
select  
      cast(t.unixTime as timestamp) as timeHour
    , ...
from
    another_table as t
group by
    timeHour, ...
order by
    timeHour, ...
Is this construct possible in Spark SQL?
Yes, it is. The error occurs because, per standard SQL evaluation order, GROUP BY is resolved before the SELECT projection, so an alias defined in the select list is not visible to it. You can make it work in Spark SQL in two ways, both of which introduce the computed column before the grouping happens:
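To try the snippets below end to end, here is a minimal setup sketch in Scala. The table name some_table matches the queries; the value column, the sample rows, the local[*] master, and the UTC session time zone are assumptions made purely for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("group-by-alias-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// pin the session time zone so unix_timestamp('2017-09-16 ...') is deterministic
spark.conf.set("spark.sql.session.timeZone", "UTC")

// hypothetical data: starttime in epoch milliseconds plus a value to aggregate
Seq(
  (1505520000000L, 10.0),  // 2017-09-16 00:00:00 UTC
  (1505523600000L, 20.0),  // 2017-09-16 01:00:00 UTC
  (1505525400000L,  5.0)   // 2017-09-16 01:30:00 UTC, same hour bucket as the row above
).toDF("starttime", "value")
  .createOrReplaceTempView("some_table")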
Approach 1: using a subquery, so the computed column already exists when GROUP BY runs:
SELECT timeHour, sum(...) AS someThing
FROM (
    SELECT
          from_unixtime(starttime/1000, 'yyyy-MM-dd HH:00') AS timeHour
        , ...
        , starttime
    FROM some_table
) t
WHERE
    starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
    AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
GROUP BY
    timeHour
ORDER BY
    timeHour
LIMIT 10;
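With the sample view from the setup sketch in place, Approach 1 runs as-is once the "..." placeholders are filled with the hypothetical value column:
spark.sql("""
  SELECT timeHour, sum(value) AS someThing
  FROM (
      SELECT from_unixtime(starttime/1000, 'yyyy-MM-dd HH:00') AS timeHour,
             value, starttime
      FROM some_table
  ) t
  WHERE starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
    AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
  GROUP BY timeHour
  ORDER BY timeHour
  LIMIT 10
""").show()
// for the sample data: one row per hour, 00:00 -> 10.0 and 01:00 -> 25.0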
Approach 2: using a WITH clause (the more elegant way):
-- name the subquery with a common table expression (CTE)
WITH table_alias AS (
    SELECT
          from_unixtime(starttime/1000, 'yyyy-MM-dd HH:00') AS timeHour
        , ...
        , starttime
    FROM some_table
)
-- then query the alias like a table
SELECT timeHour, sum(...) AS someThing
FROM table_alias
WHERE
    starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
    AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
GROUP BY
    timeHour
ORDER BY
    timeHour
LIMIT 10;
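Against the same sample view, the CTE variant runs the same way and returns the same result; only the query shape changes:
spark.sql("""
  WITH table_alias AS (
      SELECT from_unixtime(starttime/1000, 'yyyy-MM-dd HH:00') AS timeHour,
             value, starttime
      FROM some_table
  )
  SELECT timeHour, sum(value) AS someThing
  FROM table_alias
  WHERE starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
    AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
  GROUP BY timeHour
  ORDER BY timeHour
  LIMIT 10
""").show()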
An alternative, using the Spark DataFrame API (no SQL) with Scala:
import org.apache.spark.sql.functions._
import spark.implicits._  // enables the $"colName" syntax

val df = ....  // load the actual table as a DataFrame

df.withColumn("timeHour", from_unixtime($"starttime" / 1000, "yyyy-MM-dd HH:00"))
  .groupBy($"timeHour")
  .agg(sum("...").as("someThing"))
  .orderBy($"timeHour")
  .show()
// another way, as per eliasah's comment: group on the expression directly
df.groupBy(from_unixtime($"starttime" / 1000, "yyyy-MM-dd HH:00").as("timeHour"))
  .agg(sum("...").as("someThing"))
  .orderBy($"timeHour")
  .show()
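In both DataFrame variants the computed column becomes a real column of the plan before the grouping happens, which is exactly what the SQL subquery and CTE workarounds simulate; the second form just skips the intermediate withColumn step by grouping on the expression directly.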