When can Symbols be used to represent columns in Spark SQL?

Consider a basic groupBy expression on a DataFrame:

val groupDf  = rsdf.groupBy("league","vendor").agg(mean('league),mean('vendor))

The groupBy portion is fine: it uses strings for column names. However, the agg (/mean) portion is not, since apparently Symbols are not supported there.

I am wondering why Symbols do not work here, and when they are permitted in Spark SQL.

WestCoastProjects asked Sep 02 '25

1 Answer

The short answer is never: there are no DataFrame methods which support Symbols directly.

The long answer is anywhere the Spark compiler expects a Column, but you'll need additional objects in scope.
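To illustrate, the conversion applies to any method that takes a Column, not just agg. A minimal sketch, assuming a SparkSession named spark and a DataFrame df with an id column:

```scala
import spark.implicits._                  // brings the Symbol -> Column conversion into scope
import org.apache.spark.sql.DataFrame

def examples(df: DataFrame): Unit = {
  df.select('id)        // Symbol converted to Column where select expects Column*
  df.filter('id > 0)    // conversion also kicks in inside expressions
  df.orderBy('id.desc)  // the resulting Column supports methods like .desc
}
```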

The only reason why Symbols work at all is the implicit conversion from Symbol to Column provided by SQLImplicits.

Once it is imported, the compiler will be able to convert a Symbol wherever a Column is required, including in agg (as long as the implicits are in scope):

import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq((1, 2)).toDF("league", "vendor")

df.groupBy("league","vendor").agg(mean('league),mean('vendor)).show

+------+------+-----------+-----------+                                         
|league|vendor|avg(league)|avg(vendor)|
+------+------+-----------+-----------+
|     1|     2|        1.0|        2.0|
+------+------+-----------+-----------+
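For comparison, the same aggregation can be written with the other common column notations; a sketch, assuming the same df and imports as above:

```scala
import spark.implicits._                    // provides the $"..." interpolator and Symbol conversion
import org.apache.spark.sql.functions._

// $-interpolator, also supplied by SQLImplicits:
df.groupBy("league", "vendor").agg(mean($"league"), mean($"vendor"))

// Explicit col() from org.apache.spark.sql.functions:
df.groupBy("league", "vendor").agg(mean(col("league")), mean(col("vendor")))

// mean also has a String overload, so no conversion is needed at all:
df.groupBy("league", "vendor").agg(mean("league"), mean("vendor"))
```

All three produce the same plan as the Symbol version; the Symbol form is merely a syntactic convenience layered on top of Column.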
Alper t. Turker answered Sep 04 '25