Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark SQL UNION - ORDER BY column not in SELECT

I'm doing a UNION of two temp tables and trying to order by column but spark complains that the column I am ordering by cannot be resolved. Is this a bug or I'm missing something?

lazy val spark: SparkSession = SparkSession.builder.master("local[*]").getOrCreate()
      import org.apache.spark.sql.types.StringType

      val oldOrders = Seq(
        Seq("old_order_id1", "old_order_name1", "true"),
        Seq("old_order_id2", "old_order_name2", "true")
      )

      val newOrders = Seq(
        Seq("new_order_id1", "new_order_name1", "false"),
        Seq("new_order_id2", "new_order_name2", "false")
      )
      val schema = new StructType()
        .add("id", StringType)
        .add("name", StringType)
        .add("is_old", StringType)

      val oldOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(oldOrders.map(x => Row(x:_*))), schema)
      val newOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(newOrders.map(x => Row(x:_*))), schema)

      oldOrdersDF.createOrReplaceTempView("old_orders")
      newOrdersDF.createOrReplaceTempView("new_orders")

      //ordering by column not in select works if I'm not doing UNION
      spark.sql(
        """
          |SELECT oo.id, oo.name FROM old_orders oo
          |ORDER BY oo.is_old
        """.stripMargin).show()

      //ordering by column not in select doesn't work as I'm doing a UNION
      spark.sql(
        """
          |SELECT oo.id, oo.name FROM old_orders oo
          |UNION
          |SELECT no.id, no.name FROM new_orders no
          |ORDER BY oo.is_old
        """.stripMargin).show()

The output of the above code is:

    +-------------+---------------+
    |           id|           name|
    +-------------+---------------+
    |old_order_id1|old_order_name1|
    |old_order_id2|old_order_name2|
    +-------------+---------------+
    
    
    cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
    'Sort ['oo.is_old ASC NULLS FIRST], true
    +- Distinct
       +- Union
          :- Project [id#121, name#122]
          :  +- SubqueryAlias oo
          :     +- SubqueryAlias old_orders
          :        +- LogicalRDD [id#121, name#122, is_old#123]
          +- Project [id#131, name#132]
             +- SubqueryAlias no
                +- SubqueryAlias new_orders
                   +- LogicalRDD [id#131, name#132, is_old#133]
    
    org.apache.spark.sql.AnalysisException: cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
    'Sort ['oo.is_old ASC NULLS FIRST], true
    +- Distinct
       +- Union
          :- Project [id#121, name#122]
          :  +- SubqueryAlias oo
          :     +- SubqueryAlias old_orders
          :        +- LogicalRDD [id#121, name#122, is_old#123]
          +- Project [id#131, name#132]
             +- SubqueryAlias no
                +- SubqueryAlias new_orders
                   +- LogicalRDD [id#131, name#132, is_old#133]

So ordering by a column that's not in the SELECT clause works if I'm not doing a UNION and it fails if I'm doing a UNION of two tables.

like image 314
Todor Kolev Avatar asked Sep 07 '25 12:09

Todor Kolev


1 Answers

The syntax of Spark SQL is very similar to SQL, but they are working very differently. Under the hood of Spark, its all about Rdds/dataframes.

After the UNION statement, a new dataframe is generated, and we are not able to refer the fields from the old table/dataframe if we did not select them.

How to fix:

spark.sql(
      """
        |SELECT id, name
        |FROM (
        | SELECT oo.id, oo.name, oo.is_old FROM old_orders oo
        | UNION
        | SELECT no.id, no.name, no.is_old FROM new_orders no
        | ORDER BY oo.is_old
        | ) t
      """.stripMargin).show()

Thanks.

like image 145
L.Li Avatar answered Sep 11 '25 23:09

L.Li