I'm doing a UNION of two temp tables and trying to order by column but spark complains that the column I am ordering by cannot be resolved. Is this a bug or I'm missing something?
lazy val spark: SparkSession = SparkSession.builder.master("local[*]").getOrCreate()
import org.apache.spark.sql.types.StringType
val oldOrders = Seq(
Seq("old_order_id1", "old_order_name1", "true"),
Seq("old_order_id2", "old_order_name2", "true")
)
val newOrders = Seq(
Seq("new_order_id1", "new_order_name1", "false"),
Seq("new_order_id2", "new_order_name2", "false")
)
val schema = new StructType()
.add("id", StringType)
.add("name", StringType)
.add("is_old", StringType)
val oldOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(oldOrders.map(x => Row(x:_*))), schema)
val newOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(newOrders.map(x => Row(x:_*))), schema)
oldOrdersDF.createOrReplaceTempView("old_orders")
newOrdersDF.createOrReplaceTempView("new_orders")
//ordering by column not in select works if I'm not doing UNION
spark.sql(
"""
|SELECT oo.id, oo.name FROM old_orders oo
|ORDER BY oo.is_old
""".stripMargin).show()
//ordering by column not in select doesn't work as I'm doing a UNION
spark.sql(
"""
|SELECT oo.id, oo.name FROM old_orders oo
|UNION
|SELECT no.id, no.name FROM new_orders no
|ORDER BY oo.is_old
""".stripMargin).show()
The output of the above code is:
+-------------+---------------+
| id| name|
+-------------+---------------+
|old_order_id1|old_order_name1|
|old_order_id2|old_order_name2|
+-------------+---------------+
cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
+- Union
:- Project [id#121, name#122]
: +- SubqueryAlias oo
: +- SubqueryAlias old_orders
: +- LogicalRDD [id#121, name#122, is_old#123]
+- Project [id#131, name#132]
+- SubqueryAlias no
+- SubqueryAlias new_orders
+- LogicalRDD [id#131, name#132, is_old#133]
org.apache.spark.sql.AnalysisException: cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
+- Union
:- Project [id#121, name#122]
: +- SubqueryAlias oo
: +- SubqueryAlias old_orders
: +- LogicalRDD [id#121, name#122, is_old#123]
+- Project [id#131, name#132]
+- SubqueryAlias no
+- SubqueryAlias new_orders
+- LogicalRDD [id#131, name#132, is_old#133]
So ordering by a column that's not in the SELECT clause works if I'm not doing a UNION and it fails if I'm doing a UNION of two tables.
The syntax of Spark SQL is very similar to SQL, but they are working very differently. Under the hood of Spark, its all about Rdds/dataframes.
After the UNION statement, a new dataframe is generated, and we are not able to refer the fields from the old table/dataframe if we did not select them.
How to fix:
spark.sql(
"""
|SELECT id, name
|FROM (
| SELECT oo.id, oo.name, oo.is_old FROM old_orders oo
| UNION
| SELECT no.id, no.name, no.is_old FROM new_orders no
| ORDER BY oo.is_old
| ) t
""".stripMargin).show()
Thanks.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With