In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the content of the first one that is not in the second:
val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)
onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.
How can this be achieved with DataFrames in Spark version 1.3.0?
Pretty simple. Use except() to subtract one DataFrame from another, i.e. to find the difference between the two.
subtract() is applied to two RDDs and returns the elements present in the first RDD but not in the second. distinct() is applied to a single RDD and returns the unique elements of that RDD.
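To make the difference between the two calls concrete, here is a minimal sketch on plain RDDs. The object name RddSubtractSketch, the helper names onlyNew and unique, and the sample data are made up for illustration, and it assumes a local Spark installation on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object RddSubtractSketch {
  // Elements of `today` that do not appear in `yesterday`.
  def onlyNew(today: RDD[String], yesterday: RDD[String]): RDD[String] =
    today.subtract(yesterday)

  // Unique elements of a single RDD.
  def unique(data: RDD[String]): RDD[String] =
    data.distinct()

  def main(args: Array[String]): Unit = {
    // Assumption: run locally; adjust the master for a real cluster.
    val conf = new SparkConf().setAppName("rdd-subtract-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val today = sc.parallelize(Seq("a", "b", "c"))
    val yesterday = sc.parallelize(Seq("a", "b"))

    println(onlyNew(today, yesterday).collect().mkString(", "))
    println(unique(sc.parallelize(Seq("a", "a", "b"))).collect().sorted.mkString(", "))

    sc.stop()
  }
}
```

So subtract() answers "what is in the first but not the second", while distinct() only looks at one RDD.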
Spark SQL supports three types of set operators: EXCEPT (or MINUS), INTERSECT, and UNION.
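If you prefer to write the query in SQL, the EXCEPT operator can be used in roughly this way. This is only a sketch: it assumes Spark 2.0 or later (SparkSession and createOrReplaceTempView do not exist in 1.3.0), and the object name SqlExceptSketch, the view names today and yesterday, and the sample rows are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of Spark.
object SqlExceptSketch {
  // Registers both DataFrames as temporary views and runs EXCEPT over them.
  def onlyNew(spark: SparkSession, today: DataFrame, yesterday: DataFrame): DataFrame = {
    today.createOrReplaceTempView("today")
    yesterday.createOrReplaceTempView("yesterday")
    spark.sql("SELECT * FROM today EXCEPT SELECT * FROM yesterday")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally; adjust the master for a real cluster.
    val spark = SparkSession.builder()
      .appName("sql-except-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val today = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val yesterday = Seq((1, "a"), (2, "b")).toDF("id", "value")

    onlyNew(spark, today, yesterday).show()

    spark.stop()
  }
}
```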
According to the Scala API docs, doing:
dataFrame1.except(dataFrame2)
will return a new DataFrame containing the rows in dataFrame1 but not in dataFrame2.
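Applied to the question, that could look something like the sketch below. Note the assumptions: it builds the DataFrames through SparkSession, which only exists from Spark 2.0 on (on 1.3.x you would create them from a SQLContext instead, the except call itself is the same), and the object name ExceptSketch, the column names, and the sample rows are made up.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of Spark.
object ExceptSketch {
  // Rows of `today` that do not exist in `yesterday`.
  def onlyNew(today: DataFrame, yesterday: DataFrame): DataFrame =
    today.except(yesterday)

  def main(args: Array[String]): Unit = {
    // Assumption: run locally; adjust the master for a real cluster.
    val spark = SparkSession.builder()
      .appName("except-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val todayDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val yesterdayDF = Seq((1, "a"), (2, "b")).toDF("id", "value")

    val onlyNewData = onlyNew(todayDF, yesterdayDF)
    onlyNewData.show()

    spark.stop()
  }
}
```

Both DataFrames need the same number of columns with compatible types, since rows are compared as a whole.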