I use Spark sql to load data into a val like this
val customers = sqlContext.sql("SELECT * FROM customers")
But I have a separate txt file that contains one column CUST_ID and 50,00 rows. i.e.
CUST_ID
1
2
3
I want my customers val to have all customers in customers table that are not in the TXT file.
Using Sql I would do this by SELECT * FROM customers NOT IN cust_id ('1','2','3')
How can I do this using Spark?
I've read the textFile and I can print rows of it but I'm not sure how to match this with my sql query
scala> val custids = sc.textFile("cust_ids.txt")
scala> custids.take(4).foreach(println)
CUST_ID
1
2
3
You can import your text file as a dataframe and do a left outer join:
val customers = Seq(("1", "AAA", "shipped"), ("2", "ADA", "delivered") , ("3", "FGA", "never received")).toDF("id","name","status")
val custId = Seq(1,2).toDF("custId")
customers.join(custId,'id === 'custId,"leftOuter")
.where('custId.isNull)
.drop("custId")
.show()
+---+----+--------------+
| id|name| status|
+---+----+--------------+
| 3| FGA|never received|
+---+----+--------------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With