 

Spark dataframe filter both nulls and spaces

I have a Spark DataFrame for which I need to filter out nulls and spaces in a particular column.

Let's say the dataframe has two columns. col2 contains both nulls and blanks.

col1   col2
1      abc
2      null
3      null
4   
5      def

I want to filter out the records that have col2 as null or blank. Can anyone please help with this?

Version: Spark 1.6.2, Scala 2.10

asked Oct 21 '25 by Ramesh


1 Answer

The standard logical operators are defined on Spark Columns:

scala> val myDF = Seq((1, "abc"),(2,null),(3,null),(4, ""),(5,"def")).toDF("col1", "col2")
myDF: org.apache.spark.sql.DataFrame = [col1: int, col2: string]

scala> myDF.show
+----+----+
|col1|col2|
+----+----+
|   1| abc|
|   2|null|
|   3|null|
|   4|    |
|   5| def|
+----+----+


scala> myDF.filter(($"col2" =!= "") && ($"col2".isNotNull)).show
+----+----+
|col1|col2|
+----+----+
|   1| abc|
|   5| def|
+----+----+

Note: depending on your Spark version you will need !== or =!=. Spark 1.x only has !==; =!= replaced it in Spark 2.0, where !== is deprecated.
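
Since the question targets Spark 1.6.2, here is a minimal sketch of the same filter with the older !== operator, adding trim so that values containing only spaces (like row 4 in the question) are dropped as well; it reuses the myDF defined above:

scala> import org.apache.spark.sql.functions.trim
import org.apache.spark.sql.functions.trim

scala> myDF.filter($"col2".isNotNull && (trim($"col2") !== "")).show
+----+----+
|col1|col2|
+----+----+
|   1| abc|
|   5| def|
+----+----+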

If you had n conditions to meet, I would probably use a list and reduce the boolean columns together:

val conds = List(myDF("a").contains("x"), myDF("b") =!= "y", myDF("c") > 2)

val filtered = myDF.filter(conds.reduce(_&&_))
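
As a usage sketch tying this back to the original problem, the same pattern can apply the null-and-blank check to several columns at once (col3 is hypothetical, for illustration):

// Build one null-and-blank condition per column, then AND them all together
val colsToCheck = List("col2", "col3")
val conds = colsToCheck.map(c => myDF(c).isNotNull && (myDF(c) =!= ""))
val filtered = myDF.filter(conds.reduce(_ && _))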
answered Oct 24 '25 by evan.oman


