I have a PySpark DataFrame in which one column holds lists, some with entries and some empty. I want to efficiently filter out all rows where that list is empty.
import pyspark.sql.functions as sf
df.filter(sf.col('column_with_lists') != []) 
This gives me the following error:
Py4JJavaError: An error occurred while calling o303.notEqual.
: java.lang.RuntimeException: Unsupported literal type class
Perhaps I can check the length of the list and require it to be > 0 (see here). However, I am unsure how that syntax works with pyspark-sql, or whether filter even accepts a lambda.
To be clear, the DataFrame has multiple columns, but I want to apply the filter to a single one and drop every row where that column holds an empty list. The linked SO example filters on a single column.
Thanks in advance!
So it appears it is as simple as using the size function from sql.functions:
import pyspark.sql.functions as sf
df.filter(sf.size('column_with_lists') > 0)