Delete rows in PySpark dataframe based on multiple conditions

I have a dataframe with a structure similar to the following:

col1, col2, col3, col4
A,A,A,A
A,B,C,D
B,C,A,D
A,C,A,D
A,F,A,A
A,V,B,A

What I want is to drop the rows where multiple conditions are met at the same time. For example, drop rows where col1 == A and col2 == C simultaneously. Note that, in this case, the only row that should be dropped is "A,C,A,D", as it is the only one where both conditions are met. Hence, the dataframe should look like this:

col1, col2, col3, col4
A,A,A,A
A,B,C,D
B,C,A,D
A,F,A,A
A,V,B,A

What I've tried so far is:

# PySpark SQL functions import
import pyspark.sql.functions as F

df = df.filter(
               ((F.col("col1") != "A") & (F.col("col2") != "C"))
               )

This one doesn't filter as I want, because it removes all rows where either condition is met, like col1 == "A" or col2 == "C", returning:

col1, col2, col3, col4
B,C,A,D

Can anybody please help me out with this?

Thanks


2 Answers

Combine both conditions and do a NOT:

cond = (F.col('col1') == 'A') & (F.col('col2') == 'C')

df.filter(~cond)
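
For completeness, here is a minimal runnable sketch of this approach against the sample data from the question (the SparkSession setup is assumed and not part of the original answer):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [("A", "A", "A", "A"), ("A", "B", "C", "D"), ("B", "C", "A", "D"),
     ("A", "C", "A", "D"), ("A", "F", "A", "A"), ("A", "V", "B", "A")],
    ["col1", "col2", "col3", "col4"],
)

# Negate the combined condition: only rows matching BOTH are dropped
cond = (F.col("col1") == "A") & (F.col("col2") == "C")
df.filter(~cond).show()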


Alternatively, flag the rows to keep with when, then filter on the flag and drop it:

from pyspark.sql.functions import when, lit

# Rows matching both conditions get a null flag and are filtered out
df.withColumn('Result', when((df.col1 != 'A') | (df.col2 != 'C'), lit(True))) \
  .filter('Result = true').drop('Result').show()
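
This works because of De Morgan's law: the negation of (col1 == 'A') & (col2 == 'C') is (col1 != 'A') | (col2 != 'C'). So the direct fix to the attempt in the question is simply to swap & for | (assuming the same import pyspark.sql.functions as F):

df = df.filter((F.col("col1") != "A") | (F.col("col2") != "C"))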


