Delete rows in PySpark dataframe based on multiple conditions

I have a dataframe with a structure similar to the following:

col1, col2, col3, col4
A,A,A,A
A,B,C,D
B,C,A,D
A,C,A,D
A,F,A,A
A,V,B,A

What I want is to drop the rows where multiple conditions are met at the same time. For example, drop rows where col1 == A and col2 == C simultaneously. Note that, in this case, the only row that should be dropped is "A,C,A,D", as it is the only one where both conditions are met. Hence, the dataframe should look like this:

col1, col2, col3, col4
A,A,A,A
A,B,C,D
B,C,A,D
A,F,A,A
A,V,B,A

What I've tried so far is:

# PySpark SQL functions import
import pyspark.sql.functions as F

df = df.filter(
               ((F.col("col1") != "A") & (F.col("col2") != "C"))
               )

This one doesn't filter as I want, because it removes all rows where either condition is met, like col1 == "A" or col2 == "C", returning:

col1, col2, col3, col4
B,C,A,D

Can anybody please help me out with this?

Thanks


2 Answers

Combine both conditions and do a NOT:

cond = (F.col('col1') == 'A') & (F.col('col2') == 'C')

df.filter(~cond)
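
For completeness, here is a minimal runnable sketch of this approach against the sample data from the question (the SparkSession setup is assumed and not part of the original answer):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [("A", "A", "A", "A"), ("A", "B", "C", "D"), ("B", "C", "A", "D"),
     ("A", "C", "A", "D"), ("A", "F", "A", "A"), ("A", "V", "B", "A")],
    ["col1", "col2", "col3", "col4"],
)

# Negate the combined condition: only rows matching BOTH are dropped
cond = (F.col("col1") == "A") & (F.col("col2") == "C")
df.filter(~cond).show()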


Alternatively, flag the rows to keep with when, then filter on the flag and drop it:

from pyspark.sql.functions import when, lit

# Rows matching both conditions get a null flag and are filtered out
df.withColumn('Result', when((df.col1 != 'A') | (df.col2 != 'C'), lit(True))) \
  .filter('Result = true').drop('Result').show()
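
This works because of De Morgan's law: the negation of (col1 == 'A') & (col2 == 'C') is (col1 != 'A') | (col2 != 'C'). So the direct fix to the attempt in the question is simply to swap & for | (assuming the same import pyspark.sql.functions as F):

df = df.filter((F.col("col1") != "A") | (F.col("col2") != "C"))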


