Let's say we have a DataFrame
like this:
+--------+--------------+-----+--------------------+
|aid |bid |value| time|
+--------+--------------+-----+--------------------+
| 1| 1| 81.0|2006-08-25 14:13:...|
| 1| 1| 81.0|2006-08-25 14:27:...|
| 1| 2| 81.0|2006-08-25 14:56:...|
| 1| 2| 81.0|2006-08-25 15:00:...|
| 1| 3| 81.0|2006-08-25 15:31:...|
| 1| 3| 81.0|2006-08-25 15:38:...|
| 1| 4| 0.0|2006-08-30 11:59:...|
| 1| 4| 0.0|2006-08-30 13:59:...|
| 2| 1| 0.0|2006-08-30 12:11:...|
| 2| 1| 0.0|2006-08-30 14:13:...|
| 2| 2| 0.0|2006-08-30 12:30:...|
| 2| 2| 0.0|2006-08-30 14:30:...|
| 2| 3| 0.0|2006-09-05 12:29:...|
| 2| 3| 0.0|2006-09-05 14:31:...|
| 3| 1| 0.0|2006-09-05 12:42:...|
| 3| 1| 0.0|2006-09-05 14:43:...|
+--------+--------------+-----+--------------------+
I know I can do this:
df_data.where(col('bid')
.isin([1,2,3])).show()
to select only the rows whose bid is one of [1, 2, 3].
However, I want to select a subset based on a list of tuples [(1,1), (2,2), (3,1)] for the two columns aid and bid.
So basically "something like"
df_data.where(col(['aid', 'bid'])
.isin([(1,1), (2,2), (3,1)])).show()
Is there a way to do this?
I could imagine something like this:
sql.sql('SELECT * FROM df_data WHERE (aid, bid) IN ((1,1))')
but this will throw:
AnalysisException: "cannot resolve '(struct(df_data.`aid`, df_data.`bid`) IN (struct(1, 1)))' due to data type mismatch: Arguments must be same type; line 1 pos 55"
I can think of three ways.

Option 1: Use reduce to combine all of the conditions.

The pseudocode (s, m) IN [(1,1), (2,2), (3,1)] is equivalent to:

(s == 1 and m == 1) or (s == 2 and m == 2) or (s == 3 and m == 1)

You can build each of these conditions with a list comprehension and OR them together using reduce.
from functools import reduce
import pyspark.sql.functions as f

check_list = [(1, 1), (2, 2), (3, 1)]
df.where(
reduce(
lambda u, v: u|v,
[(f.col("aid") == x) & (f.col("bid") == y) for (x,y) in check_list]
)
)\
.select("aid", "bid", "value")\
.show()
#+---+---+-----+
#|aid|bid|value|
#+---+---+-----+
#| 1| 1| 81.0|
#| 1| 1| 81.0|
#| 2| 2| 0.0|
#| 2| 2| 0.0|
#| 3| 1| 0.0|
#| 3| 1| 0.0|
#+---+---+-----+
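To see what the reduced condition is doing, the same OR-of-ANDs pattern can be sketched in plain Python, with hypothetical row dicts standing in for DataFrame rows:

```python
from functools import reduce

check_list = [(1, 1), (2, 2), (3, 1)]
rows = [{"aid": 1, "bid": 1}, {"aid": 1, "bid": 2},
        {"aid": 2, "bid": 2}, {"aid": 3, "bid": 1}]

# One predicate per tuple (defaults pin x and y at definition time),
# then OR them all together with reduce.
preds = [lambda r, x=x, y=y: r["aid"] == x and r["bid"] == y
         for (x, y) in check_list]
combined = lambda r: reduce(lambda u, v: u or v, [p(r) for p in preds])

kept = [r for r in rows if combined(r)]
```

Spark's reduce does the same thing, except the lambdas build Column expressions with | and & instead of evaluating Python booleans.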
Option 2: Create a temporary column as the string concatenation of the two id columns, joined by a delimiter that can't appear in either id (here a comma). Then check whether that string is in a list of strings built the same way.
check_list = [(1,1), (2,2), (3,1)]
check_list_str = [",".join([str(x) for x in item]) for item in check_list]
df.withColumn("combined_id", f.concat(f.col("aid"), f.lit(","), f.col("bid")))\
.where(f.col("combined_id").isin(check_list_str))\
.select("aid", "bid", "value")\
.show()
#+---+---+-----+
#|aid|bid|value|
#+---+---+-----+
#| 1| 1| 81.0|
#| 1| 1| 81.0|
#| 2| 2| 0.0|
#| 2| 2| 0.0|
#| 3| 1| 0.0|
#| 3| 1| 0.0|
#+---+---+-----+
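One thing to watch with this approach is the delimiter: concatenating the ids with no separator can make distinct pairs collide. A plain-Python sketch (combined_id is a hypothetical mirror of the concat expression above):

```python
def combined_id(aid, bid, sep=","):
    # Mirrors f.concat(f.col("aid"), f.lit(","), f.col("bid")) in plain Python.
    return f"{aid}{sep}{bid}"

# Without a separator, the distinct pairs (1, 23) and (12, 3) both become "123":
collision = f"{1}{23}" == f"{12}{3}"

# With the "," separator they stay distinct:
distinct = combined_id(1, 23) != combined_id(12, 3)
```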
Option 3: Create a udf to check the boolean condition.
from pyspark.sql.types import BooleanType

check_list = [(1, 1), (2, 2), (3, 1)]
check_id_isin = f.udf(lambda x, y: (x, y) in check_list, BooleanType())
df.where(check_id_isin(f.col("aid"), f.col("bid")))\
.select("aid", "bid", "value")\
.show()
#+---+---+-----+
#|aid|bid|value|
#+---+---+-----+
#| 1| 1| 81.0|
#| 1| 1| 81.0|
#| 2| 2| 0.0|
#| 2| 2| 0.0|
#| 3| 1| 0.0|
#| 3| 1| 0.0|
#+---+---+-----+
EDIT As @StefanFalk pointed out, one could write the udf
more generally as:
check_id_isin = f.udf(lambda *idx: idx in check_list, BooleanType())
Which will allow for a variable number of input parameters.
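This works because Python packs the positional arguments into a tuple, so idx can be compared directly against the tuples in check_list; a quick plain-Python check of that behavior:

```python
check_list = [(1, 1), (2, 2), (3, 1)]

# *idx packs the call's arguments into a tuple: check_id(2, 2) sees idx == (2, 2).
check_id = lambda *idx: idx in check_list

results = [check_id(1, 1), check_id(2, 2), check_id(1, 4)]
```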