Let's say we have a DataFrame
like this:
+--------+--------------+-----+--------------------+
|aid |bid |value| time|
+--------+--------------+-----+--------------------+
| 1| 1| 81.0|2006-08-25 14:13:...|
| 1| 1| 81.0|2006-08-25 14:27:...|
| 1| 2| 81.0|2006-08-25 14:56:...|
| 1| 2| 81.0|2006-08-25 15:00:...|
| 1| 3| 81.0|2006-08-25 15:31:...|
| 1| 3| 81.0|2006-08-25 15:38:...|
| 1| 4| 0.0|2006-08-30 11:59:...|
| 1| 4| 0.0|2006-08-30 13:59:...|
| 2| 1| 0.0|2006-08-30 12:11:...|
| 2| 1| 0.0|2006-08-30 14:13:...|
| 2| 2| 0.0|2006-08-30 12:30:...|
| 2| 2| 0.0|2006-08-30 14:30:...|
| 2| 3| 0.0|2006-09-05 12:29:...|
| 2| 3| 0.0|2006-09-05 14:31:...|
| 3| 1| 0.0|2006-09-05 12:42:...|
| 3| 1| 0.0|2006-09-05 14:43:...|
+--------+--------------+-----+--------------------+
I know I can do this:
df_data.where(col('bid')
.isin([1,2,3])).show()
to select only the rows whose bid is one of [1, 2, 3].
However, I want to select a subset based on a list of tuples [(1,1), (2,2), (3,1)] for the two columns aid and bid.
So basically "something like"
df_data.where(col(['aid', 'bid'])
.isin([(1,1), (2,2), (3,1)])).show()
Is there a way to do this?
I could imagine something like this:
sql.sql('SELECT * FROM df_data WHERE (aid, bid) IN ((1,1))')
but this will throw:
AnalysisException: "cannot resolve '(struct(df_data.`aid`, df_data.`bid`) IN (struct(1, 1)))' due to data type mismatch: Arguments must be same type; line 1 pos 55"
I can think of three ways.

Option 1: Use reduce to combine all of the conditions.

The pseudocode (s, m) IN [(1,1), (2,2), (3,1)] is equivalent to:

(s == 1 and m == 1) or (s == 2 and m == 2) or (s == 3 and m == 1)

You can build each of these conditions with a list comprehension and OR them together using reduce.
from functools import reduce
import pyspark.sql.functions as f

check_list = [(1, 1), (2, 2), (3, 1)]
df.where(
reduce(
lambda u, v: u|v,
[(f.col("aid") == x) & (f.col("bid") == y) for (x,y) in check_list]
)
)\
.select("aid", "bid", "value")\
.show()
#+---+---+-----+
#|aid|bid|value|
#+---+---+-----+
#| 1| 1| 81.0|
#| 1| 1| 81.0|
#| 2| 2| 0.0|
#| 2| 2| 0.0|
#| 3| 1| 0.0|
#| 3| 1| 0.0|
#+---+---+-----+
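To see what the reduced condition is doing, the same OR-of-ANDs pattern can be sketched in plain Python, with hypothetical row dicts standing in for DataFrame rows:

```python
from functools import reduce

check_list = [(1, 1), (2, 2), (3, 1)]
rows = [{"aid": 1, "bid": 1}, {"aid": 1, "bid": 2},
        {"aid": 2, "bid": 2}, {"aid": 3, "bid": 1}]

# One predicate per tuple (defaults pin x and y at definition time),
# then OR them all together with reduce.
preds = [lambda r, x=x, y=y: r["aid"] == x and r["bid"] == y
         for (x, y) in check_list]
combined = lambda r: reduce(lambda u, v: u or v, [p(r) for p in preds])

kept = [r for r in rows if combined(r)]
```

Spark's reduce does the same thing, except the lambdas build Column expressions with | and & instead of evaluating Python booleans.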
Option 2: Create a temporary column as the string concatenation of the two id columns, joined by a delimiter that can't appear in either id (here a comma). Then check whether that string is in a list of strings built the same way.
check_list = [(1,1), (2,2), (3,1)]
check_list_str = [",".join([str(x) for x in item]) for item in check_list]
df.withColumn("combined_id", f.concat(f.col("aid"), f.lit(","), f.col("bid")))\
.where(f.col("combined_id").isin(check_list_str))\
.select("aid", "bid", "value")\
.show()
#+---+---+-----+
#|aid|bid|value|
#+---+---+-----+
#| 1| 1| 81.0|
#| 1| 1| 81.0|
#| 2| 2| 0.0|
#| 2| 2| 0.0|
#| 3| 1| 0.0|
#| 3| 1| 0.0|
#+---+---+-----+
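One thing to watch with this approach is the delimiter: concatenating the ids with no separator can make distinct pairs collide. A plain-Python sketch (combined_id is a hypothetical mirror of the concat expression above):

```python
def combined_id(aid, bid, sep=","):
    # Mirrors f.concat(f.col("aid"), f.lit(","), f.col("bid")) in plain Python.
    return f"{aid}{sep}{bid}"

# Without a separator, the distinct pairs (1, 23) and (12, 3) both become "123":
collision = f"{1}{23}" == f"{12}{3}"

# With the "," separator they stay distinct:
distinct = combined_id(1, 23) != combined_id(12, 3)
```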
Option 3: Create a udf to check the boolean condition.
from pyspark.sql.types import BooleanType

check_list = [(1, 1), (2, 2), (3, 1)]
check_id_isin = f.udf(lambda x, y: (x, y) in check_list, BooleanType())
df.where(check_id_isin(f.col("aid"), f.col("bid")))\
.select("aid", "bid", "value")\
.show()
#+---+---+-----+
#|aid|bid|value|
#+---+---+-----+
#| 1| 1| 81.0|
#| 1| 1| 81.0|
#| 2| 2| 0.0|
#| 2| 2| 0.0|
#| 3| 1| 0.0|
#| 3| 1| 0.0|
#+---+---+-----+
EDIT As @StefanFalk pointed out, one could write the udf
more generally as:
check_id_isin = f.udf(lambda *idx: idx in check_list, BooleanType())
Which will allow for a variable number of input parameters.
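This works because Python packs the positional arguments into a tuple, so idx can be compared directly against the tuples in check_list; a quick plain-Python check of that behavior:

```python
check_list = [(1, 1), (2, 2), (3, 1)]

# *idx packs the call's arguments into a tuple: check_id(2, 2) sees idx == (2, 2).
check_id = lambda *idx: idx in check_list

results = [check_id(1, 1), check_id(2, 2), check_id(1, 4)]
```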