
pyspark bitwiseAND vs ampersand operator

I am trying to add a column to a DataFrame that indicates when two different values are both found in a nested array:

 expr1 = array_contains(df.child_list, "value1")
 expr2 = array_contains(df.child_list, "value2")

I got it to work with the ampersand operator:

 df.select(...).withColumn("boolTest", expr1 & expr2)

Then I tried to replace this with bitwiseAND, the thought being that I would eventually want to AND a list of these expressions together dynamically.

This fails with an error:

 df.select(...).withColumn("boolTest", expr1.bitwiseAND(expr2))

 cannot resolve ..... due to data type mismatch: '(array_contains(c1.`child_list`, 'value1') & array_contains(c1.`child_list`, 'value2'))' requires integral type, not boolean;;

What's the distinction, and what am I doing wrong?

asked by wrschneider

1 Answer

The & and | operators on BooleanType columns in PySpark act as logical AND and OR. In other words, they take True/False as input and produce True/False as output.

The bitwiseAND function does bit-by-bit AND'ing of two numeric values, so it takes two integers and outputs their bitwise AND. Because array_contains returns a BooleanType column, bitwiseAND rejects it with the "requires integral type, not boolean" error you saw.

Here is an example of each:

from pyspark.sql.types import *
from pyspark.sql.functions import *

schema = StructType([   
  StructField("b1", BooleanType()), 
  StructField("b2", BooleanType()),
  StructField("int1", IntegerType()), 
  StructField("int2", IntegerType())
])
data = [
  (True, True, 0x01, 0x01), 
  (True, False, 0xFF, 0xA), 
  (False, False, 0x01, 0x00)
]

df = sqlContext.createDataFrame(sc.parallelize(data), schema)


df2 = df.withColumn("logical", df.b1 & df.b2) \
        .withColumn("bitwise", df.int1.bitwiseAND(df.int2))

df2.printSchema()
df2.show()

root
 |-- b1: boolean (nullable = true)
 |-- b2: boolean (nullable = true)
 |-- int1: integer (nullable = true)
 |-- int2: integer (nullable = true)
 |-- logical: boolean (nullable = true)
 |-- bitwise: integer (nullable = true)

+-----+-----+----+----+-------+-------+
|   b1|   b2|int1|int2|logical|bitwise|
+-----+-----+----+----+-------+-------+
| true| true|   1|   1|   true|      1|
| true|false| 255|  10|  false|     10|
|false|false|   1|   0|  false|      0|
+-----+-----+----+----+-------+-------+

If you want to dynamically AND together a list of columns, you can do it like this:

from functools import reduce  # reduce must be imported in Python 3

columns = [col("b1"), col("b2")]
df.withColumn("result", reduce(lambda a, b: a & b, columns))
answered by Ryan Widmaier