I want to check whether an array contains a string in PySpark (Spark < 2.4).
Example Dataframe:
column_1 <Array> | column_2 <String>
--------------------------------------------
["2345","98756","8794"] | 8794
--------------------------------------------
["8756","45678","987563"] | 1234
--------------------------------------------
["3475","8956","45678"] | 3475
--------------------------------------------
I would like to compare the two columns column_1 and column_2. If column_1 contains column_2, I should drop that value from column_1. I wrote a UDF to subtract column_2 from column_1, but it is not working:
def contains(x, y):
    try:
        sx, sy = set(x), set(y)
        if len(sx) == 0:
            return sx
        elif len(sy) == 0:
            return sx
        else:
            return sx - sy
    # in exception, for example `x` or `y` is None (not a list)
    except:
        return sx

udf_contains = udf(contains, 'string')
new_df = my_df.withColumn('column_1', udf_contains(my_df.column_1, my_df.column_2))
Expected result:
column_1 <Array> | column_2 <String>
--------------------------------------------------
["2345","98756"] | 8794
--------------------------------------------------
["8756","45678","987563"] | 1234
--------------------------------------------------
["8956","45678"] | 3475
--------------------------------------------------
How can I do this, given that in some cases column_1 is [] and column_2 is null? Thank you.
Spark 2.4.0+
Try array_remove. It is available since Spark 2.4.0:
val df = Seq(
  (Seq("2345","98756","8794"), "8794"),
  (Seq("8756","45678","987563"), "1234"),
  (Seq("3475","8956","45678"), "3475"),
  (Seq(), "empty"),
  (null, "null")
).toDF("column_1", "column_2")

df.show(5, false)

df.select(
  $"column_1",
  $"column_2",
  array_remove($"column_1", $"column_2") as "diff"
).show(5, false)
It will return:
+---------------------+--------+
|column_1 |column_2|
+---------------------+--------+
|[2345, 98756, 8794] |8794 |
|[8756, 45678, 987563]|1234 |
|[3475, 8956, 45678] |3475 |
|[] |empty |
|null |null |
+---------------------+--------+
+---------------------+--------+---------------------+
|column_1 |column_2|diff |
+---------------------+--------+---------------------+
|[2345, 98756, 8794] |8794 |[2345, 98756] |
|[8756, 45678, 987563]|1234 |[8756, 45678, 987563]|
|[3475, 8956, 45678] |3475 |[8956, 45678] |
|[] |empty |[] |
|null |null |null |
+---------------------+--------+---------------------+
Sorry for the Scala; it should be easy to do the same in PySpark.
Spark < 2.4.0
%pyspark
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

data = [
    (["2345","98756","8794"], "8794"),
    (["8756","45678","987563"], "1234"),
    (["3475","8956","45678"], "3475"),
    ([], "empty"),
    (None, "null")
]
df = spark.createDataFrame(data, ['column_1', 'column_2'])
df.printSchema()
df.show(5, False)

def contains(x, y):
    # pass the row through untouched when either side is null
    if x is None or y is None:
        return x
    else:
        # wrap the scalar y in a list so set() doesn't split the string into characters
        sx, sy = set(x), set([y])
        return list(sx - sy)

udf_contains = udf(contains, ArrayType(StringType()))

df.select("column_1", "column_2", udf_contains("column_1", "column_2")).show(5, False)
result:
root
|-- column_1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- column_2: string (nullable = true)
+---------------------+--------+
|column_1 |column_2|
+---------------------+--------+
|[2345, 98756, 8794] |8794 |
|[8756, 45678, 987563]|1234 |
|[3475, 8956, 45678] |3475 |
|[] |empty |
|null |null |
+---------------------+--------+
+---------------------+--------+----------------------------+
|column_1 |column_2|contains(column_1, column_2)|
+---------------------+--------+----------------------------+
|[2345, 98756, 8794] |8794 |[2345, 98756] |
|[8756, 45678, 987563]|1234 |[8756, 987563, 45678] |
|[3475, 8956, 45678] |3475 |[8956, 45678] |
|[] |empty |[] |
|null |null |null |
+---------------------+--------+----------------------------+
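Note that the set-difference UDF does not preserve element order (row 2 above comes back as [8756, 987563, 45678]). If order matters, a plain list comprehension in the UDF body keeps it. A minimal sketch of just the Python function (contains_keep_order is a hypothetical name; wrap it with udf(..., ArrayType(StringType())) exactly as above):

```python
def contains_keep_order(x, y):
    # pass the row through untouched when either side is null
    if x is None or y is None:
        return x
    # a list comprehension keeps the original element order,
    # unlike the set difference used above
    return [e for e in x if e != y]
```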