
Creating a typed array column from an empty array

I want to solve the following problem: filter out of a DataFrame all rows in which the string contained in one column appears in a blacklist, which is given as a (potentially empty) array of strings.

For example: if the blacklist contains "forty two" and "twenty three", all rows are filtered out of the DataFrame in which the respective column contains either "forty two" or "twenty three".

The following code executes successfully if the blacklist is not empty (for example Array("forty two")) and fails otherwise (Array.empty[String]):

// HELPERS
import scala.collection.mutable
import org.apache.spark.sql.functions.{array, lit, udf}

def containsString(array: mutable.WrappedArray[String], value: String): Boolean = array.contains(value)
val containsStringUDF = udf(containsString(_: mutable.WrappedArray[String], _: String))

// builds an array column literal from a local array
def arrayCol[T](arr: Array[T]) = array(arr.map(lit): _*)

df.filter(!containsStringUDF(arrayCol[String](blacklist), $"theStringColumn"))

The error message is:

org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(), theStringColumn)' due to data type mismatch: argument 1 requires array<string> type, however, 'array()' is of array<null> type

It seems that empty arrays appear typeless to Spark. Is there a nice way to deal with this?

asked Jan 24 '26 by Elmar Macek

1 Answer

You are overthinking the problem. What you really need here is isin:

val blacklist = Seq("foo", "bar")

$"theStringColumn".isin(blacklist: _*)
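A minimal sketch of the isin approach on a toy DataFrame (the column and variable names are illustrative, and a local SparkSession is assumed; in spark-shell, `spark` already exists):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("isin-demo").getOrCreate()
import spark.implicits._

val df = Seq("forty two", "twenty three", "seven").toDF("theStringColumn")
val blacklist = Seq("forty two", "twenty three")

// keep only rows whose value is NOT in the blacklist
val kept = df.filter(!$"theStringColumn".isin(blacklist: _*))
```

Note that isin also copes with an empty Seq: `isin()` with no arguments evaluates to false for non-null values, so `!isin()` keeps every non-null row, which is exactly the desired behavior for an empty blacklist.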

Moreover, don't depend on the local representation of ArrayType being a WrappedArray; just use Seq in your UDF signatures.

Finally, to answer your question directly, you can use either:

array().cast("array<string>")

or:

import org.apache.spark.sql.types.{ArrayType, StringType}

array().cast(ArrayType(StringType))
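To show the cast in context, here is a sketch that plugs the typed empty array back into a UDF like the one in the question (names and the local SparkSession are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, udf}
import org.apache.spark.sql.types.{ArrayType, StringType}

val spark = SparkSession.builder().master("local[*]").appName("cast-demo").getOrCreate()
import spark.implicits._

val df = Seq("forty two", "seven").toDF("theStringColumn")

// Seq, not mutable.WrappedArray, as the answer recommends
val containsStringUDF = udf((arr: Seq[String], value: String) => arr.contains(value))

// array() alone is array<null>; the cast gives it the element type the UDF expects
val emptyBlacklist = array().cast(ArrayType(StringType))

// nothing is blacklisted, so every row passes the filter
val kept = df.filter(!containsStringUDF(emptyBlacklist, $"theStringColumn"))
```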

answered Jan 26 '26 by zero323