I want to solve the following problem: filter a DataFrame so that all rows are removed in which the string in a given column is contained in a blacklist, which is given as a (potentially empty) array of strings.
For example: if the blacklist contains "forty two" and "twenty three", all rows are filtered out of the DataFrame in which the respective column contains either "forty two" or "twenty three".
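For concreteness, a minimal sample frame (a sketch, assuming a SparkSession named spark is in scope; the column name matches the code below):

import spark.implicits._

// Hypothetical sample data; with the blacklist above,
// only the rows "one" and "five" should survive the filter
val df = Seq("one", "forty two", "twenty three", "five").toDF("theStringColumn")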
The following code executes successfully if the blacklist is not empty (for example Array("forty two")), but fails if it is empty (Array.empty[String]):
// HELPERS
import org.apache.spark.sql.functions.{array, lit, udf}
import scala.collection.mutable

// UDF that tests whether an array column contains the given string
val containsStringUDF = udf(containsString(_: mutable.WrappedArray[String], _: String))
def containsString(array: mutable.WrappedArray[String], value: String) = array.contains(value)

// Lifts a local array into an array column of literals
def arrayCol[T](arr: Array[T]) = array(arr map lit: _*)

df.filter(!containsStringUDF(arrayCol[String](blacklist), $"theStringColumn"))
The error message is:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(), theStringColumn)' due to data type mismatch: argument 1 requires array<string> type, however, 'array()' is of array<null> type
It seems that empty array literals appear untyped to Spark. Is there a nice way to deal with this?
You are overthinking the problem. What you really need here is isin:
val blacklist = Seq("foo", "bar")
$"theStringColumn".isin(blacklist: _*)
Moreover, don't depend on the local type backing ArrayType being a WrappedArray. Just use Seq[String] as the UDF parameter type.

Finally, to answer your question, you can either:
array().cast("array<string>")
or:
import org.apache.spark.sql.types.{ArrayType, StringType}
array().cast(ArrayType(StringType))
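Either cast gives the empty literal a concrete element type, so the UDF call from the question resolves again. A sketch reusing the helpers defined above:

// With the cast, array() has type array<string> instead of array<null>,
// so the analyzer accepts it as the UDF's first argument
val emptyBlacklistCol = array().cast("array<string>")
df.filter(!containsStringUDF(emptyBlacklistCol, $"theStringColumn"))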