
Get minimum value from an Array in a Spark DataFrame column

I have a DataFrame with array columns.

val DF = Seq(
  ("123", "|1|2","3|3|4" ),
  ("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))

+-------------+---------+---------+
|id           |complete1|complete2|
+-------------+---------+---------+
|          123| [, 1, 2]|[3, 3, 4]|
|          124| [, 3, 2]| [, 3, 4]|
+-------------+---------+---------+

How do I extract the minimum of each array?

+-------------+---------+---------+
|id           |complete1|complete2|
+-------------+---------+---------+
|          123| 1       | 3       |
|          124| 2       | 3       |
+-------------+---------+---------+

I have tried defining a UDF to do this but I am getting an error.

def minArray(a:Array[String]) :String = a.filter(_.nonEmpty).min.mkString
val minArrayUDF = udf(minArray _)   
def getMinArray(df: DataFrame, i: Int): DataFrame = df.withColumn("complete" + i, minArrayUDF(df("complete" + i)))

val minDf = (1 to 2).foldLeft(DF){ case (df, i) => getMinArray(df, i)}

java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
nickfrenchy asked Oct 14 '25 03:10


1 Answer

Since Spark 2.4, you can use array_min to find the minimum value in an array. To use it here, first cast your arrays of strings to arrays of integers. The cast also takes care of the empty strings by converting them into null values, which array_min skips.

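import org.apache.spark.sql.functions.{array_min, expr}
// Cast each string array to array<int> (empty strings become null), then take the minimum.
DF.select($"id",
          array_min(expr("cast(complete1 as array<int>)")).as("complete1"),
          array_min(expr("cast(complete2 as array<int>)")).as("complete2"))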
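If you would rather keep the UDF approach from the question, the ClassCastException most likely comes from the parameter type: Spark hands array columns to Scala UDFs as a Seq (a WrappedArray), not as an Array. Below is a minimal sketch of the per-row logic in plain Scala, with no Spark dependency. The object name MinArrayExample and the Option[Int] return type are my own choices for illustration, not something from the question. It also parses the values as integers, since taking min directly on strings compares them lexicographically.

```scala
import scala.util.Try

// Hypothetical helper object, named here for illustration only.
object MinArrayExample {
  // Takes Seq[String] rather than Array[String], because Spark passes
  // array columns to Scala UDFs as a Seq (WrappedArray).
  // Entries that do not parse as Int (such as "") are skipped, which
  // mirrors how the cast-then-array_min approach ignores nulls.
  def minArray(values: Seq[String]): Option[Int] = {
    val parsed = values.flatMap(s => Try(s.toInt).toOption)
    if (parsed.isEmpty) None else Some(parsed.min)
  }
}
```

Wrapping this with udf(MinArrayExample.minArray _) in place of the original minArrayUDF should then avoid the cast error, though I have only checked the plain Scala part here. When array_min is available, the built-in function is usually the simpler choice.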
Semafoor answered Oct 21 '25 01:10


