Suppose I have a Dataset/DataFrame with the following contents:
name, marks1, marks2
Alice, 10, 20
Bob, 20, 30
I want to add a new column that holds the average of the marks1 and marks2 columns.
Expected result:
name, marks1, marks2, Result(Avg)
Alice, 10, 20, 15
Bob, 20, 30, 25
For summing or any other arithmetic operation I use df.withColumn("xyz", $"marks1" + $"marks2"), but I cannot find a similar way to take an average. Please help.
Additionally, the number of columns is not fixed: sometimes it might be the average of 2 columns, sometimes 3 or even more. So I want generic code that works regardless of the column count.
One of the easiest and most efficient ways is to build a list of the marks columns and use it with withColumn, like this:
from pyspark.sql.functions import col

# List the columns to average over
marksColumns = [col('marks1'), col('marks2')]

# Summing a list of Columns yields a single Column expression;
# dividing by the list length gives the row-wise average
averageFunc = sum(marksColumns) / len(marksColumns)

df.withColumn('Result(Avg)', averageFunc).show(truncate=False)
and you should get
+-----+------+------+-----------+
|name |marks1|marks2|Result(Avg)|
+-----+------+------+-----------+
|Alice|10 |20 |15.0 |
|Bob |20 |30 |25.0 |
+-----+------+------+-----------+
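Since the number of marks columns is not fixed, you can build the list dynamically instead of hard-coding it. Here is a minimal sketch, assuming the marks columns share a "marks" name prefix (adapt the filter to however your columns are actually named):

from pyspark.sql.functions import col

# Pick up every marks column dynamically; the 'marks' prefix is an
# assumption -- adjust the filter to match your actual column names
marksColumns = [col(c) for c in df.columns if c.startswith('marks')]

averageFunc = sum(marksColumns) / len(marksColumns)
df.withColumn('Result(Avg)', averageFunc).show(truncate=False)

The same withColumn call then works whether there are 2 marks columns or 20.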
The process is almost the same in Scala as in Python above:
import org.apache.spark.sql.functions.{col, lit}

// List the columns to average over
val marksColumns = Array(col("marks1"), col("marks2"))

// Fold the columns into a single sum expression, then divide by the count
val averageFunc = marksColumns.foldLeft(lit(0)) { (x, y) => x + y } / marksColumns.length

df.withColumn("Result(Avg)", averageFunc).show(false)
which should give you the same output as in PySpark.
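One caveat in either language: if any of the marks columns can be null, the sum (and therefore the average) for that row becomes null. A sketch that treats nulls as 0 via coalesce follows; whether zero is the right substitute for a missing mark in your data is an assumption on my part:

from pyspark.sql.functions import coalesce, col, lit

# Replace nulls with 0 before averaging; treating a missing mark as 0
# is an assumption -- impute differently if that is wrong for your data
marksColumns = [coalesce(col(c), lit(0)) for c in ['marks1', 'marks2']]

averageFunc = sum(marksColumns) / len(marksColumns)
df.withColumn('Result(Avg)', averageFunc).show(truncate=False)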
I hope the answer is helpful.