spark - Calculating average of values in 2 or more columns and putting in new column in every row [duplicate]

Suppose I have a Dataset/DataFrame with the following contents:

name, marks1, marks2
Alice, 10, 20
Bob, 20, 30

I want to add a new column that holds the average of the marks1 and marks2 columns.

Expected result:

name, marks1, marks2, Result(Avg)
Alice, 10, 20, 15
Bob, 20, 30, 25

For summing or any other arithmetic operation I use df.withColumn("xyz", $"marks1" + $"marks2"). I cannot find a similar way for averaging. Please help.

Additionally, the number of columns is not fixed: sometimes it might be the average of 2 columns, sometimes 3 or even more. So I want generic code that works in all cases.

Akshat Chaturvedi asked Oct 16 '25 at 11:10

1 Answer

One of the easiest and most efficient ways is to create a list of the marks columns and use it with withColumn, as follows.

PySpark

from pyspark.sql.functions import col

# Build the list of columns to average
marksColumns = [col('marks1'), col('marks2')]

# Sum the columns and divide by their count; Python's built-in sum
# works here because Spark Columns support the + operator
averageFunc = sum(x for x in marksColumns) / len(marksColumns)

df.withColumn('Result(Avg)', averageFunc).show(truncate=False)

and you should get

+-----+------+------+-----------+
|name |marks1|marks2|Result(Avg)|
+-----+------+------+-----------+
|Alice|10    |20    |15.0       |
|Bob  |20    |30    |25.0       |
+-----+------+------+-----------+
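Since you mentioned the number of columns isn't fixed, the same pattern generalizes by building the list dynamically instead of hard-coding it. A minimal sketch, assuming every marks column shares the 'marks' name prefix (adjust the predicate to match your actual schema):

from pyspark.sql.functions import col

# Pick up every column whose name starts with 'marks' (an assumption;
# change the filter to however your marks columns are actually named)
marksColumns = [col(c) for c in df.columns if c.startswith('marks')]

# Same sum-then-divide trick, now independent of the column count
averageFunc = sum(marksColumns) / len(marksColumns)

df.withColumn('Result(Avg)', averageFunc).show(truncate=False)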

Scala Spark

The process is almost the same in Scala as in Python above:

import org.apache.spark.sql.functions.{col, lit}

// Build the array of columns to average
val marksColumns = Array(col("marks1"), col("marks2"))

// Fold the columns into a single sum expression, then divide by the count
val averageFunc = marksColumns.foldLeft(lit(0)){(x, y) => x + y} / marksColumns.length

df.withColumn("Result(Avg)", averageFunc).show(false)

which should give you the same output as in PySpark.
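One caveat: column addition propagates nulls, so a single missing mark makes the whole average null for that row. If you instead want the average over only the non-null marks, here is a hedged PySpark sketch (column discovery as in the generic version above; under Spark's default non-ANSI settings, dividing by a zero count yields null rather than an error):

from pyspark.sql.functions import coalesce, col, lit, when

marksColumns = [col(c) for c in df.columns if c.startswith('marks')]

# Sum only the present values, and count how many are present per row
total = sum(coalesce(c, lit(0)) for c in marksColumns)
present = sum(when(c.isNotNull(), lit(1)).otherwise(lit(0)) for c in marksColumns)

# Rows where every mark is null get a null average (0 divided by 0)
df.withColumn('Result(Avg)', total / present).show(truncate=False)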

I hope the answer is helpful.

Ramesh Maharjan answered Oct 19 '25 at 10:10