
Unable to perform aggregation on 2 values using groupByKey in spark using scala

This question is about groupByKey() in Spark using Scala.

Consider the data below:

Name,marks,value
Chris,30,1
Chris,35,1
Robert,12,1
Robert,20,1

I created the RDD below:

val dataRDD = sc.parallelize(List(("Chris",30,1),("Chris",35,1),("Robert",12,1),("Robert",20,1)))

I am trying to create a key-value pair RDD from it:

val kvRDD = dataRDD.map(rec => (rec._1, (rec._2, rec._3)))

Now I want the sum of both values for each key.

val sumRDD = kvRDD.groupByKey().map(rec => (rec._1,(rec._2._1.sum, rec._2._2.sum)))

However, I am facing the error below:

<console>:28: error: value _2 is not a member of Iterable[(Int, Int)]

Can't we achieve the required result using groupByKey?

Asked by Umar on Dec 17 '25

2 Answers

Rather than groupByKey, I would suggest the more efficient reduceByKey, which merges the values for each key map-side before the shuffle instead of moving every element across the network:

val dataRDD = sc.parallelize(Seq(
  ("Chris",30,1), ("Chris",35,1), ("Robert",12,1), ("Robert",20,1)
))

val kvRDD = dataRDD.map(rec => (rec._1, (rec._2, rec._3)))

val sumRDD = kvRDD.reduceByKey{ (acc, t) =>
  (acc._1 + t._1, acc._2 + t._2)
}

sumRDD.collect
// res1: Array[(String, (Int, Int))] = Array((Robert,(32,2)), (Chris,(65,2)))
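If you want the zero value to be explicit, or the result type to differ from the value type, the same aggregation can be written with aggregateByKey, which also combines map-side. A minimal sketch against the same kvRDD; the name sumRDD2 is just for illustration:

// Per-key sums via aggregateByKey; (0, 0) is the zero accumulator.
val sumRDD2 = kvRDD.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v._1, acc._2 + v._2),  // fold a value into a partition-local accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)         // merge accumulators across partitions
)

sumRDD2.collect
// Array((Robert,(32,2)), (Chris,(65,2)))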
Answered by Leo C on Dec 20 '25


After groupByKey, the value for each key is an Iterable[(Int, Int)], not a tuple, which is exactly what the error message says. Map over the Iterable to extract each field, then sum:

val sumRDD = kvRDD.groupByKey.map(rec => (rec._1, (rec._2.map(_._1).sum, rec._2.map(_._2).sum)))

// Output
scala> sumRDD.collect
res11: Array[(String, (Int, Int))] = Array((Robert,(32,2)), (Chris,(65,2)))
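The grouped values can also be unzipped into their two columns first, which reads a little more clearly; a minimal sketch against the same kvRDD (sumRDD2 is an illustrative name):

val sumRDD2 = kvRDD.groupByKey.map { case (k, vs) =>
  val (marks, values) = vs.unzip  // Iterable[(Int, Int)] -> (Iterable[Int], Iterable[Int])
  (k, (marks.sum, values.sum))
}

Note that either way groupByKey still shuffles every record, so reduceByKey (above) remains the better choice on large data.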
Answered by Manoj Kumar Dhakad on Dec 20 '25


