 

Merge two RDDs in Spark Scala

I have two RDDs.

rdd1: RDD[(String, String)]

key1, value11
key2, value12
key3, value13

rdd2: RDD[(String, String)]

key2, value22
key3, value23
key4, value24

I need to form another RDD with merged rows from rdd1 and rdd2; the output should look like:

key2, value12 ; value22
key3, value13 ; value23

So, basically it's nothing but taking the intersection of the keys of rdd1 and rdd2 and then joining their values. The values must keep their order, i.e. value(rdd1) followed by value(rdd2), not the reverse.

asked Dec 21 '25 by user2200660

2 Answers

I think this may be what you are looking for:

join(otherDataset, [numTasks])  

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key (an inner join, so only keys present in both datasets survive). Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

See the associated section of the docs
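To make the join semantics concrete without a Spark cluster, here is a minimal sketch of what an inner join does, written against plain Scala sequences of pairs rather than RDDs (the `innerJoin` helper and its names are illustrative, not Spark API); on actual pair RDDs the equivalent is simply `rdd1.join(rdd2)`:

```scala
object JoinSketch {
  // Inner-join two sequences of (key, value) pairs, keeping only keys
  // present on both sides and putting the left value before the right one,
  // as the question requires.
  def innerJoin(left: Seq[(String, String)],
                right: Seq[(String, String)]): Seq[(String, String)] = {
    // Index the right side by key; a key may map to several values.
    val rightByKey: Map[String, Seq[String]] =
      right.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    for {
      (k, v1) <- left
      v2      <- rightByKey.getOrElse(k, Seq.empty)
    } yield (k, s"$v1 ; $v2")
  }

  def main(args: Array[String]): Unit = {
    val rdd1 = Seq("key1" -> "value11", "key2" -> "value12", "key3" -> "value13")
    val rdd2 = Seq("key2" -> "value22", "key3" -> "value23", "key4" -> "value24")
    innerJoin(rdd1, rdd2).foreach(println)
  }
}
```

Running this prints only the keys common to both inputs, with the left value first, matching the asker's expected output.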

answered Dec 23 '25 by Angelo Genovese


Check join() in PairRDDFunctions:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
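One detail both the question and `join()` leave implicit: the join yields `(K, (V, W))` tuples, while the asker wants a single `"value1 ; value2"` string per key. A small mapping step covers that; the `mergeRow` helper below is an illustrative sketch (on an RDD the same function would be applied via `mapValues` after the join):

```scala
object MergeValues {
  // Turn one joined (key, (leftValue, rightValue)) tuple into the
  // (key, "leftValue ; rightValue") form requested in the question.
  def mergeRow(row: (String, (String, String))): (String, String) = {
    val (key, (v1, v2)) = row
    (key, s"$v1 ; $v2")
  }

  def main(args: Array[String]): Unit = {
    println(mergeRow(("key2", ("value12", "value22"))))
  }
}
```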

answered Dec 23 '25 by Patrick McGloin