Compare 2 Spark RDDs to make sure that the values from the first are in the ranges of the second RDD

Tags:

apache-spark

There are 2 very large RDDs (each has more than a million records). The first is:

rdd1.txt (name, value):
chr1    10016 
chr1    10017 
chr1    10018 
chr1    20026 
chr1    20036 
chr1    25016 
chr1    26026
chr2    40016 
chr2    40116 
chr2    50016 
chr3    70016 

rdd2.txt (name, min, max):
chr1     10000  20000
chr1     20000  30000
chr2     40000  50000
chr2     50000  60000
chr3     70000  80000
chr3    810001  910000
chr3    860001  960000
chr3    910001  1010000

A value is valid only when it falls between the min and max of a matching record in the second RDD; each valid value increases that name's count by 1.

Taking the above as an example, chr1's count would be 7.

How can I get this result in Scala with Spark?

Many thanks.


1 Answer

Try:

import spark.implicits._  // required for toDF and the $ column syntax

val rdd1 = sc.parallelize(Seq(
  ("chr1", 10016), ("chr1", 10017), ("chr1", 10018)))
val rdd2 = sc.parallelize(Seq(
  ("chr1", 10000, 20000), ("chr1", 20000, 30000)))

// Join the two datasets on name, then keep only the rows
// whose value falls inside the [min, max] range
rdd1.toDF("name", "value").join(rdd2.toDF("name", "min", "max"), Seq("name"))
  .where($"value".between($"min", $"max"))