There are two very large RDDs (each with more than a million records). The first is:
rdd1.txt (name, value):
chr1 10016
chr1 10017
chr1 10018
chr1 20026
chr1 20036
chr1 25016
chr1 26026
chr2 40016
chr2 40116
chr2 50016
chr3 70016
rdd2.txt (name, min, max):
chr1 10000 20000
chr1 20000 30000
chr2 40000 50000
chr2 50000 60000
chr3 70000 80000
chr3 810001 910000
chr3 860001 960000
chr3 910001 1010000
A value is valid only when it falls within the min and max of a row of the second RDD with the same name; each valid value adds 1 to that name's count.
Taking the above as an example, chr1 occurs 7 times.
How can I get this result in Scala with Spark?
Many thanks.
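To make the expected counts concrete, here is a minimal plain-Scala sketch (no Spark) of the validity rule applied to the full sample data. The names `values`, `ranges`, and `counts` are illustrative, not from the post; it counts each value once if it lands in at least one matching range, which agrees with the example (chr1 -> 7):

```scala
// Sample data from the question: (name, value) and (name, min, max).
val values = Seq(
  ("chr1", 10016), ("chr1", 10017), ("chr1", 10018),
  ("chr1", 20026), ("chr1", 20036), ("chr1", 25016), ("chr1", 26026),
  ("chr2", 40016), ("chr2", 40116), ("chr2", 50016),
  ("chr3", 70016))
val ranges = Seq(
  ("chr1", 10000, 20000), ("chr1", 20000, 30000),
  ("chr2", 40000, 50000), ("chr2", 50000, 60000),
  ("chr3", 70000, 80000), ("chr3", 810001, 910000),
  ("chr3", 860001, 960000), ("chr3", 910001, 1010000))

// A value is valid if some range with the same name contains it (inclusive).
val counts: Map[String, Int] = values
  .filter { case (name, v) =>
    ranges.exists { case (n, min, max) => n == name && v >= min && v <= max }
  }
  .groupBy(_._1)
  .map { case (name, vs) => name -> vs.size }
// counts == Map("chr1" -> 7, "chr2" -> 3, "chr3" -> 1)
```

This is only a local sanity check of the logic; for million-row data the Spark join in the answer below scales where this in-memory scan does not.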
Try converting both RDDs to DataFrames, joining on the name, filtering on the range, and aggregating (the `$` syntax needs `spark.implicits._` in scope):

import spark.implicits._

val rdd1 = sc.parallelize(Seq(
  ("chr1", 10016), ("chr1", 10017), ("chr1", 10018)))
val rdd2 = sc.parallelize(Seq(
  ("chr1", 10000, 20000), ("chr1", 20000, 30000)))

rdd1.toDF("name", "value")
  .join(rdd2.toDF("name", "min", "max"), Seq("name"))
  .where($"value".between($"min", $"max"))
  .groupBy("name")
  .count()

Note that `between` is inclusive on both ends, so a value sitting exactly on a shared boundary of two overlapping ranges (e.g. 20000 here) would match both join rows and be counted twice; deduplicate before the aggregation if each value should count at most once.