Currently inside transformation I am reading one file and creating a HashMap and it is an Static field for re-using purpose.
For each and every record I need to check against the HashMap<> contains the corresponding key or not. If it matches with record key then get the value from HashMap.
What is the best way to do this?
Should i broadcast this HashMap and use it inside Transformation? [HashMap or ConcurrentHashMap]
Does Broadcast will make sure the HashMap always contains the value.
Is there any scenario like HashMap become empty and we need to handle that check as well? [ if it's empty load it again ]
Update:
Basically i need to use HashMap as a lookup inside transformation. What is the best way to do? Broadcast or static variable?
When i use Static variable for few records i am not getting correct value from HashMap.HashMap contains only 100 elements. But i am comparing this with 25 Million records.
First of all, a broadcast variable can be used only for reading purposes, not as a global variable, that can be modified in classic programming (one thread, one computer, procedural programming, etc...). Indeed, you can use a global variable in your code and it can be utilized in any part of it (even inside maps), but never modified.
As you can see here Advantages of broadcast variables, they boost the performance because having a cached copy of the data in all nodes, allow you to avoid transporting repeatedly the same object to every node.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
For example.
rdd = sc.parallelize(range(1000))
broadcast = sc.broadcast({"number":1, "value": 4})
rdd = rdd.map(lambda x: x + broadcast.value["value"])
rdd.collect()
As you can see I access the value inside the dictionary in every iteration of the transformation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With