I have two data sets (1M unique string) and (1B unique string); I want to know how many strings are common in both sets, and wondering what is the most efficient way to get the number using Apache Pig?
You can first join both the file like below:
A = LOAD '/joindata1.txt' AS (a1:int,a2:int,a3:int);
B = LOAD '/joindata2.txt' AS (b1:int,b2:int);
X = JOIN A BY a1, B BY b1;
Then you can count the number of rows :
grouped_records = GROUP X ALL;
count_records = FOREACH grouped_records GENERATE COUNT(A.a1);
Does it help you problem...
Your case doesn't fall under either replicate or merge or skewed join. So you have to do a default join, where in map phase it annotates each record's source, Join key would be used as the shuffle key so that the same join key goes to same reducer then the leftmost input is cached in memory in the reducer side and the other input is passed through to do a join. You could also improve your join by normal join optimizations like filter NULL's before joining and table which has the largest number of tuples per key could be kept as the last table in your query.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With