Inner join two data sets using Apache Hadoop Pig

Question

I have two data sets, one with 1M unique strings and one with 1B unique strings. I want to know how many strings appear in both sets. What is the most efficient way to get that number using Apache Pig?

Amaresh · Accepted Answer

You can first join the two files like this:

A = LOAD '/joindata1.txt' AS (a1:int,a2:int,a3:int);
B = LOAD '/joindata2.txt' AS (b1:int,b2:int);
X = JOIN A BY a1, B BY b1;

Then you can count the number of rows:

grouped_records = GROUP X ALL;
count_records = FOREACH grouped_records GENERATE COUNT(X);

Does that solve your problem?
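The example above uses integer columns, while the question is about sets of strings. A minimal sketch of the same join-and-count approach for strings follows, assuming each file holds one string per line. The paths and the alias names (small, big, joined) are hypothetical and not from the original post.

```pig
-- Hypothetical input paths; each file is assumed to contain one string per line.
small = LOAD '/data/strings_1m.txt' AS (s:chararray);
big   = LOAD '/data/strings_1b.txt' AS (s:chararray);

-- Inner join on the string itself. Because the strings are unique within
-- each set, every common string produces exactly one joined row.
joined = JOIN small BY s, big BY s;

-- Collect all joined rows into a single group and count them.
grouped = GROUP joined ALL;
common_count = FOREACH grouped GENERATE COUNT(joined);

DUMP common_count;
```

After GROUP ... ALL, the bag carries the name of the grouped relation (here joined), which is why COUNT is applied to joined rather than to one of the original inputs.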

Vignesh I · Answer

Your case doesn't fit a replicated, merge, or skewed join, so you have to use the default join. In the map phase, the default join annotates each record with its source. The join key is used as the shuffle key, so records with the same join key go to the same reducer. On the reducer side, the leftmost input is cached in memory and the other input is streamed through to perform the join. You can also apply the usual join optimizations: filter out NULLs before joining, and put the table with the largest number of tuples per key last in the join.
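The two optimizations mentioned above could be sketched roughly as follows. This is an illustration under assumptions, not a tuned script: the paths and alias names are hypothetical, and it assumes one string per line in each file.

```pig
-- Hypothetical input paths; one string per line is assumed.
small = LOAD '/data/strings_1m.txt' AS (s:chararray);
big   = LOAD '/data/strings_1b.txt' AS (s:chararray);

-- Drop NULL keys before the join so they are never shuffled to the reducers.
-- An inner join would discard them anyway.
small_clean = FILTER small BY s IS NOT NULL;
big_clean   = FILTER big BY s IS NOT NULL;

-- The last relation in a default join is streamed through rather than
-- cached, so the relation expected to be largest is listed last.
joined = JOIN small_clean BY s, big_clean BY s;

grouped = GROUP joined ALL;
common_count = FOREACH grouped GENERATE COUNT(joined);
DUMP common_count;
```

With unique strings on both sides, each key has at most one tuple per input, so the ordering mostly matters here as a habit; it has more effect when one input has many tuples per key.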

Tags:

hadoop

apache-pig

Lin Ma
