We plan to move Apache Pig code to the new Spark platform.
Pig has a "Bag/Tuple/Field" concept and behaves similarly to a relational database. Pig provides support for CROSS/INNER/OUTER joins.
For CROSS JOIN, we can use alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
But as we move to the Spark platform I couldn't find any counterpart in the Spark API. Do you have any idea?
5) Cross Join: Cross join basically computes a cartesian product of 2 tables. Say you have m rows in 1 table n rows in another, this would give (m*n) rows. So imagine a small table 10,000 customer table joined with a products table of 1000 records would give an exploding 10,000,000 records!
Spark DataFrame supports all basic SQL Join Types like INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS, SELF JOIN.
The Spark SQL supports several types of joins such as inner join, cross join, left outer join, right outer join, full outer join, left semi-join, left anti join. Joins scenarios are implemented in Spark SQL based upon the business use case. Some of the joins require high resource and computation efficiency.
Anti Join. An anti join returns values from the left relation that has no match with the right. It is also referred to as a left anti join.
It is oneRDD.cartesian(anotherRDD).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With