I am not able to understand what this DISTRIBUTE BY clause does in Hive. I know the definition that says, if we have DISTRIBUTE BY (city), this would send each city in a different reducer but I am not getting the same. Let us consider the data as follows:
Say we have a table called data with columns username and amount:
+----------+--------+
| username | amount |
+----------+--------+
| user_1   | 25     |
+----------+--------+
| user_1   | 53     |
+----------+--------+
| user_1   | 28     |
+----------+--------+
| user_1   | 50     |
+----------+--------+
| user_2   | 20     |
+----------+--------+
| user_2   | 50     |
+----------+--------+
| user_2   | 10     |
+----------+--------+
| user_2   | 5      |
+----------+--------+
Now If I say -
SELECT username, SUM(amount) FROM data DISTRIBUTE BY (username)
Shouldn't this run 2 separate reducers? It is still running a single reducer and I don't know why. I thought this may have to do with clustering into buckets or partitioning but I tried everything, and it still runs a single reducer. Can anyone explain why?
The GROUP BY clause is used to group all the records in a result set using a particular collection column. It is used to query a group of records.
“clustered by” clause is used to divide the table into buckets. Each bucket will be saved as a file under table directory. Bucketing can be done along with partitioning or without partitioning on Hive tables. Bucketed tables will create almost equally distributed data file parts.
LIMIT clause is optional with the ORDER BY clause. LIMIT clause can be used to improve the performance.
The only thing DISTRIBUTE BY (city) says is that records with the same city will go to the same reducer. Nothing else.
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
A question by the OP:
Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ?
For 2 reasons:
In the beginning of hive DISTRIBUTE BY, SORT BY and CLUSTER BY where used to process data in a way that today is being done automatically (e.g. analytic functions https://oren.lederman.name/?p=32) 
You might want to stream you data through a script (Hive "Transform") and you want your script to process your data in certain groups and order. For that you can use DISTRIBUTE BY + SORT BY or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that you'll have the whole group in the same reducer. With SORT BY that you'll get all the records of a group continuously.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With