I am querying a large table that is partitioned on a field called day.
If I run a query:
select *
from my_table
where day in ('2016-04-01', '2016-03-01')
I get many mappers and reducers and the query takes a long time to run.
If, however, I write a query:
select *
from my_table
where day = '2016-04-01'
or day = '2016-03-01'
I get far less mappers and reducers and the query runs quickly. To me this suggests that in does not take advantage of partitions in a table. Can anyone confirm this and explain why?
Hive Version: 1.2.1
Hadoop Version: 2.3.4.7-4
Details:
I believe the relevant part of the execution plans are...
Using Where or
No filter operator at all
Using Where in Filter Operator
predicate: (day) IN ('2016-04-01', '2016-03-01') (type: boolean)
Statistics: Num rows: 100000000 Data size: 9999999999
The hive docs just say:
'What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns.'
But don't elaborate. I couldn't find any SO posts directly relating to this.
Thanks!
I am using Hive 1.1.0 with Cloudera 5.13.3 and IN follows the same optimization as the equal operator (=) according to the explain plans I ran in Hue.
My table is partitioned on LOAD_YEAR (SMALLINT) and LOAD_MONTH (TINYINT) and has these two partitions:
load_year=2018/load_month=10 (19,828,71 rows)load_year=2018/load_month=11 (702,856 rows)Below are various queries and their explain plans.
1. Equal (=) operator
Query:
SELECT ID
FROM TBL
WHERE LOAD_MONTH = 11Y
Explain Plan:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (load_month = 11) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
2. IN operator
Query (note that there is no month 12 in the data):
SELECT ID
FROM TBL
WHERE LOAD_MONTH IN (11Y, 12Y)
Explain Plan:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (load_month = 11) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
3. Equal (=) in conjunction with AND and OR
Query:
SELECT ID
FROM TBL
WHERE
(LOAD_YEAR = 2018S AND LOAD_MONTH = 11Y)
OR (LOAD_YEAR = 2019S AND LOAD_MONTH = 1Y)
Explain Plan:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (((load_year = 2018) and (load_month = 11)) or ((load_year = 2019) and (load_month = 1))) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
4. Arithmetic operation
Query:
SELECT ID
FROM TBL
WHERE (LOAD_YEAR * 100 + LOAD_MONTH) IN (201811, 201901)
Side note:
100 doesn't have a suffix, so it's an INT, and (LOAD_YEAR * 100 + LOAD_MONTH) is also an INT. This ensures that the result is accurate. Since LOAD_YEAR is a SMALLINT and LOAD_MONTH a TINYINT, arithmetic calculations on the two use SMALLINT for the results and the max value stored is 32,767 (not enough for yyyymm, which needs 6 digits, i.e., at least up to 999,999). With 100 as an INT, calculations are made with the INT type and allow numbers up to 2,147,483,647.
Explain Plan:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (201811) IN (201811, 201901) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
Summary
All these queries only scan the second partition, thereby avoiding the ~20 million rows in the other partition.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With