Assume I have a Hive table that includes a TIMESTAMP column that is frequently (almost always) included in the WHERE clauses of a query. It makes sense to partition this table by the TIMESTAMP field; however, to keep to a reasonable cardinality, it makes sense to partition by day (not by the maximum resolution of the TIMESTAMP).
What's the best way to achieve this? Should I create an additional column (DATE) and partition on that? Or is there a way to achieve the partition without creating a duplicate column?
Its not a new column, but its a pseudo-column, You should re-create your table with adding the partitioning specification like this :
create table table_name (
  id                int,
  name              string,
  timestamp         string
)
partitioned by (date string)
Then you load the data creating the partitions dynamically like this
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM table_name_old tno
INSERT OVERWRITE TABLE table_name PARTITION(substring(timestamp,0,10))
       SELECT tno.id, tno.name, tno.timestamp;
Now if you select all from your table you will see a new column for the partition, but consider that a Hive partition is just a subdirectory and its not a real column, hence it does not affect the total table size only by some kilobytes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With