I have data saved as Parquet files in Azure Blob Storage. The data is partitioned by year, month, day and hour, like:
cont/data/year=2017/month=02/day=01/
I want to create an external table in Hive using the following CREATE statement, which I wrote using this reference.
CREATE EXTERNAL TABLE table_name (uid string, title string, value string) 
PARTITIONED BY (year int, month int, day int) STORED AS PARQUET 
LOCATION 'wasb://cont@storage_name.blob.core.windows.net/data';
This creates the table, but queries return no rows. I tried the same CREATE statement without the PARTITIONED BY clause, and that seems to work, so it looks like the issue is with partitioning.
What am I missing?
We can't perform ALTER on a dynamic partition. Dynamic partitioning can be used on both external and managed Hive tables. To use dynamic partitioning in Hive, the partition mode (hive.exec.dynamic.partition.mode) generally needs to be set to nonstrict.
Parquet is supported by a plugin in Hive 0.10, 0.11, and 0.12 and natively in Hive 0.13 and later.
An ORC or Parquet file contains data columns. To these files you can add partition columns at write time. The data files do not store values for partition columns; instead, when writing the files you divide them into groups (partitions) based on column values.
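To make this concrete, here is a small Python sketch showing how the partition values can be read back from the directory names alone, without opening any data file. The function name `partition_values` is hypothetical and only illustrates the `key=value` directory layout from the question; it is not part of Hive.

```python
def partition_values(path):
    """Return the key=value directory components of a path as a dict."""
    values = {}
    for part in path.strip("/").split("/"):
        if "=" in part:
            # Split on the first "=" only, so values may contain "=".
            key, _, value = part.partition("=")
            values[key] = value
    return values


print(partition_values("cont/data/year=2017/month=02/day=01/"))
# -> {'year': '2017', 'month': '02', 'day': '01'}
```

Note that the values come back as strings, zero padding included; the table definition decides what type they are read as.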
After you create the partitioned table, run the following to add the directories as partitions:
MSCK REPAIR TABLE table_name;
If you have a large number of partitions, you might need to set hive.msck.repair.batch.size.
When there is a large number of untracked partitions, MSCK REPAIR TABLE can be run batch-wise to avoid an out-of-memory error (OOME). When a batch size is configured through the property hive.msck.repair.batch.size, the command processes the partitions in batches internally. The default value of the property is zero, which means all partitions are processed at once.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
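The batching idea described above can be sketched in a few lines of Python. This is only an illustration of the semantics in the quoted documentation (a batch size of zero meaning "everything in one go"), not Hive's actual implementation; the function name `batches` is made up for the example.

```python
def batches(partitions, batch_size):
    """Split a list of partitions into chunks of at most batch_size items.

    A batch_size of 0 (or less) means a single chunk holding everything,
    mirroring the documented meaning of the zero default.
    """
    if batch_size <= 0:
        return [list(partitions)] if partitions else []
    return [
        partitions[i:i + batch_size]
        for i in range(0, len(partitions), batch_size)
    ]


untracked = ["day=01", "day=02", "day=03"]
print(batches(untracked, 2))  # -> [['day=01', 'day=02'], ['day=03']]
print(batches(untracked, 0))  # -> [['day=01', 'day=02', 'day=03']]
```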
Written by the OP:
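This will probably fix your issue; however, if the data is very large, it won't work. See relevant issue here.
As a workaround, partitions can be added to the Hive metastore one at a time. Note that the partition spec has to list exactly the partition columns the table declares, so the statement below assumes a table whose PARTITIONED BY clause also includes hour: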
alter table table_name add partition(year=2016, month=10, day=11, hour=11)
We wrote a simple script to automate this ALTER statement, and it seems to work for now.
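The OP's script is not shown, so the following is only a minimal Python sketch of what such automation might look like. It assumes the table and storage location from the question, and the partition columns (year, month, day) declared in the CREATE statement there; the function name `add_partition_statements` is hypothetical. Each statement spells out its LOCATION explicitly, on the assumption that this is the safest way to match the zero-padded directory names (month=02) regardless of the default path Hive would choose.

```python
from datetime import date, timedelta

# Assumed values taken from the question; adjust for your own table.
TABLE = "table_name"
BASE = "wasb://cont@storage_name.blob.core.windows.net/data"


def add_partition_statements(start, end, table=TABLE, base=BASE):
    """Yield one ALTER TABLE ... ADD PARTITION statement per day in [start, end]."""
    day = start
    while day <= end:
        # Directory names are zero-padded, so build the path explicitly.
        location = (
            f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}"
        )
        yield (
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (year={day.year}, month={day.month}, day={day.day}) "
            f"LOCATION '{location}';"
        )
        day += timedelta(days=1)


if __name__ == "__main__":
    for stmt in add_partition_statements(date(2017, 2, 1), date(2017, 2, 3)):
        print(stmt)
```

The printed statements can be saved to a file and run with the Hive client of your choice. IF NOT EXISTS keeps the script safe to re-run after new days arrive.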