Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive Merge all Partitions using HIVE CONCATENATE

I have an hive external table, that is partitioned on the source system, meaning data will be coming from multiple source systems. The data directory is structure is:

/app/hadoop/MyProject/SchemaName/TableName/PartitionColumn=SoruceSystem1

/app/hadoop/MyProject/SchemaName/TableName/PartitionColumn=SoruceSystem2

/app/hadoop/MyProject/SchemaName/TableName/PartitionColumn=SoruceSystem3

...

/app/hadoop/MyProject/SchemaName/TableName/PartitionColumn=SoruceSystemN

and all the data that comes into the SourceSystem folders are streaming data, so we get a lot of files under each source system :).

I am looking at merging all of these files once a day, for Example : all the files in SourceSystem1 will be merged and the merged file stays in the SoruceSystem1 folder and so on for others.

Usually: alter table schema.table PARTITION(PartitionColumn={SourceSystemName}) CONCATENATE; works well when there is only one folder, but I need to do it for all the folders at one go.

Workaround would be to write a shell script to loop in through all partitions and then have this statement repeated for every source system name, but am looking for something which comes out of the box to solve this use case.

Any help here is greatly appreciated.

like image 373
suhas Avatar asked Oct 24 '25 22:10

suhas


1 Answers

IMPORTANT - If your table schema has evolved (e.g. new columns added) but still contains files using the old schema. CONCATENATE might silently drop those files. See Bug report. Workarounds were done to fail this operation if the table is unmanaged, and was finally fixed in Hive 3.0.0.

Now that that's out of the way, we need to do this in two steps.

First, we obtain the partitions within the table in question and write them to a text file that we can refer to later.

beeline --showHeader=false --outputformat=tsv2 --silent=true -e "SHOW PARTITIONS database.table" > found_partitions.txt

This will write the list of found partitions without a header or frame.

Next, we need to iterate through the list of partitions, swap potential partition separator (part1=some/part2=thing) with a comma, since the former isn't a legal Hive character. This does nothing if you only have a single partition structure in your table. We are also assuming that all your partitions are strings and will need to be surrounded with quotes.

#!/bin/bash

for line in `cat found_partitions.txt`; do
    echo "the next partition is $line"
    partition=`(echo $line | sed -e 's/\//,/g' -e "s/=/='/g" -e "s/,/',/g")`\'
    beeline -e "ALTER TABLE database.table PARTITION($partition) CONCATENATE" 
done

Note: You might have to set some configurations for beeline to work for you. Might wanna set an alias for this.

 beeline -u "jdbc:hive2://<SERVER>:<PORT>/;serviceDiscoveryMode=<zooKeeper>;zooKeeperNamespace=<hiveserver2>;principal=<USER>;transportMode=<SOMETHING>;httpPath=<SOMETHING>"
like image 69
kfkhalili Avatar answered Oct 27 '25 12:10

kfkhalili



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!