
How to split ORC file based on size?

I need to split a 5 GB ORC file into 5 files of 1 GB each. ORC files are splittable, but does that mean a file can only be split stripe by stripe? My requirement is to split the file based on size, not on stripes. If this is possible, please share an example.

Sham Desale asked Dec 19 '25 00:12



1 Answer

A common approach, given that your file could be 5 GB, 100 GB, 1 TB, 100 TB, and so on, is to create a Hive table pointing to this file and a second table pointing to a different directory, then copy the data from the first table to the second with a Hive INSERT statement.

At the beginning of the script, make sure you have the following Hive flags:

set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=1073741824;
set hive.merge.size.per.task=1073741824;

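With these settings, Hive runs an extra merge step when the average output file size is below hive.merge.smallfiles.avgsize, and it merges files toward the size given by hive.merge.size.per.task. Both are set to 1073741824 bytes (1 GB), so the output files should average roughly 1 GB each.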
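
Putting the steps above together, a minimal sketch of such a script might look like the following. The table names, columns, and HDFS paths are hypothetical placeholders (they are not from the question); the source table's LOCATION is assumed to be the directory that contains the existing ORC file, and the column list must match that file's actual schema.

```sql
-- Merge flags from above: ask Hive to merge small output files toward ~1 GB.
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=1073741824;
set hive.merge.size.per.task=1073741824;

-- Hypothetical source table over the directory holding the 5 GB ORC file.
CREATE EXTERNAL TABLE orc_source (
  id      BIGINT,   -- placeholder columns; use the file's real schema
  payload STRING
)
STORED AS ORC
LOCATION '/data/orc_source';

-- Hypothetical target table pointing to a different, empty directory.
CREATE EXTERNAL TABLE orc_target (
  id      BIGINT,
  payload STRING
)
STORED AS ORC
LOCATION '/data/orc_target';

-- Rewrite the data; the new files are written under /data/orc_target.
INSERT OVERWRITE TABLE orc_target
SELECT * FROM orc_source;
```

The resulting file sizes are approximate rather than exact, so it is worth listing the target directory after the job finishes to check what was produced.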

If you want to use only Java code, experiment with these configuration properties:

mapred.max.split.size
mapred.min.split.size

These are also worth reading:

  • split size vs block size
  • min split size
dbustosp answered Dec 20 '25 16:12
