Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does Hive INSERT OVERWRITE guarantee no duplicate row?

When we use insert overwrite, we found a lot of duplicates. Our SQL is really simple:

insert overwrite table table_c
select 
    table_a.id
    table_a.name
from table_a
left outer join table_b
on table_a.id = table_b.id
where table_b.id is null;

table_a and table_b have no duplicates. When this sql executed, one task failed, but the entire job succeed finally. And we also found that three files under table_c location: 000000_0, 000000_0_copy_1, 000000_1. Does the tasked fail and retry cause duplicates?

Thanks so much.


1 Answers

Under certain circumstances even not failed INSERT OVERWRITE from de-duplicated table can cause duplicates.

For example this behavior: if you are inserting BIGINT values into Int, values will be silently truncated to fit int and newly produced values may be duplicated.

Also if you are using non-deterministic function like rand() in the DISTRIBUTE BY, container restarts can cause the same rows to be calculated differently and distributed to different reducer after restart, this will produce duplicates. But you are not using such function.

In your case it should be fine. Failed tasks are being restarted, their results discarded and should not affect final result. Failures are quite possible because of node went down, long running tasks without heartbeat, some runtime error which was not reproduced after restart, like OOM, task attempt was killed by other process, spot node in AWS cluster was lost, etc. If the job successfully executed, it should be fine. Also files are being written at the end of the whole job, they are being copy from temporary location into table location.

The reason can be duplication in join keys. Check this:

select id, count(*) cnt from table_b group by id having count(*)>1 order by cnt desc limit 100;
like image 133
leftjoin Avatar answered Dec 07 '25 21:12

leftjoin



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!