Copying files from S3 to Redshift is taking too long

I'm using the COPY command to load log files from my S3 bucket into a table in my Redshift cluster. Each file is approximately 100 MB, and I haven't gzipped them yet. I have 600 of these files now, and the number is still growing. My cluster has 2 dc1.large compute nodes and one leader node.

The problem is that the COPY operation takes too long: at least 40 minutes. What is the best approach to speed it up?

1) Should I get more nodes, or a better machine type for the nodes?

2) If I gzip the files, will it really matter in terms of COPY time?

3) Is there some design pattern that helps here?
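
To give an idea of what I'm running, the COPY is essentially just a plain load from the bucket prefix, along these lines (the table name, bucket, role, and delimiter below are placeholders, not my real ones):

    COPY logs
    FROM 's3://my-log-bucket/raw/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    DELIMITER '\t';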

asked by Rodrigo Ney

1 Answer

Rodrigo,

Here are the answers:

1 - There is probably some optimization you can do before changing your hardware setup. You would have to test to be sure, but once all of those optimizations are in place, if you still need better performance, I would suggest adding more nodes.

2 - Gzipped files are likely to give you a performance boost, but I suspect there are other optimizations you need to do first. See this recommendation in the Redshift documentation: http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-compress-data-files.html
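
For example, if the files are gzipped before (or while) being uploaded to S3, the only change to the load itself is adding the GZIP option to COPY. A minimal sketch, with placeholder names:

    -- Files under s3://my-log-bucket/raw/ are now *.gz
    COPY logs
    FROM 's3://my-log-bucket/raw/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    DELIMITER '\t'
    GZIP;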

3 - Here are the things you should look at, in order of importance:

  1. Distribution key -- Does your distribution key spread rows evenly across the slices? A "bad" distribution key would explain the problem you are seeing (see the table sketch after this list).
  2. Encoding -- Make sure the column encodings are optimal. Use the ANALYZE COMPRESSION command to get recommendations.
  3. Sort key -- Did you choose a sort key that is appropriate for this table? A good sort key can have a dramatic impact on compression, which in turn impacts read and write times.
  4. Vacuum -- If you have been running multiple tests against this table, did you vacuum between the tests? Redshift does not remove data after a delete or update (an update is processed as a delete plus an insert, not an in-place update).
  5. Multiple files -- You should load from a large number of files. You already do that, but it is good general advice for anyone loading data into Redshift.
  6. Manifest file -- Use a manifest file to let Redshift parallelize your load (see the manifest example after this list).
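
To make items 1 to 3 concrete, here is a rough sketch of the kind of table definition I have in mind. The column names, distribution key, sort key, and encodings are all assumptions for illustration, not a prescription; run ANALYZE COMPRESSION on your loaded data to get real encoding recommendations, and VACUUM between test loads:

    -- Illustrative only: pick the DISTKEY, SORTKEY, and encodings that fit your data.
    CREATE TABLE logs (
        request_time TIMESTAMP     ENCODE zstd,
        user_id      BIGINT        ENCODE zstd,
        url          VARCHAR(2048) ENCODE lzo,
        status_code  SMALLINT      ENCODE zstd
    )
    DISTKEY (user_id)         -- high-cardinality column, spreads rows evenly across slices
    SORTKEY (request_time);   -- time-ordered loads compress well and speed up range scans

    -- After loading a representative sample, ask Redshift what encodings it recommends:
    ANALYZE COMPRESSION logs;

    -- Between test loads, reclaim space from deletes/updates and re-sort:
    VACUUM logs;
    ANALYZE logs;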
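
For item 6, the manifest is just a JSON file in S3 listing the objects to load; COPY then reads exactly those files and spreads the work across slices. Again, the paths here are placeholders:

    -- s3://my-log-bucket/manifests/logs.manifest would contain JSON such as:
    -- {
    --   "entries": [
    --     {"url": "s3://my-log-bucket/raw/log-0001.gz", "mandatory": true},
    --     {"url": "s3://my-log-bucket/raw/log-0002.gz", "mandatory": true}
    --   ]
    -- }
    COPY logs
    FROM 's3://my-log-bucket/manifests/logs.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    DELIMITER '\t'
    GZIP
    MANIFEST;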

I would expect a load of 60GB to go faster than what you have seen, even in a 2-node cluster. Check these 6 items and let us know.

Thanks

@BigDataKid

answered by BigDataKid