I'm using the AWS COPY command to load log files from my S3 bucket into a table in my Redshift cluster. Each file is approximately 100 MB and I haven't gzipped them yet. I have 600 of these files now, and the number is still growing. My cluster has 2 dc1.large compute nodes and one leader node.
The problem is that the COPY operation takes too long, at least 40 minutes. What is the best approach to speed it up?
1) Should I get more nodes, or a better machine type for the nodes?
2) If I gzip the files, will it really matter in terms of COPY time?
3) Is there some design pattern that helps here?
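For reference, the COPY I run is roughly of this form (the table name, bucket, and IAM role below are placeholders, not my real ones):

    -- Load all objects under the logs/ prefix into the target table
    COPY app_logs
    FROM 's3://my-log-bucket/logs/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    DELIMITER '\t';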
Rodrigo,
Here are the answers:
1 - There is probably some optimization you can do before changing your hardware setup. You would have to test to be sure, but once all optimizations are in place, if you still need better performance, I would suggest adding more nodes.
2 - Gzipped files are likely to give you a performance boost (see the sketch after this answer). But I suspect there are other optimizations you should do first. See this recommendation in the Redshift documentation: http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-compress-data-files.html
3 - Here are the things you should look at, in order of importance:
I would expect a load of 60 GB to go faster than what you have seen, even on a 2-node cluster. Check these 6 items and let us know.
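To illustrate point 2: once the files are gzipped, the only change the COPY itself needs is the GZIP option. The object layout below (gzipped chunks under a common key prefix) and the names are assumptions for illustration, not your actual setup:

    -- Assumed layout: s3://my-log-bucket/logs/part-0000.gz, part-0001.gz, ...
    -- COPY loads every object matching the key prefix, in parallel across slices.
    COPY app_logs
    FROM 's3://my-log-bucket/logs/part-'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    DELIMITER '\t'
    GZIP;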
Thanks
@BigDataKid