Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does Redshift distribute by DISTKEY sequentially?

I have a Redshift table with page hits, like so

CREATE TABLE hits
(
  user_id INT,
  ts TIMESTAMP,
  page VARCHAR(255)
)
SORTKEY(user_id, ts)
DISTKEY(user_id);

Since I'll be running a bunch of window functions over user_id, I thought it would be a good idea to distribute the table by user_id so nodes don't have to exchange data on users before being able to execute the query.

But the users are only ever active for some time and are numbered sequentially. user_id and time are therefore correlated so whenever I run a query that subsets by time (ts) this will lead to skew if Redshift also distributes by user_id sequentially. This would be less of a problem if it distributed by the DISTKEY randomly. My question is: does it?

(I'm new to Redshift so all of this may just be a total misunderstanding of how things work in general. In that case, apologies in advance!)

like image 716
RoyalTS Avatar asked Sep 18 '25 17:09

RoyalTS


1 Answers

Amazon Redshift uses a hash of the DISTRIBUTION KEY (DISTKEY) to distribute data records amongst nodes.

Thus, records will be distributed differently on a 3-node cluster than a 4-node cluster.

If you are seeking evenly-distributed data, use the EVEN distribution method, which simply spreads records evenly between nodes. (However, this is unlikely to be optimal for your use-case.)

See documentation:

  • Choose the Best Distribution Style
  • Amazon Redshift Best Practices for Designing Tables
like image 172
John Rotenstein Avatar answered Sep 21 '25 06:09

John Rotenstein