
First-run of queries are extremely slow

Our Redshift queries are extremely slow during their first execution. Subsequent executions are much faster (e.g., 45 seconds -> 2 seconds). After investigating, query compilation appears to be the culprit. This is a known issue and is even referenced on the AWS Query Planning And Execution Workflow and Factors Affecting Query Performance pages. Amazon itself is quite tight-lipped about how the query cache works (tl;dr it's a magic black box that you shouldn't worry about).

One of the things we tried was increasing the number of nodes. We didn't expect it to solve anything, since query compilation is a single-node operation anyway. It did not solve anything, but it was a fun diversion for a bit.

As noted, this is a known issue. However, wherever it is discussed online, the only takeaway is either "this is just something you have to live with using Redshift" or "here's a super kludgy workaround that only works part of the time because we don't know how the query cache works".

Is there anything we can do to speed up the compilation process or otherwise deal with this? So far, the best solution we've found is "pre-run every query you might expect to run in a given day on a schedule", which is... not great, especially given how little we know about how the query cache works.

Mike G asked Oct 24 '25 04:10



1 Answer

There are three things to consider:

  1. The first run of any query causes it to be "compiled" by Redshift. This can take 2-20 seconds depending on how big the query is. Subsequent executions of the same query reuse the compiled code; changing only the WHERE clause parameters does not trigger a recompile.
  2. Data is marked as "hot" once a query has been run against it, and is cached in Redshift memory. You cannot (reliably) clear this manually in any way except by restarting the cluster.
  3. Redshift also has a results cache. Depending on your Redshift parameters (it is enabled by default), Redshift will quickly return the same result for the exact same query if the underlying data has not changed. If your query includes current_timestamp or similar, that will stop it from being cached. The results cache can be turned off with SET enable_result_cache_for_session TO OFF;.
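To tell these three effects apart when timing a query, it can help to switch off the results cache and then look at which segments were actually compiled. The sketch below only builds the SQL statements to run, in order, within one Redshift session; it does not connect to anything. The helper name compile_check_statements is hypothetical. It relies on the SVL_COMPILE system view, where compile = 1 means the segment was compiled for that run and 0 means compiled code was reused, and on pg_last_query_id() to find the query that just ran.

```python
from typing import List


def compile_check_statements(query: str) -> List[str]:
    """Return SQL statements to run, in order, in a single Redshift session.

    Hypothetical helper: it only builds strings and never talks to a cluster.
    """
    return [
        # Disable the results cache so the query is really executed.
        "SET enable_result_cache_for_session TO OFF;",
        # The query under test, normalized to end with exactly one semicolon.
        query.rstrip().rstrip(";") + ";",
        # One row per segment: compile = 1 means it was compiled on this run,
        # compile = 0 means previously compiled code was reused.
        "SELECT segment, locus, compile, "
        "DATEDIFF(ms, starttime, endtime) AS compile_ms "
        "FROM svl_compile "
        "WHERE query = pg_last_query_id() "
        "ORDER BY segment;",
    ]


if __name__ == "__main__":
    for statement in compile_check_statements("SELECT 1"):
        print(statement)
```

Running the same query twice this way and comparing the compile column should show whether the slow first run is compile time rather than cold data or a results-cache miss.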

Considering your issue, you may need to run some example queries to pre-compile, or redesign your queries (I guess you have some dynamic query building going on that changes the shape of the query a lot). In my experience, more nodes will increase the compile time: compilation happens on the leader node, not the compute nodes, and is made more complex by having more compute nodes to consider.
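One possible way to "redesign" dynamically built queries so their shape changes less often is to pad variable-length IN lists up to a fixed set of sizes, so that lists of 3 and 4 values produce identical SQL text. This is a minimal sketch, not something Redshift documents as a fix: the table events, the column user_id, and both function names are hypothetical, the %s placeholder style assumes a psycopg2-like driver, and whether it actually avoids recompiles on your cluster should be verified against SVL_COMPILE.

```python
from typing import List, Sequence, Tuple


def bucket_size(n: int) -> int:
    """Round n up to the next power of two (1, 2, 4, 8, ...)."""
    size = 1
    while size < n:
        size *= 2
    return size


def build_in_query(
    base_sql: str, column: str, values: Sequence[int]
) -> Tuple[str, List[int]]:
    """Build a parameterized IN query whose text depends only on the bucket size."""
    if not values:
        raise ValueError("values must not be empty")
    padded = list(values)
    # Pad by repeating the last value; duplicates inside IN (...) do not
    # change which rows match.
    padded += [padded[-1]] * (bucket_size(len(padded)) - len(padded))
    placeholders = ", ".join(["%s"] * len(padded))
    return f"{base_sql} WHERE {column} IN ({placeholders})", padded


if __name__ == "__main__":
    sql, params = build_in_query("SELECT * FROM events", "user_id", [1, 2, 3])
    print(sql)
    print(params)
```

With power-of-two buckets, a list of up to 1,000 values can only produce 11 distinct query texts, which also keeps any scheduled warm-up set small.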

Jon Scott answered Oct 26 '25 19:10
