I have a simple Cloud Endpoints RESTful API that does simple things: add an entity, update an entity, delete an entity, search for an entity.
My question is how much traffic can one Google App Engine instance handle? That is, how many API requests before you need another instance?
I know there are different instance classes, so let's just use the default B4 one (memory: 512 MB, CPU limit: 2.4 GHz).
And I also know this may be a difficult question to answer but given the simple API I described above, could anyone enlighten me on what might be the average number of requests one instance could handle (let's just assume I am not using memcache or any other optimizations)?
Any links to specific documentation would also help greatly, as I am a little confused.
Thank you!
GAE spawns additional service instances automatically only if the service is configured for automatic or basic scaling, not for manual scaling.
From Scaling dynamic instances:
The App Engine scheduler decides whether to serve each new request with an existing instance (either one that is idle or accepts concurrent requests), put the request in a pending request queue, or start a new instance for that request. The decision takes into account the number of available instances, how quickly your application has been serving requests (its latency), and how long it takes to spin up a new instance.
Each instance has its own queue for incoming requests. App Engine monitors the number of requests waiting in each instance's queue. If App Engine detects that queues for an application are getting too long due to increased load, it automatically creates a new instance of the application to handle that load.
The actual behaviour also depends on the respective scaling mode configuration parameters; see Change auto scaling performance settings and Scaling elements. It also depends, of course, on how exactly your app code responds to these requests. It is very difficult, if not impossible, to get an exact number.
But what you can do is to actually attempt to measure it: have a test program access your app with the typical kind of requests and with gradually increasing request load, while you watch, in 2 separate browser windows:
You could also check your app's request logs to see how long their handling takes. For some of them you can even see appstats-like tracing in Stackdriver. You can also enable appstats to obtain such figures for all your requests.
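The gradually increasing load described above could be driven by a small script along these lines. This is only a minimal sketch: the function names (fetch, ramp), the step length and the URL in the usage comment are all hypothetical, and it sends requests one at a time, so a real test would need concurrent workers or a dedicated load-testing tool to reach higher rates.

```python
import time
import urllib.request


def fetch(url, timeout=10):
    """Issue one GET request and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


def ramp(send, rates, step_seconds=10.0, sleep=time.sleep):
    """Call send() at each target rate (requests/second), one step per rate.

    send must return the latency of one request in seconds.
    Returns a list of (rate, request_count, mean_latency_seconds) tuples.
    Requests are sent serially, so the achieved rate cannot exceed 1 / latency.
    """
    results = []
    for rate in rates:
        interval = 1.0 / rate
        count = max(1, int(rate * step_seconds))
        latencies = []
        for _ in range(count):
            elapsed = send()
            latencies.append(elapsed)
            # Pace the requests: wait out the rest of the interval, if any.
            sleep(max(0.0, interval - elapsed))
        results.append((rate, count, sum(latencies) / len(latencies)))
    return results


# Example usage (hypothetical URL, replace with a typical request of your API):
# url = "https://your-app-id.appspot.com/your/api/path"
# for rate, count, mean in ramp(lambda: fetch(url), rates=[1, 2, 5, 10]):
#     print(rate, count, round(mean, 3))
```

While it runs, you would watch for the load step at which a new instance appears; the mean latency per step gives a rough cross-check against what the request logs report.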
From these figures you can try to derive some minimum performance values. This time the assumption is that "instance can handle" means the instance can process the requests fast enough to prevent its request queue depth from constantly growing until the instance is killed (which I suspect would be a load level a lot higher than the level triggering dynamic spawning of new instances).
For example, handling one type of request in my app takes less than 50 ms most of the time on an F1 instance. Since I have threadsafe: true configured, it's possible that the handling of some requests may overlap (how much, I have absolutely no idea). So I can estimate that the F1 instance could handle upwards of 72000 requests of that type per hour (3600 s / 0.05 s per request, ignoring any overlap). But I also have requests taking 1 s on average; the same instance would only be able to handle with certainty some 3600 such requests per hour. As you can see, a single ballpark value doesn't really make a lot of sense.
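The arithmetic behind those two figures can be written out as a short sketch. The function name and the concurrency parameter are hypothetical; concurrency=1 models strictly serial handling, and any larger value would be a guess at the overlap from threadsafe: true, not a measured number.

```python
def requests_per_hour(latency_seconds, concurrency=1):
    """Rough estimate of requests one instance can complete per hour.

    latency_seconds: typical handling time of one request.
    concurrency: assumed number of requests handled in parallel
                 (1 = no overlap, the conservative case).
    """
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return int(round(3600.0 / latency_seconds * concurrency))


fast = requests_per_hour(0.050)  # the ~50 ms request type
slow = requests_per_hour(1.0)    # the ~1 s request type
print(fast, slow)  # 72000 3600
```

The factor of 20 between the two results is the point: the estimate depends almost entirely on the request mix.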
That's why IMHO the dashboard figures are better than estimating, since they are measurements averaged across the real range/spread of your app's request types and their actual handling. The multithreading gain would be included, for example.