We're developing an online game where players communicate with the server using a persistent TCP connection. Persistent as in, its lifetime is that of a player's session, and if the connection is closed, the player is thrown from the game (though the client will attempt to automatically reconnect).
Now, of course everything works fine in our office (connecting to both testing and live servers), but our client reports that some players get disconnected a lot (every few seconds), and that they experience it themselves too (though their offices are in the same building).
How can I find out the cause of these disconnects? Is it because:
The software is written in Java. It logs when players are disconnected, and if it actively kicks them (e.g. for not sending keep-alive messages) it logs that too.
There are many other online games like ours. How do they deal with this? (Unless the problem is in the server/datacenter, then the solution is obvious)
I would ask players to allow you to enable "anonymous usage data", like many apps do, to periodically upload debugging information from their sessions back to you. This is how you figure out these sorts of situations.
From there, what you'll need when a disconnect happens is a pretty verbose log. Catch whatever exception was thrown, and don't forget to log its cause as well: keep calling .getCause() until you've logged all the way back to the root cause. Also log any data you need to match up the client log with the server-side logs. Information you'll likely need includes session IDs, game IDs, timestamps, etc. Just think, "What information would I need in order to troubleshoot this, assuming I had insight into both sides of the connection?" That is what you'll ultimately get by asking users to upload usage and debugging data.
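As a rough illustration of walking the cause chain, here is a minimal sketch. The class name `DisconnectLog`, the `describe` helper, and the log line format are made up for this example and not taken from your code; swap in your own logger and identifiers.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical helper: names and log format are assumptions for illustration.
class DisconnectLog {

    // Follows getCause() down to the root cause. The identity set guards
    // against a cyclic cause chain, which would otherwise loop forever.
    static List<String> causeChain(Throwable t) {
        List<String> lines = new ArrayList<>();
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
            lines.add(cur.getClass().getName() + ": " + cur.getMessage());
        }
        return lines;
    }

    // Builds one log line carrying the fields needed to correlate the
    // client-side record with the server-side logs.
    static String describe(String sessionId, long timestampMillis, Throwable t) {
        return "disconnect session=" + sessionId
                + " ts=" + timestampMillis
                + " causes=" + String.join(" <- ", causeChain(t));
    }

    public static void main(String[] args) {
        Throwable root = new SocketException("Connection reset");
        Throwable wrapped = new IOException("read failed", root);
        System.out.println(describe("session-123", System.currentTimeMillis(), wrapped));
    }
}
```

If you log the same session ID on the server when it sees the connection drop, you can line up both sides of each disconnect by session and timestamp.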
From there you should be able to identify at least a few situations that you have control over, that is, where you can change your client/server code to alleviate some of the problems. In other cases, where the problem is a client's configuration or faulty equipment (or a piece of equipment in between that neither of you controls), you'll have to rely on robust reconnection.
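For the reconnection side, one common approach (my suggestion here, not something your client necessarily does today) is to retry with an exponentially growing, capped delay so a flaky link doesn't turn into a reconnect storm. A minimal sketch, where the class name `Reconnector`, the 500 ms base delay, the 30 s cap, and the 5 s connect timeout are all arbitrary assumptions:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: names and timing values are assumptions for illustration.
class Reconnector {

    // Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.min(Math.max(attempt, 0), 20); // bound the shift to avoid overflow
        return Math.min(maxMillis, baseMillis * (1L << shift));
    }

    // Tries to connect up to maxAttempts times, sleeping between failed attempts.
    static Socket connectWithRetry(String host, int port, int maxAttempts)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Socket socket = new Socket();
            try {
                socket.connect(new InetSocketAddress(host, port), 5_000);
                return socket;
            } catch (IOException e) {
                socket.close();
                last = e; // keep the failure so it can be logged as the cause
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, 500, 30_000));
                }
            }
        }
        throw new IOException("gave up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        // Prints the delay schedule only; no network access is attempted here.
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + " -> wait "
                    + backoffMillis(attempt, 500, 30_000) + " ms");
        }
    }
}
```

In a real client you would probably add some random jitter to the delay so that many players dropped at the same moment don't all reconnect in lockstep.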
You'll never reduce disconnects to zero, but once you've seen enough cases, this information should help you narrow the remaining disconnects down to situations that are outside of your control. That is where your power to shape the network ends, and you'll be as close to a "best case scenario" for network reliability as you can get.