I have a class library that runs inside a Windows service. The library has long-running threads that poll email (these could be broken out into tasks), handle messages, etc., and it works well.
This is part of a product which needs to scale out by adding nodes. I currently define what customers are handled by a single node.
My problem comes when that node goes down or needs maintenance: manual intervention is needed, and data is lost during the downtime. I'd like to come up with a solution that works like load-balanced web servers, where the application can see that a node has gone down and act appropriately.
This is built on C# / .NET and MS SQL Server, and I would like to stick with those technologies.
I realize this may not be as straightforward as my question makes it seem, but I'm looking for any design patterns or best practices that might help me build out a solution.
1) Have each installed Windows service register itself in the database with a unique ID.
2) While your service is alive, send a heartbeat. This heartbeat can be something as simple as an update to a DateTime field of when the service last checked in. You can update a field directly in the database or go through a web service.
3) Create a table that defines a set of tasks and the assigned unique_id of the machine that's performing each task. This can be first come, first served. A machine can pick up any task it chooses, and it gets exclusive rights to that task by registering itself in this table. I prefer this approach to a centralized controller because you never have to worry about tasks not running when the centralized controller goes down.
4) Define a timeout value for the heartbeat. Each of your distributed services will check for tasks that either have not been picked up or whose owner's heartbeat has timed out. The maintenance of the heartbeat for any machine performing a task should not depend on how long the task takes. That is, if task A takes 5 minutes, machineA should still update its heartbeat during those 5 minutes so that machineB does not flag it as having gone down.
5) Depending on how complex your task is, you may need a status column that the worker updates.
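To make the rules in steps 2 through 4 concrete, here is a minimal in-memory sketch in C#. It is not a real implementation: the dictionaries stand in for the database tables, and `TaskRegistry`, `Heartbeat`, `TryClaim`, and `OwnerOf` are hypothetical names chosen for illustration. It also assumes that claiming a task counts as a check-in.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the node and task tables (assumption:
// in production this state lives in SQL Server, not in process memory).
public sealed class TaskRegistry
{
    private readonly object _gate = new object();
    private readonly Dictionary<string, DateTime> _lastHeartbeat = new Dictionary<string, DateTime>();
    private readonly Dictionary<string, string> _taskOwner = new Dictionary<string, string>();
    private readonly TimeSpan _timeout;

    public TaskRegistry(TimeSpan heartbeatTimeout)
    {
        _timeout = heartbeatTimeout;
    }

    // Step 2: record when a node last checked in.
    public void Heartbeat(string nodeId, DateTime utcNow)
    {
        lock (_gate) { _lastHeartbeat[nodeId] = utcNow; }
    }

    // Steps 3 and 4: a node gets the task if nobody owns it, if it already
    // owns it, or if the current owner's heartbeat is older than the timeout.
    public bool TryClaim(string taskId, string nodeId, DateTime utcNow)
    {
        lock (_gate)
        {
            string owner;
            bool owned = _taskOwner.TryGetValue(taskId, out owner);
            if (owned && owner != nodeId && IsAlive(owner, utcNow))
            {
                return false; // another live node holds the task
            }
            _taskOwner[taskId] = nodeId;
            _lastHeartbeat[nodeId] = utcNow; // assumption: a claim is also a check-in
            return true;
        }
    }

    // Returns the current owner of a task, or null if it is unclaimed.
    public string OwnerOf(string taskId)
    {
        lock (_gate)
        {
            string owner;
            return _taskOwner.TryGetValue(taskId, out owner) ? owner : null;
        }
    }

    private bool IsAlive(string nodeId, DateTime utcNow)
    {
        DateTime last;
        return _lastHeartbeat.TryGetValue(nodeId, out last) && utcNow - last <= _timeout;
    }
}
```

The `lock` here plays the role that the database has to play across machines. Against SQL Server, the check and the ownership change would need to happen in one atomic statement or transaction, so that two nodes that both see a timed-out owner cannot both win the task.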