Chaos engineering best practice

Question

I have studied the Principles of Chaos and looked at some open-source projects, such as chaosblade, which is open-sourced by Alibaba, and Mangle, by VMware.

Both of these are fault injection tools, and neither does any analysis of the system under test.

According to the principles of chaos, we should

1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.

2. Hypothesize that this steady state will continue in both the control group and the experimental group.

3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.

4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

So how do we do step 4? Should we use a monitoring system to track some major metrics and check the status of the system after fault injection?

Are there any good suggestions or best practices?

Zelldon · Accepted Answer

"So how do we do step 4? Should we use a monitoring system to track some major metrics and check the status of the system after fault injection?"

As always, the answer is: it depends. It depends on how you want to measure your hypothesis, on the hypothesis itself, and on the system. But it normally makes sense to introduce metrics to improve observability.

If your hypothesis is something like "Our service can process 120 requests per second, even if one node fails", then yes, you could verify it via metrics, but you could also measure it directly from the requests you send and the responses you receive. It is up to you.
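As a rough illustration of the request-based variant, the sketch below checks the 120 requests per second hypothesis from the results of two load runs, one without the fault (control) and one with it (experiment). The `RunResult` structure, the helper names, and the numbers in the usage lines are hypothetical and made up for this example; in practice the counts would come from your load generator or your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one load run (hypothetical structure for this sketch)."""
    successful_responses: int  # responses that came back successfully
    duration_s: float          # length of the run in seconds

    @property
    def throughput(self) -> float:
        """Successful responses per second over the whole run."""
        return self.successful_responses / self.duration_s


def hypothesis_holds(experiment: RunResult, required_rps: float = 120.0) -> bool:
    """True if the run with the injected fault still meets the required rate."""
    return experiment.throughput >= required_rps


def relative_drop(control: RunResult, experiment: RunResult) -> float:
    """Fraction of throughput lost in the experiment compared to the control."""
    return (control.throughput - experiment.throughput) / control.throughput


# Illustrative, made-up numbers: 60-second runs with and without a node failure.
control = RunResult(successful_responses=7500, duration_s=60.0)
experiment = RunResult(successful_responses=7320, duration_s=60.0)

print("hypothesis holds:", hypothesis_holds(experiment))
print("relative drop:", round(relative_drop(control, experiment), 3))
```

Comparing against the control run as well as the fixed threshold follows step 4 of the principles: the threshold tells you whether the steady state still holds, and the drop tells you how much the fault cost.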

But if your hypothesis is "I get a response for a request that was sent before a node goes down", then it makes more sense to verify this directly with the requests and responses.
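A minimal sketch of that direct check, assuming the test client logs when each request was sent and when (if ever) its response arrived. The `RequestRecord` structure, the function name, and the sample timestamps are hypothetical; the idea is only to list requests sent before the fault that never got an answer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestRecord:
    """One request as logged by the test client (hypothetical structure)."""
    request_id: str
    sent_at: float                # seconds since the start of the experiment
    response_at: Optional[float]  # None if no response ever arrived


def unanswered_before_fault(records: List[RequestRecord], fault_at: float) -> List[str]:
    """Return the ids of requests sent before the fault that got no response."""
    return [
        r.request_id
        for r in records
        if r.sent_at < fault_at and r.response_at is None
    ]


# Illustrative, made-up log: the node is taken down at t = 5.0 seconds.
records = [
    RequestRecord("req-1", sent_at=1.0, response_at=1.2),
    RequestRecord("req-2", sent_at=4.9, response_at=7.5),   # answered after failover
    RequestRecord("req-3", sent_at=4.95, response_at=None),  # lost
    RequestRecord("req-4", sent_at=6.0, response_at=None),   # sent after the fault
]

lost = unanswered_before_fault(records, fault_at=5.0)
print("hypothesis holds:", not lost, "| unanswered:", lost)
```

An empty list means the hypothesis survived the experiment; any id in the list is a concrete counterexample you can trace through the system.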

In our project we use, for example, chaostoolkit, which lets you specify the hypothesis in JSON or YAML, together with the related actions used to test it.

So you can say: I have a steady state X, and if I do Y, then steady state X should still hold. The toolkit is also able to verify metrics if you want to.
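To give a feel for the "steady state X, action Y" shape, here is a small sketch that builds such an experiment as a Python dictionary and serializes it to JSON in the layout chaostoolkit experiments use (a steady-state hypothesis with probes, a method with actions, and rollbacks). The service URL, the pod name, and the `kubectl` command are placeholders invented for this example, not part of the original answer, and your real experiment will look different.

```python
import json

# Steady state X: the health endpoint answers with HTTP 200.
# Action Y: kill one node (here a hypothetical pod, via a process action).
experiment = {
    "version": "1.0.0",
    "title": "Service stays available when one node fails",
    "description": "Hypothetical example: kill one pod and re-check health.",
    "steady-state-hypothesis": {
        "title": "Service responds with 200",
        "probes": [
            {
                "type": "probe",
                "name": "service-is-healthy",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health",  # placeholder URL
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "kill-one-node",
            "provider": {
                "type": "process",
                "path": "kubectl",  # placeholder command
                "arguments": "delete pod my-service-0",
            },
            "pauses": {"after": 10},  # give the system time to react
        }
    ],
    "rollbacks": [],
}

experiment_json = json.dumps(experiment, indent=2)
print(experiment_json)
```

Saved as `experiment.json`, a file like this would be run with `chaos run experiment.json`. The toolkit checks the hypothesis before the method and again after it, which is what lets it report whether the steady state survived the fault.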

Tags:

chaos

NingLee
