Chaos Engineering is often either used as a buzzword to attract VC funding or dreaded as a headache by operators and developers: "We create enough chaos just by releasing software!" But if we understand the role Chaos Engineering plays within Resilience Engineering and take a methodical approach to injecting faults into our systems, our insights will be more actionable, we will build confidence in the system's overall resilience, and our teams will be less disrupted. In this talk I'll cover Resilience & Reliability and Resilience & Chaos. I'll focus on the problem, on minimizing the blast radius while maximizing insight, on monitoring best practices for chaos tests, and on building enough confidence to test an entire system outage.