By Keet Malin Sugathadasa

Chaos Engineering - The Art of Breaking Things in Production


The software systems we see today have grown into complex, inherently distributed systems that depend on many other platforms in the industry. Even a simple software system will typically consist of a handful of microservices, a cloud infrastructure, and perhaps a mobile setup. Most of these systems rely heavily on cloud service providers like AWS, Google Cloud, and Azure, which become core dependencies without which the system cannot survive...


Whenever you develop a software platform, how confident are you about your system? What if your cloud provider goes down for 8 hours? What if your system load increases tenfold? You never know until it actually happens in your production environment.


As your system grows, with SLAs in place and customers paying licensing fees, those customers expect your software platform to be uninterrupted and available so their business can continue without disruption. To provide an uninterrupted service, you need to prepare for any kind of chaos that can happen in production...


This is where Chaos Engineering comes into practice.


Chaos Engineering is the Art of breaking things in Production

Site Reliability Engineering (SRE) plays a vital role in Chaos Engineering: it is all about ensuring the reliability of the site, even if half of the production system goes down. Sounds unrealistic? Well, this article will give an introduction to Chaos Engineering and how it should be practiced in your organisation to build more resilient systems for your customers.


In this article, I would like to talk about the following topics.


  1. What is Chaos Engineering

  2. How is Chaos Engineering Different from Testing Procedures

  3. What is Chaos Monkey

  4. Principles of Chaos Engineering

  5. Why Do Chaos Engineering?

  6. What Companies Are Doing This

  7. Challenges Faced in Chaos Engineering

  8. Do you really need Chaos Engineering?


Well, shall we begin then...

 

What is Chaos Engineering


Chaos Engineering is the discipline of experimenting on a system, in order to build confidence in the system’s capability to withstand turbulent conditions in production.

If you've ever run distributed systems in production, you know very well that something is bound to go wrong. These systems depend on so many other components, and those interactions are necessary for the system to survive and function at its best. The number of ways your system can go down is enormous: a network failure, an IdP failure, unstable pods, a surge in user traffic, and many more.

When these incidents start occurring, they degrade performance, trigger outages, and worse. That is why it is important to identify such issues beforehand and prepare for them, to prevent future outages from happening.

Moreover, most of these platforms have Service Level Agreements (SLAs) with their users, promising uninterrupted service uptime. Violating SLAs is not just about credit discounts; it is about your reliability and competitiveness in the industry. Furthermore, whether it is bound to a legal document or not, certain performance drops or outages can cause serious losses for an organisation.

Chaos Engineering is the practice of simulating these outages in the production environment, bringing systemic weaknesses to light. It is a form of experimentation that verifies your system can withstand turbulent situations when they occur. Chaos Engineering is an empirical process in which verification leads to more resilient systems and builds confidence in the operational behavior of those systems. Experiments can range from killing a few services to disconnecting an entire cloud datacenter.


We learn about the behavior of a distributed system by observing it during a controlled experiment. We call this Chaos Engineering.


Chaos doesn't cause problems, it reveals them

As Site Reliability Engineers (SREs), we want to have confidence that our systems are resilient enough to withstand any chaotic situation. With Chaos Engineering, you can address those weaknesses proactively, going beyond the reactive processes that currently dominate most incident response models.


In a nutshell, Chaos Engineering is....


  • Controlled and planned Chaos Engineering experiments

  • Preparing for unpredictable failure

  • Preparing Engineers for failure

  • Preparing for Game Day

  • A way to improve SLAs by Fortifying Systems


What Chaos Engineering is NOT


These are common misconceptions, and I want to point them out here. The following are NOT Chaos Engineering practices...


  • Random Chaos Engineering Experiments

  • Unsupervised Chaos Engineering Experiments

  • Unexpected Chaos Engineering Experiments

  • Breaking production by Accident


 

How is Chaos Engineering Different from Testing Procedures


Chaos Engineering is an experimental procedure. There is a fine distinction between testing and experimentation.


In Testing, an assertion is made: given specific conditions, a system will emit a specific output based on the given specifications. Tests are typically binary, and determine whether a property is true or false. Strictly speaking, this does not generate new knowledge about the system, it just assigns valence to a known property of it.


Experimentation, on the other hand, generates new knowledge and often suggests new avenues of exploration. Chaos Engineering uses experiments to uncover behavior you did not know about: for example, injecting communication failures between services can reveal complex behavioral defects that no predefined test would have exercised.

Understanding this distinction is important, because some engineers might say they are already confident about their product or system thanks to proper unit testing and integration tests. This is true; no argument about that. Testing is the first phase of building confidence in your system. But it is not enough...
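
To make the distinction concrete, here is a small illustrative sketch (the function names and numbers are invented for this example, not taken from any real system): a test asserts a known property and yields a binary result, while an experiment observes behavior under injected turbulence and produces new knowledge.

```python
import random
import time

# Testing asserts a known property of a known function; the outcome is binary.
def add_vat(amount: float, rate: float = 0.2) -> float:
    return round(amount * (1 + rate), 2)

def test_add_vat():
    assert add_vat(100.0) == 120.0        # either the assertion holds or it doesn't

# Experimentation asks an open question about behavior under turbulence.
# Here the turbulence is simulated latency; the result is observed, not asserted upfront.
def flaky_dependency() -> float:
    delay = random.uniform(0.0, 0.3)      # simulated network delay in seconds
    time.sleep(delay)
    return delay

def observe_worst_latency(samples: int = 20) -> float:
    return max(flaky_dependency() for _ in range(samples))   # new knowledge, gained by observation
```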


Resilience is about resisting shocks and staying the same, but that is only part of Chaos Engineering. The other half is about exposing the weak points and building a more confident system on top of what you learn.


 

What is Chaos Monkey


I also want to give a brief introduction to Chaos Monkey, which is famous in its own right and illustrates the history behind what Chaos Engineering is really about. Chaos Monkey is a tool created in 2011 by Netflix to test the resilience of its IT infrastructure. It works by intentionally disabling computers in Netflix's production network to test how the remaining systems respond to the outage.

The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez:

Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.

Netflix went on to build an entire army of monkeys to simulate chaotic situations in the production environment; this is called the Simian Army. Some famous monkeys are...

  • Chaos Kong

  • Chaos Gorilla

  • Latency Monkey

  • Doctor Monkey

  • and more...


 

Principles of Chaos Engineering

In this section, I would like to describe the advanced principles of Chaos Engineering and how it can be practiced in your organisation. Always think of Chaos Engineering as an empirical approach in which you explore the weak points of your software system. There are five main principles.


  1. Build a Hypothesis around Steady State Behavior

  2. Vary Real-world Events

  3. Run Experiments in Production

  4. Automate Experiments to Run Continuously

  5. Minimize Blast Radius


The entire story of Chaos Engineering is wrapped around these five principles.



Let's have a look at each of these in detail.


Principle 1: Build a Hypothesis around Steady State Behavior


This section can be broken down into two sections. It is important to identify the "Steady State" of your system and "how to build a hypothesis" around it.


What is Steady State?


Steady state is a measurable characterisation of your system's normal behavior, much like vital signs for a human: we call a person healthy when certain measurements fall within normal ranges. For a software system, these measurements are things like overall throughput, error rates, and latency percentiles. Formulate these numbers into a state and say: our system is steady when the metrics stay within this range. An example steady-state definition is given below, and a minimal check against these thresholds is sketched after the list...


  • 5xx Error rate below 5%

  • p90 latency is below 500 ms

  • Ops per second is above 10,000
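
As a minimal sketch (the metric names, thresholds, and snapshot values below are illustrative, not prescribed by any particular tool), such a steady-state definition can be encoded as a simple set of thresholds that an experiment checks before, during, and after a fault is injected:

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate_5xx: float    # fraction of requests returning 5xx
    p90_latency_ms: float
    ops_per_second: float

def within_steady_state(m: Metrics) -> bool:
    """True when the system satisfies its steady-state definition."""
    return (
        m.error_rate_5xx < 0.05        # 5xx error rate below 5%
        and m.p90_latency_ms < 500     # p90 latency below 500 ms
        and m.ops_per_second > 10_000  # ops per second above 10,000
    )

# Illustrative snapshot, as if pulled from your monitoring system
snapshot = Metrics(error_rate_5xx=0.012, p90_latency_ms=340, ops_per_second=14_200)
assert within_steady_state(snapshot)
```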


Build the Hypotheses


Now that the steady state is finalised, you can simply build multiple hypotheses around it. Think of these as the "what if" questions.


  • What if the load balancer breaks

  • What if the cluster goes down

  • What if the auth server breaks

  • What if Redis becomes slow

  • What if latency increases by 300ms

  • etc

Think of things that can possibly go wrong in the production environment. But always make sure of the following...


Don't build a hypothesis around something you already know will break you...

Why? Because if you already know it will break you, you can simply fix it or consciously accept the risk; you don't really need an experiment to prove it. Chaos Engineering experiments can be expensive and catastrophic, so always use them to identify unknown vulnerabilities in your system.
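
One way to phrase such a hypothesis in code, sketched here under the assumption that you already have some fault-injection mechanism and a steady-state check like the one above (all function names are placeholders), is: the steady state should continue to hold while the fault is active.

```python
# Hypothesis: "While the injected fault is active (e.g. +300 ms latency on the
# auth server), the system stays within its steady state."
# The four callables are placeholders for whatever fault-injection and
# monitoring mechanisms you actually use.

def run_hypothesis(inject_fault, remove_fault, read_metrics, check_steady_state) -> bool:
    if not check_steady_state(read_metrics()):
        raise RuntimeError("System is not steady to begin with; abort the experiment.")
    inject_fault()
    try:
        return check_steady_state(read_metrics())   # hypothesis confirmed or refuted
    finally:
        remove_fault()                               # always roll the fault back
```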


Principle 2: Vary Real-world Events


Always consider events that are plausible and real. This decision can come with years of experience in the industry, where certain events seem realistic and some are not. Prioritize events either by potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.


Some example events are as follows...


  • Hardware failures

  • Functional bugs

  • State transmission errors (e.g., inconsistency of states between sender and receiver nodes)

  • Network latency and partition (a latency-injection sketch follows this list)

  • Large fluctuations in input (up or down) and retry storms

  • Resource exhaustion

  • Unusual or unpredictable combinations of inter-service communication

  • Byzantine failures (e.g., a node believing it has the most current data when it actually does not)

  • Race conditions

  • Downstream dependencies malfunction
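
Many of these events can be simulated with standard tooling. For example, the network latency item above is often injected at the host level with Linux tc/netem; a rough sketch follows (the interface name and delay value are illustrative, and the commands require root privileges):

```python
import subprocess

IFACE = "eth0"     # illustrative; use the interface your service traffic crosses
DELAY = "300ms"

def add_latency() -> None:
    # Delay all egress traffic on IFACE using the netem queueing discipline.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", DELAY],
        check=True,
    )

def remove_latency() -> None:
    # Remove the netem qdisc, restoring normal network behavior.
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)
```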


Principle 3: Run Experiments in Production


Many of the software systems we see today go through different environments and different kinds of tests before they actually reach production, and each of those environments behaves differently from the actual production environment. If you want to see what users actually experience, the production environment is your best choice. To guarantee both the authenticity of the way in which the system is exercised and relevance to the currently deployed system, Chaos Engineering strongly prefers to experiment directly on production traffic.


But you might ask: why are we trying to break the production environment? Isn't it risky to perform a chaotic experiment in production? That is true. But you can never fully replicate production in a different environment, and Chaos Engineering wants to capture the loopholes of the production environment itself. Hence, it is important that experiments are performed in production. Don't worry! They are run as controlled, carefully supervised experiments.


Examples of inputs for chaos experiments:


  • Simulating the failure of an entire region or datacenter.

  • Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.

  • Injecting latency between services for a select percentage of traffic over a predetermined period of time.

  • Function-based chaos (runtime injection): randomly causing functions to throw exceptions (see the sketch after this list).

  • Code insertion: Adding instructions to the target program and allowing fault injection to occur prior to certain instructions.

  • Time travel: forcing system clocks out of sync with each other.

  • Executing a routine in driver code emulating I/O errors.

  • Maxing out CPU cores on an Elasticsearch cluster.
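
As a minimal illustration of the function-based chaos item above, a decorator can randomly raise exceptions at runtime (the failure rate and exception type here are arbitrary choices for the example):

```python
import functools
import random

def chaos(failure_rate: float = 0.05, exc=RuntimeError):
    """Randomly raise an exception instead of calling the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exc(f"chaos: injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@chaos(failure_rate=0.1)
def fetch_recommendations(user_id: str) -> list:
    return ["item-1", "item-2"]   # stand-in for a real downstream call
```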


When running experiments in production, it is recommended to scope them to canary deployments; you can run the experiment against the canary that receives the lowest user traffic.



Principle 4: Automate Experiments to Run Continuously


The practice of Chaos Engineering is a long-running and labour-intensive process, so it is important to automate it to avoid engineer burnout. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.


With each experiment, gather important metrics, perform the relevant calculations, and persist the information in a suitable location. Some example metrics collected from an experiment are as follows (these can also be considered results of an experiment).


  • Time to detect

  • Time for Notification and Escalation

  • Time to public notification

  • Time for graceful degradation to kick in

  • Time for self-healing to happen

  • Time to recovery - partial or full

  • Time to all clear and stable
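
A minimal sketch of continuous automation (the schedule, the experiment runner, and the result fields are illustrative assumptions, not a specific tool's API) is a loop that runs each experiment, records timings like those above, and persists the outcome for later analysis:

```python
import json
import time
from datetime import datetime, timezone

def run_scheduled_experiments(experiments, run_experiment,
                              interval_seconds=24 * 3600,
                              results_path="chaos_results.jsonl"):
    """Run every experiment on a fixed schedule and persist each outcome."""
    while True:
        for experiment in experiments:
            started = time.monotonic()
            outcome = run_experiment(experiment)   # expected to return a dict of observations
            record = {
                "experiment": experiment["name"],
                "run_at": datetime.now(timezone.utc).isoformat(),
                "duration_s": round(time.monotonic() - started, 1),
                # Illustrative fields matching the metrics listed above:
                "time_to_detect_s": outcome.get("time_to_detect_s"),
                "time_to_recover_s": outcome.get("time_to_recover_s"),
                "steady_state_held": outcome.get("steady_state_held"),
            }
            with open(results_path, "a") as f:
                f.write(json.dumps(record) + "\n")
        time.sleep(interval_seconds)
```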


Principle 5: Minimize Blast Radius


Trust me, the last thing you want from Chaos Engineering is to cause actual chaos on your production platform. Even when performing these experiments, it is possible that certain customers will feel the degradation of the platform. It is the responsibility and obligation of the Chaos Engineer to ensure the fallout from experiments is minimized and contained.


When you perform a Chaos Engineering experiment, always remember to capture measures like the following (this is to ensure that the blast radius is contained and well understood):


  • Who is impacted

  • How many workloads

  • What functionality

  • How many locations

  • And more
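
As a rough sketch of containing the blast radius (the target fraction, check interval, and helper functions are illustrative assumptions): scope the experiment to a small, explicit subset of hosts and abort automatically the moment the steady state is violated.

```python
import random
import time

def pick_targets(hosts, fraction=0.05, max_hosts=3):
    """Limit the experiment to a small, explicit subset of hosts."""
    k = min(max_hosts, max(1, int(len(hosts) * fraction)))
    return random.sample(hosts, k)

def run_with_guardrail(targets, inject_fault, remove_fault, steady_state_ok,
                       checks=10, pause_s=30):
    """Inject the fault on the chosen targets; abort as soon as steady state is violated."""
    for host in targets:
        inject_fault(host)
    try:
        for _ in range(checks):
            if not steady_state_ok():
                return False              # guardrail tripped: abort the experiment
            time.sleep(pause_s)
        return True                       # steady state held for the whole observation window
    finally:
        for host in targets:
            remove_fault(host)            # always clean up, even when aborting
```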



 

Why Do Chaos Engineering?


This is a challenging question to answer. But when you look at your software system as an architect or an engineering manager, you should be able to determine why Chaos Engineering is required for your organisation. I would like to point out some obvious reasons related to the architecture of any software system:


  • Systems need to scale fast and smoothly

  • Microservice architecture is tricky

  • Services will fail

  • Dependencies on other companies will fail

  • Reduce the amount of outages and downtime (lose less money)

  • Prepare for real world scenarios

  • Attackers trying to perform DDoS attacks


Beyond these, there are other reasons: Chaos Engineering helps the engineers in your organisation become stronger and more confident in what they do, whether they are on-call engineers or the engineers doing product development.


  • Train on-call engineers to be prepared for different kinds of outages

  • Train development engineers to build more resilient systems

  • Train engineering architects to make solid and reliable decisions


This can also help your company's sales team come up with stronger SLAs and pitch how confident you are in your products.


 

What Companies Are Doing This...


Netflix may have started it, but this specialisation has since spread across industries all over the world. Chaos Engineering is practiced in industries ranging from finance to e-commerce to aviation and beyond. Some of the well-known software engineering companies that regularly practice Chaos Engineering are listed below.


  • Netflix

  • Amazon

  • Dropbox

  • Uber

  • Slack

  • Twilio

  • Facebook

  • And many more!


Have a look at the industry's adoption of Chaos Engineering and some of the personalities behind these initiatives. You can also learn more from the Chaos Engineering Community and Chaos Conf.


 

Challenges Faced in Chaos Engineering


  • No time or flexibility to simulate disasters

  • Teams will always be spending their time fixing things, and building new features

  • This can be very political inside the organization

  • Cost involved in fixing and simulating disasters

  • And many more company matters that build up resistance


 

Do you really need Chaos Engineering?


A simple answer would be YES. But if you actually think about it, some companies don't really need Chaos Engineering, and for them it would be an additional engineering cost they cannot bear. Let me break down the factors to think about when making this decision; there could be more factors than the ones mentioned below...


Does your product have an SLA with its users?


If the answer is yes, then it would be ideal to practice Chaos Engineering to ensure that you provide the agreed availability for your product. Then again, if your customer base is still small and you can tolerate some downtime, this can be scheduled a bit later in the roadmap.


Do you have strong competitors in the market?


If you have strong competitors in the market, Chaos Engineering becomes essential to ensuring the reliability and resiliency of your product. It is also a good selling point for your sales team when taking your product to market.


How Big is your customer base?


If your customer base is huge and growing, then your system will also have to scale and become as distributed as necessary to provide high availability. Practicing Chaos Engineering shows you how your system reacts to growing user demand and how to refine the architecture to meet it.


Do you have an architecture that is highly performant, distributed, and/or fault tolerant?


If so, it is very important to ensure that your system is strongly resilient to unexpected chaotic situations. Chaos Engineering is a must in this case, to fortify your system for its best performance.


 
