
Implementing Chaos Engineering on AWS

Chaos Engineering is an approach to planning for the unexpected. The practice, which is still not widely adopted, assesses a workload’s ability to withstand adverse conditions by simulating real-life failure scenarios in a controlled manner.

Demystifying the Idea of “Breaking Production”

Many associate Chaos Engineering with uncontrolled interruptions in production. However, this perception is mistaken.

Chaos Engineering is not about “breaking production”, but rather a disciplined approach to simulating failures in safe, controlled environments. Experiments are carefully planned, and the results are analyzed to identify vulnerabilities and implement improvements. This practice helps DevOps teams and SREs to:

  • React: Take effective corrective action in the event of failures.
  • Mitigate: Reduce the impact of unexpected events.
  • Anticipate: Prevent problems before they occur.

Understanding failures, whether in code or infrastructure, is crucial to avoiding catastrophes in production. As Laurent Domb defines it, Chaos Engineering is “a mechanism for revealing the ‘known-unknowns’ (things we are aware of, but don’t fully understand) in our environments or ‘unknown-unknowns’ (things we are not aware of, nor do we fully understand)” (source: Chaos Engineering in the cloud | AWS Architecture Blog).

Advantages of Chaos Engineering

Chaos Engineering offers a series of advantages for the teams and organizations that implement it, helping to build more resilient and reliable systems. Here are some of the main benefits:

  • In-depth understanding of the impact of failures: simulating failures in a controlled environment shows how each component of the system behaves in adverse situations, which helps identify bottlenecks, hidden dependencies and weak points in the architecture.
  • Improved observability: by analyzing system reactions during Chaos Engineering experiments, you identify which metrics are most relevant for monitoring the health of your application and gain valuable insights for improving the monitoring system as a whole.
  • Creating effective contingency plans: based on the results of the experiments, you can develop more robust and efficient contingency plans, preparing the team to deal with different failure scenarios.
  • Increased confidence in the system’s resilience: each experiment the system withstands is evidence that it can handle that failure, reducing the fear of unexpected interruptions and providing greater peace of mind for staff and users.
  • Cost reduction: by identifying and correcting vulnerabilities in advance, you prevent production failures that can lead to financial losses and damage to the company’s reputation.
  • Culture of learning and collaboration: the practice of Chaos Engineering promotes a culture of continuous learning and collaboration between teams, encouraging communication and working together to solve problems.
  • Innovation and continuous improvement: by constantly challenging the system, Chaos Engineering drives innovation and the search for more resilient solutions, stimulating the continuous improvement of architecture and processes.

Chaos Engineering on AWS

When applying Chaos Engineering in cloud environments, it is essential to understand the AWS shared responsibility model:

  • AWS is responsible for the resilience of the cloud infrastructure.
  • You are responsible for the resilience of the resources and services used within the cloud.

A central tool for this practice is AWS FIS, originally launched as AWS Fault Injection Simulator and since renamed AWS Fault Injection Service. It is designed to test how your workloads react to adverse events.

AWS Fault Injection Simulator (FIS)

Figure: AWS Resilience Hub home screen, where AWS FIS can be found. FIS is one of the tools in AWS Resilience Hub, which centralizes actions related to application resilience on AWS.

AWS FIS is a tool based on the principles of Chaos Engineering. It runs experiments against real resources to help you understand and improve the resilience of your workloads.

Because AWS FIS generates real events and applies them to your real resources, plan each run carefully and start in a pre-production environment.

Main Components

To use FIS, you create an experiment template that defines the actions, the targets and the stop conditions, and then run experiments from that template.

  • Action: Defines what AWS FIS will do during the experiment (e.g. degrade CPU performance).
  • Target: Determines the resources that will be affected (e.g. EC2 instances selected by tag).
  • Stop Conditions: Conditions that end the experiment automatically, defined as Amazon CloudWatch alarms.
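The three components map onto fields of an experiment template. The following is a minimal sketch in Python, assuming the template is later submitted through the `create_experiment_template` call of a boto3 `fis` client. The function names, the `chaos-ready=true` tag, and the role and alarm ARNs are hypothetical placeholders, not values from this article.

```python
import uuid


def build_stop_instance_template(role_arn, alarm_arn,
                                 tag_key="chaos-ready", tag_value="true"):
    """Return a request body for an FIS experiment template.

    The ARNs and the tag are placeholders supplied by the caller.
    """
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Stop one tagged EC2 instance for five minutes",
        "roleArn": role_arn,  # IAM role that FIS assumes to run the actions
        # Target: which resources the experiment is allowed to touch.
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # limit the blast radius
            }
        },
        # Action: what FIS does to the target.
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
        # Stop condition: a CloudWatch alarm that halts the experiment.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


def submit_template(fis_client, template):
    """Create the template; fis_client is assumed to be boto3.client("fis")."""
    response = fis_client.create_experiment_template(**template)
    return response["experimentTemplate"]["id"]
```

Keeping the template builder separate from the API call makes it easy to review or unit test the experiment definition before anything touches a real account.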

Examples of Actions in AWS FIS

AWS FIS offers a variety of actions to simulate different types of failures. Here are some examples:

  • EC2 instances: Simulate interruption, loss of connectivity or performance degradation.
  • AWS services: Test failures in Amazon S3, DynamoDB or Kinesis.
  • Application errors: Inject latency, simulate API errors or corrupt data in transit.
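As an illustration of the instance-level faults above, the sketch below builds FIS actions that run AWS-provided SSM documents for CPU stress and network latency. It assumes the `aws:ssm:send-command` action and the `AWSFIS-Run-CPU-Stress` and `AWSFIS-Run-Network-Latency` documents with the parameter names shown; check these against the current FIS actions reference before use. The helper names and default values are hypothetical.

```python
import json


def ssm_fault_action(region, document_name, target_name, minutes, extra_params=None):
    """Build an FIS action that runs an AWS-owned SSM fault document.

    target_name must match a key in the template's "targets" section.
    """
    document_params = {"DurationSeconds": str(minutes * 60)}
    document_params.update(extra_params or {})
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": f"arn:aws:ssm:{region}::document/{document_name}",
            # FIS expects the SSM document parameters as a JSON string.
            "documentParameters": json.dumps(document_params),
            "duration": f"PT{minutes}M",  # ISO 8601 duration
        },
        "targets": {"Instances": target_name},
    }


def cpu_stress_action(region, target_name, minutes=2):
    """Performance degradation: saturate the CPU on the targeted instances."""
    return ssm_fault_action(region, "AWSFIS-Run-CPU-Stress", target_name, minutes)


def network_latency_action(region, target_name, minutes=2, delay_ms=200):
    """Latency injection: add network delay on the targeted instances."""
    return ssm_fault_action(
        region, "AWSFIS-Run-Network-Latency", target_name, minutes,
        extra_params={"DelayMilliseconds": str(delay_ms)},
    )
```

Either dictionary can be placed under the `actions` key of an experiment template. The targeted instances need the SSM Agent running for these actions to take effect.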

Success Metrics

Defining metrics is essential for evaluating the results of experiments. Some important metrics include:

  • Latency: Request response time.
  • Error Rate: Percentage of failed requests.
  • Availability: Percentage of time the system is operational.
  • Throughput: Transactions processed per unit of time.
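To make these definitions concrete, here is a small sketch that computes the four metrics from request samples collected during an experiment. The `Sample` type, the function names and the choice of a nearest-rank 95th percentile for latency are assumptions made for illustration; in practice these values would more likely come from your monitoring system.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float  # response time of one request
    ok: bool           # False if the request failed


def summarize(samples, window_seconds, percentile=95):
    """Summarize latency, error rate and throughput for one time window."""
    if not samples:
        raise ValueError("at least one sample is required")
    latencies = sorted(s.latency_ms for s in samples)
    # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
    rank = max(1, math.ceil(percentile / 100 * len(latencies)))
    failed = sum(1 for s in samples if not s.ok)
    return {
        "latency_ms": latencies[rank - 1],
        "error_rate_pct": 100 * failed / len(samples),
        "throughput_rps": len(samples) / window_seconds,
    }


def availability_pct(uptime_seconds, total_seconds):
    """Share of the observation period in which the system was operational."""
    return 100 * uptime_seconds / total_seconds
```

Comparing a summary taken before the experiment with one taken during it shows how far the fault moved the system away from its normal behavior.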

In the diagram below, you can see a demonstration of the action of AWS FIS.

Figure: Schematic of how AWS Fault Injection Simulator integrates with other AWS services to run experiments. Source: Chaos Engineering in the cloud | AWS Architecture Blog
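The run itself can be sketched as starting an experiment from a template and polling its state until it finishes. This example assumes a boto3 `fis` client and its `start_experiment`, `get_experiment` and `stop_experiment` calls. The function name, the polling limits and the set of terminal states are illustrative assumptions and should be checked against the API reference.

```python
import time
import uuid

# States treated as final in this sketch (an assumption, see the lead-in).
TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment(fis_client, template_id, poll_seconds=10, max_polls=60,
                   sleep=time.sleep):
    """Start an experiment and wait for it to reach a terminal state.

    fis_client is assumed to be boto3.client("fis").
    Returns a (status, reason) tuple.
    """
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]

    for _ in range(max_polls):
        state = fis_client.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATES:
            return state["status"], state.get("reason", "")
        sleep(poll_seconds)

    # Safety net: do not leave a fault running if polling gives up.
    fis_client.stop_experiment(id=experiment_id)
    return "stop-requested", "polling limit reached"
```

A status of `stopped` typically means a stop condition fired or someone halted the run manually, which is itself a useful finding to analyze.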

Implementing Chaos Engineering Gradually

The implementation of Chaos Engineering can be gradual. Start with simple, controlled experiments in test environments. As your team gains experience and confidence, increase the complexity and scope of the experiments, moving on to pre-production environments and, eventually, production.

  1. Start simple: carry out small experiments in test environments.
  2. Increase complexity: expand into pre-production environments as you gain experience.
  3. Monitor and adjust: use the results to improve your infrastructure.
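One way to make this progression explicit is to encode the stages and a rule for when to advance. The sketch below is purely illustrative: the stage list, the percentages and the 1% error-rate threshold are hypothetical values rather than recommendations. The only AWS-specific detail is the `selectionMode` syntax (`COUNT(n)`, `PERCENT(n)`) that FIS targets accept.

```python
# Hypothetical rollout plan: each stage widens the environment and blast radius.
STAGES = [
    {"environment": "test", "selection_mode": "COUNT(1)"},
    {"environment": "pre-production", "selection_mode": "PERCENT(25)"},
    {"environment": "production", "selection_mode": "PERCENT(10)"},
]


def next_stage(current_index, error_rate_pct, max_error_rate_pct=1.0):
    """Return the index of the stage to run next.

    Advance only when the last experiment stayed within the error threshold;
    otherwise stay at the current stage until the findings are fixed.
    """
    if error_rate_pct > max_error_rate_pct:
        return current_index
    return min(current_index + 1, len(STAGES) - 1)
```

For example, `next_stage(0, error_rate_pct=0.4)` moves from the test stage to pre-production, while an error rate above the threshold keeps the team at the current stage.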

Conclusion

Chaos Engineering is a powerful tool for building resilient systems on AWS. With AWS FIS, you can simulate failures in a controlled way, identify vulnerabilities and strengthen your infrastructure. By adopting this practice, you help ensure that your systems are prepared to face the unexpected and offer the best experience for your users.

Remember:

  • Start with simple, controlled experiments.
  • Define clear metrics to measure success.
  • Use AWS FIS to simulate faults and collect data.
  • Analyze the results and implement improvements.
  • Promote a culture of learning and collaboration.

By following these tips, you’ll be well on your way to mastering the art of Chaos Engineering and building more resilient and reliable systems on AWS.

Have you used Chaos Engineering on AWS? Share your experiences and questions in the comments!

Antonio Augusto | DevOps Engineer | AWS Cloud Specialist | DBA | Linux

Passionate about technology and dedicated to building solutions that simplify development and drive projects forward.