Why Chaos Engineering?
Even before we get into the definition of Chaos Engineering or why it has become important, let’s take a look at the traditional approach. Most applications and configurations would be put under stress testing to find out the breaking point. This primarily helped assure the operations team that the provisioned capacity is enough for the anticipated workload. The tests were relatively (if not fairly) simple to do. But with time, a couple of things have changed:
- Systems have become more and more complex
- Workloads can change abruptly, and scaling up and down is a necessity now
Also, there is a philosophical shift happening in the way IT operations used to think –
- Servers are disposable – Earlier, the basic deployment units (in most cases physical or virtual servers) were treated like “pets”, and configuration changes would turn them into snowflakes. Now, with configuration management tools, servers are disposable like “cattle” and can be rebuilt from scratch whenever there is a configuration change, aka Phoenix Servers.
- Failure has been accepted as business as usual; outages have not. I am not trying to force you to accept system failures, but most IT operations teams today acknowledge that things will go wrong. Simply put, one needs to be prepared for it.
- Because of the explosion of the internet, services are not limited by geography anymore. Workloads are not predictable anymore, and they are bound to go beyond the breaking point of a single server; it is just a matter of time and chance.
- The complexity of applications has increased multi-fold. Today, applications are not just three-tier deployments. A single rendered web page might be working with tens or, in some cases, hundreds of microservices in the backend. The only way to test the resiliency of such a system is by injecting random issues on purpose.
All of this has led IT operations leads to be convinced that the best way to be prepared for an outage is to simulate one. If you are not convinced yet, perhaps you want to read a bit about studies of how much loss a business can suffer because of an infrastructure outage.
How do you go about it?
So what should your strategy be? I believe the easiest way is to introduce unit testing and integration testing for infrastructure and architecture components too, just like application code. So, for any kind of High Availability or Disaster Recovery approach you have implemented, you should have a test case. For example, if you have a cluster with 2 nodes, your test case could be to shoot down one of the nodes. Yes, you read that right. I am suggesting that you should take down a node. There is no other way for you to test high availability but to simulate failure. Similarly, you can test scalability by injecting slowness and network congestion.
There are many popular examples of and inspirations for chaos injection. The most popular ones are:
- Generic guidelines are available on Principles of Chaos Engineering
- Netflix’s Chaos Monkey, which does various kinds of chaos injection, e.g. introducing slowness in the network, killing EC2 instances, detaching the network or disks from EC2 instances
- Netflix’s Chaos Kong, which is not open sourced yet but is a nice inspiration and aspiration for anyone embarking on chaos engineering within their enterprise
- Facebook’s Project Storm
Those who practice chaos engineering by trying to break themselves have been rewarded well in times of outages. The best example is how Netflix weathered the storm by preparing for the worst.
How does that translate to the container world?
Today, a lot of new applications and services are being deployed as containers. If you are starting out with Chaos Engineering in Docker, there are many different mechanisms and tools available at your disposal.
Before we get into tools, let’s look at some of the basic features of Docker which should be helpful to you.
1. Docker Service
It is often better to deploy your application as a Swarm service instead of deploying it as a native container. In case you are using Kubernetes, it is better to deploy your workload as a service. Both definitions are declarative and define the desired state of the service. This is really helpful in maintaining the uptime of your application, as the orchestrator would always try to maintain the availability of the service.
Example
In this example, I am going to use a Dockerfile to build a new image and then use it to deploy a new service. The example is executed against a Docker UCP cluster from a client node (with the docker CLI and a UCP Client Bundle).
Set up a Docker build file Dockerfile-nohc:
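The original Dockerfile is not reproduced here. As a rough sketch, assuming an nginx base image serving a static page (both assumptions), Dockerfile-nohc could look something like this, with no health check defined yet:

```
# Hypothetical sketch of Dockerfile-nohc: nginx base image and copied page are assumptions
FROM nginx:alpine
COPY index.html /usr/share/nginx/html/index.html
EXPOSE 80
```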
Build your image:
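A minimal sketch of the build command; the image name and tag are illustrative placeholders, not the ones from the original setup:

```
docker build -t <registry-namespace>/tweet-app:nohc -f Dockerfile-nohc .
```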
Now we need to push your image to a repository (DTR or Docker Hub), so that it is available to all nodes:
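Again with the same illustrative image name:

```
docker push <registry-namespace>/tweet-app:nohc
```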
Now we will start a service using this image:
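A sketch of the service request described below; the service name twet-app is the one that appears in the original output, while the image name remains an illustrative placeholder:

```
docker service create \
  --name twet-app \
  --mode replicated \
  --replicas 2 \
  <registry-namespace>/tweet-app:nohc
```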
This request asks the Swarm cluster to set up the service with `--mode=replicated` and `--replicas=2`, i.e. Swarm would try to maintain two tasks for this service at any point in time, unless requested otherwise by the user. You can inspect the tasks running for the service with the `docker service ps` command:
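For example:

```
docker service ps twet-app
```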
As you can see, there are two tasks running, and these tasks are set up behind a VIP which load-balances between the two containers/tasks.
Let’s try to kill one of the underlying containers and see if Swarm is able to maintain the declarative state we had requested:
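Using the container id from the original walkthrough (run the kill on the node that hosts the container, or through the UCP client bundle):

```
docker container kill 603c7f8940fe
# then watch Swarm reconcile the desired state
docker service ps twet-app
```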
As you can see, the container `603c7f8940fe` was used by one of the tasks of our service `twet-app`, and once we kill the container, Swarm tries to maintain the state by starting another task.
Note: Pushing the image to a repository is needed when you are running a distributed setup. As you can see above, the build was done on one of the nodes of the Swarm cluster (`ip-10-100-2-106`), so the image would be available on that node only. Hence, if we were to run the service without pushing the image to a repository, there is a good chance that the tasks would get started on that same node (`ip-10-100-2-106`), i.e. the only node that has access to the image, or that different nodes would end up with different images (left behind by different image builds). Swarm does a good job of reminding us about this: if you try to run the service without pushing the image, it warns that each node will access the image independently, which can lead to different nodes running different versions of the image.
2. Health Checks
Docker allows you to use health checks to keep a tab on the health of running containers. The health check can either be baked into your image during the build process using the `HEALTHCHECK` instruction in the Dockerfile, or defined at runtime using the `--health-cmd` option with `docker service create` or `docker container run`.
To quote the Docker documentation:
> The `HEALTHCHECK` instruction tells Docker how to test a container to check that it is still working. This can detect cases such as a web server that is stuck in an infinite loop and unable to handle new connections, even though the server process is still running.
Note: The `HEALTHCHECK` feature was added in Docker 1.12.
Build time example of HEALTHCHECK
To make use of this feature, we will now add a new instruction to our Dockerfile:
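Reconstructed from the parameters described below (the health check command path is the one used throughout this post), the instruction would look roughly like this:

```
# Sketch of the HEALTHCHECK instruction matching the parameters described below
HEALTHCHECK --interval=30s --timeout=3s --retries=2 \
  CMD python /usr/share/nginx/html/healthcheck.py
```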
This means that the health check command `python /usr/share/nginx/html/healthcheck.py` will be run for the first time after 30s, i.e. 30 seconds after the task starts up. After that, the health check will be run at an `interval` of every 30s. The health check will `timeout` in 3s, and after 2 failed `retries` the container will be declared unhealthy.
We will have to add a few new files to support `HEALTHCHECK`:
- `healthcheck.py` – our own little piece of code to check the health of the container
- `healthcheck.html` – the page whose content the health check script verifies
Now we will build and push the image:
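Same as before, with an illustrative name and tag for the health-check-enabled build:

```
docker build -t <registry-namespace>/tweet-app:hc .
docker push <registry-namespace>/tweet-app:hc
```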
Now, once we deploy the service, the health status will initially be `starting`, until the first health check is initiated. After the first health check, the health status will become `healthy`.
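A sketch of deploying the health-checked image and watching the status change; the service name and image tag are illustrative:

```
docker service create --name tweet-app --replicas 2 <registry-namespace>/tweet-app:hc
# the STATUS column moves from "(health: starting)" to "(healthy)"
docker ps --filter name=tweet-app --format 'table {{.Names}}\t{{.Status}}'
```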
Testing Healthcheck and self-healing
Now let’s try to force a disruption by connecting to one of the containers and changing the content of healthcheck.html:
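For example (the container id is a placeholder, and the path assumes healthcheck.html sits in the same directory as healthcheck.py):

```
docker exec -it <container-id> sh -c 'echo "broken" > /usr/share/nginx/html/healthcheck.html'
```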
Soon (in about 1 minute, given our `interval`, `timeout` and `retries` configuration in the Dockerfile), the container will be reported unhealthy and replaced with a new container to run the task.
Runtime definition of Healthcheck
You can also override the command used to check health, its frequency and its retries while creating the service:
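A sketch using the health-related flags of `docker service create`; the command, timings and names are illustrative:

```
docker service create \
  --name tweet-app \
  --replicas 2 \
  --health-cmd "python /usr/share/nginx/html/healthcheck.py" \
  --health-interval 10s \
  --health-timeout 3s \
  --health-retries 2 \
  <registry-namespace>/tweet-app:hc
```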
Disable healthcheck
At runtime, while creating a service, you can disable the health check with the `--no-healthcheck` option. That will suppress any health check which has been defined in the base image.
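For example (service and image names are illustrative):

```
docker service create --name tweet-app-nohc --no-healthcheck <registry-namespace>/tweet-app:hc
```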
If the base image you are going to use has a `HEALTHCHECK` defined, you can also disable the health check during build time using `HEALTHCHECK NONE`.
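In a child Dockerfile (the base image name is illustrative):

```
FROM <registry-namespace>/tweet-app:hc
HEALTHCHECK NONE
```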
Checking the status
You can use the `docker container inspect` command to further review the state of your containers and the detailed health check command output:
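For example, to pull out just the health section of the container state (the container id is a placeholder):

```
docker container inspect --format '{{json .State.Health}}' <container-id>
```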
The recorded output will look slightly different in case of a timeout error, in case of failing checks, and in case of no failures.
Note: The output will contain a friendly message if one is printed by your healthcheck command.
3. Tooling and Automation
Now that we have covered the basic building blocks of chaos engineering with Docker, let’s take a look at some tools. Pumba is a fairly new but quite promising tool for chaos orchestration. The best thing is that it works well with a Swarm cluster; you just need to point it at the manager node. We can easily get it to work with the Docker UCP Client Bundle.
Example
First, we need to set up an isolated network where we will set up our application and test it out: `docker network create -d overlay tweet-app-net`
Now let’s set up a service using the health check from the previous examples:
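A sketch, reusing the illustrative image name from earlier and attaching the service to the overlay network:

```
docker service create \
  --name tweet-app \
  --network tweet-app-net \
  --replicas 2 \
  <registry-namespace>/tweet-app:hc
```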
Let’s ensure that the service has been started properly, with the requested number of replicas, and that they are healthy:
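For example (using the illustrative service name from above):

```
docker service ls --filter name=tweet-app   # REPLICAS should read 2/2
docker service ps tweet-app                 # tasks should be Running and containers healthy
```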
Now let’s use Pumba to randomly kill some of the containers under the service:
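A sketch following the same pattern as pumba’s own kill example quoted later in this post; the name pattern `^tweet-app` and the signal are illustrative, and the docker client is assumed to be pointed at the Swarm manager (e.g. via the UCP client bundle):

```
pumba --random kill --signal SIGKILL ^tweet-app
```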
You will see an output confirming that the container has been killed.
You will notice that as soon as the container is killed, the Swarm manager tries to restore the state back to the desired state, i.e. 2 healthy replicas.
You can also try to stop or remove a container with the various commands provided by `pumba`.
You can also use the `--interval` option to run the command at a regular interval to perform stress testing, e.g. to run the same kill command every 10 minutes:
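For example (again with the illustrative name pattern):

```
pumba --random --interval 10m kill --signal SIGKILL ^tweet-app
```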
Network delay
Let’s first take the example of a simple setup with a single node.
Set up the service by running this command against the single manager node of your newly initiated Swarm cluster:
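A sketch, assuming the node has just been initialised as a one-node Swarm and reusing the illustrative image name:

```
docker swarm init
docker service create --name tweet-app --replicas 2 <registry-namespace>/tweet-app:hc
```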
Fire a pumba command to introduce delays:
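A sketch of pumba’s netem delay sub-command; the duration, interface and delay values are illustrative, and depending on your image you may also need pumba’s `--tc-image` option so that the `tc` tool is available inside the container:

```
pumba netem --duration 5m --interface eth0 delay --time 3000 ^tweet-app
```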
Monitor the status of the containers running for the service:
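For example:

```
watch docker service ps tweet-app
```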
You will notice that, because of the network delays introduced by pumba, the containers are failing the health check:
Soon the unhealthy container would be removed:
And it will be replaced with a new container:
As soon as the healthcheck is executed, it will turn into a healthy one:
While the container is being replaced, you will notice that the pumba command fails (as the container it was attached to has been lost).
As you can see, pumba was able to introduce network delay, and the `HEALTHCHECK` in the image (or `--health-cmd` at the service level) helped us restart the containers which were slowing down. Well, at this time this is the most that Pumba and Swarm can do. I am hoping that, in times to come, Swarm service health checks will allow us to define auto-scale policies too.
Now, if we are running against a UCP setup or any “true” Swarm cluster which has worker and manager nodes, the pumba netem command will not work when you fire it from a client. This is unlike the kill command (and most of the other pumba commands), which do work against a Swarm cluster. I came up with a simple solution to work around it.
Pumba in a container
Well, you can run pumba in a container, as the example on its GitHub page shows:
# once in a 10 seconds, try to kill (with `SIGTERM` signal) all containers named **hp(something)**
# on same Docker host, where Pumba container is running
$ docker run -d -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba pumba --interval 10s kill --signal SIGTERM ^hp
This means that we can create a service that runs on each node in your Swarm cluster and executes the pumba netem command. We need to change the `entrypoint` of the service and mount `/var/run/docker.sock` of the local node into the container so that pumba has access to the Docker daemon on each node.
The pumba command should essentially look only for containers that belong to your service, so you need to pass the list of containers to the pumba command in the entrypoint.
The command should try to inject delay only on a specific interface, i.e. the one used by `HEALTHCHECK`.
Now let’s run our pumba netem service:
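A sketch of such a global service; the service name, delay values and container name pattern are illustrative, and the argument layout follows pumba’s own docker-run example above:

```
docker service create \
  --name pumba-netem \
  --mode global \
  --mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
  gaiaadm/pumba \
  pumba netem --duration 5m --interface eth0 delay --time 3000 ^tweet-app
```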
The effect will be the same as in the previous example we ran on the one-node Swarm cluster.
If you are scripting this, then introduce a delay and then clean up the swarm service:
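A sketch, assuming the service name used above and a 5-minute chaos window:

```
sleep 300
docker service rm pumba-netem
```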
Simulate Packet loss
To be added
The bold test – Node failure
One of the reasons to run your containers in a Swarm cluster is to ensure fault tolerance against node failures. Let’s try to simulate a node failure and see how the Docker UCP manager handles it.
Let’s first list the various tasks of our application:
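For example (with the illustrative service name used earlier):

```
docker service ps tweet-app
```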
The output would look something like below, giving you details of the number of tasks, their IDs and the nodes on which they are running:
For the purpose of our testing, let’s try to fail one of the nodes, say `ip-10-100-2-67`.
Since I am running in AWS, I will find out the instance ID of the server and restart it. We can use `docker node ls` before and after the restart to note the node status.
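A sketch using the AWS CLI; the instance id is a placeholder for the one backing `ip-10-100-2-67`:

```
docker node ls                                        # node is Ready before the reboot
aws ec2 reboot-instances --instance-ids <instance-id>
docker node ls                                        # node shows as Down while it reboots
```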
As you can see, the node became unavailable once the reboot was executed.
In order to maintain the desired state of the service with 2 replicas, the Swarm manager will start a new container on one of the surviving nodes.
Sameer Kumar – Senior Solution Architect
Sameer Kumar is a Database Solution Architect working with Ashnik. He has worked on many complex setups and migration assignments for some of the key customers from the Retail, BFSI and Telecom sectors. Sameer is a certified PostgreSQL and EDB Postgres Plus Advanced Server Professional. He is also a certified Postgres trainer and has delivered many trainings for public and corporate batches. He is well versed with other RDBMSs, e.g. DB2, Oracle and SQL Server, and is also trained on NoSQL technologies, viz. MongoDB. He has worked closely with customers and helped them build analytics platforms on NoSQL databases and migrate from RDBMS to MongoDB. And while he’s in free mode, he loves to take his cycle around Singapore for a spin.