Overview
This article introduces the basics of failure testing and shows you how to add it to any software system that can be run with Docker.
What is Failure Testing?
Failure testing is the process of exploring how a system behaves when subjected to the application, host, and network problems that software systems experience when running in production, and verifying that the system behaves as expected.
Why Failure Testing?
Failure testing is important because it helps developers and operators understand how a system behaves when conditions range from “not ideal” to “on fire.” Traditional unit, integration, functional, and load testing are important processes, but are typically performed under ideal environmental conditions. Most delivery teams don’t say:
Hey — we’re about to deploy a new build to the test environment for a demo, can you please shut down a few random servers? – Practically No One
As a result, most testing is done under ideal conditions and effectively measures the best the system can do, ignoring the fact that hosts, networks, and application software will have problems. For example:
- a virtual machine reboots because the cloud provider needs to update the hypervisor on the physical host
- the optics in a network switch’s uplink path start to go bad resulting in delayed, retransmitted, or dropped packets
- an application pauses for 1-5 seconds for garbage collection under heavy load
How does your software system react when it encounters problems?
Unfortunately, if the system is not tested under typical or worst-case failure conditions in pre-production, your customers will end up doing it for you in production. Your customers will definitely remember when the system behaves poorly, because they are unable to complete their tasks or have lost work. Additionally, public shaming of Internet services has practically turned into a sport on social media, so the impact of an incident on the brand’s reputation may last long after the system recovers.
Learning how your system handles failure in production and then fixing it on the fly is painful because:
- the team may not have seen this kind of failure before, so it’s difficult to recognize
- instrumentation may not be in the right place to even detect it, so Mean Time to Detection (MTTD) will be high
- the resolution may require a code or system change, so Mean Time to Repair (MTTR) will be high
- customers are experiencing problems, so the pressure is on to Fix It Now
You can learn a tremendous amount about a system when it fails:
- does the webpage just go blank or does it fail with an error message?
- how long does it take for a failure message to come back when key processes fail?
- is the monitoring system able to detect this failure condition and make it visible? were the proper alerts sent?
Fortunately, failure testing can be performed in non-production environments under controlled circumstances, where the delivery team can safely and efficiently perform in-depth explorations of how the system fails. Once the team is confident the system is resilient, they can even move on to performing failure testing in production!
How to do Failure Testing
Netflix has made failure testing famous with the Simian Army and legendary reports of improved resilience. But what if:
- your system does not run on AWS and so cannot use the Chaos Monkey?
- you would like to test the resilience of individual application processes rather than entire systems or datacenters?
- you would like to test during the integration test phase of your continuous delivery pipeline prior to deployment?
Container technology, and Docker in particular, enables you to manipulate the application’s environment in a way that:
- is application and deployment-architecture agnostic – use it with existing applications in any datacenter
- is written once, deployed many times
- leaves no trace and can be added and removed easily in a number of places and processes:
  - Continuous Integration
  - functional and load test environments
  - production (business hours)
- enables testing in a precise and repeatable manner
  - the fault-injection process is codified and resource-efficient, e.g. you don’t need to overload a database to determine whether the application uses timeouts and backpressure properly
  - faults are injected into the network interfaces or processes of just the applications under test!
  - no ‘hands’ are required to unplug a switch or to introduce and then remove a soft failure such as packet loss or delay
Let’s get testing!
A fault is something that may go wrong within the system:
- network faults at the IP layer
  - delayed packets
  - packet loss
  - severed connections via a clean partition (100% packet loss)
- process faults via signals
  - SIGTERM (nice) – ask the process to clean up and exit
  - SIGKILL (buh-bye!) – kill the process without an opportunity for cleanup
  - SIGSTOP (pause) – stop executing, e.g. to simulate a GC pause!
  - SIGCONT (continue) – resume executing, e.g. after a GC pause
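To make these concrete, here is a minimal, illustrative sketch of injecting a few of these faults by hand on a Linux host. It assumes root (or the NET_ADMIN capability), the tc tool from iproute2, and a hypothetical container named some_container; gremlins will automate the network portion of this for us below.

# network faults via tc/netem (assumes Linux, iproute2, and root or NET_ADMIN)
tc qdisc add dev eth0 root netem delay 100ms 10ms  # delay packets by ~100ms +/- 10ms
tc qdisc change dev eth0 root netem loss 25%       # drop 25% of packets
tc qdisc change dev eth0 root netem loss 100%      # clean partition: drop everything
tc qdisc del dev eth0 root                         # clear the network fault

# process faults via signals, aimed at a hypothetical container's main process
docker kill --signal=SIGTERM some_container  # ask the process to clean up and exit
docker kill --signal=SIGSTOP some_container  # pause it, simulating a GC pause
docker kill --signal=SIGCONT some_container  # resume it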
Gremlins is an open source project that orchestrates failure testing by creating lists of faults and recoveries that are triggered periodically and with specifiable probability. The network fault injection techniques from Fastly’s Avalanche tool have been incorporated into the QualiMente fork of gremlins.
If you would like to follow-along, please:
- have docker-compose installed (included with Docker Toolbox)
- clone the QualiMente faulty-cat repo from GitHub, which will be used for the examples in this article:
git clone https://github.com/qualimente/faulty-cat.git && cd faulty-cat
Introducing Faulty Cat
Faulty Cat is a super-simple client-server application built with netcat, and we’ll use it here to demonstrate testing a clean failure of the network between the client and server.
The server listens for and prints messages sent to port 4242:
#!/bin/sh
port="4242"
echo "starting server on ${port}"
while true; do
  nc -l -p ${port} -vv -k -q 60
done;
The client sends the server a message once per second:
#!/bin/sh
while true; do
  msg="$(date '+%Y-%m-%d %H:%M:%S') hello from ${HOSTNAME}"
  echo ${msg}
  echo ${msg} | nc -w 1 server 4242
  sleep 1
done;
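If you’d like to sanity-check the netcat plumbing outside of Docker first, something like the following works in two terminals, assuming a netcat build that accepts these flags (netcat variants differ) and substituting localhost for the server hostname used above:

# terminal 1: listen the way the server does
nc -l -p 4242 -vv

# terminal 2: send one message the way the client does
echo "$(date '+%Y-%m-%d %H:%M:%S') hello from ${HOSTNAME}" | nc -w 1 localhost 4242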
The client and server are composed together via the ‘normal’ docker-compose.yml:
version: '2'

networks:
  faulty-cat:

services:
  server:
    build: server
    expose:
      - "4242"
    hostname: server
    networks:
      - faulty-cat

  client:
    build: client
    networks:
      - faulty-cat
    links:
      - server
When started, the application will generate output showing that it is healthy:
faulty-cat(master) $ docker-compose --file normal.docker-compose.yml up --build --force-recreate
Building server
... snip ...
Successfully built 762c60da0061
Building client
... snip ...
Successfully built e9c560c5a120
Recreating faultycat_server_1
Recreating faultycat_client_1
Attaching to faultycat_server_1, faultycat_client_1
server_1 | listening on [any] 4242 ...
client_1 | 2016-04-24 20:10:24 hello from 6ad1b57b142f
server_1 | connect to [172.18.0.2] from faultycat_client_1.faultycat_faulty-cat [172.18.0.4] 45306
server_1 | 2016-04-24 20:10:24 hello from 6ad1b57b142f
server_1 | sent 0, rcvd 44
server_1 | listening on [any] 4242 ...
client_1 | 2016-04-24 20:10:25 hello from 6ad1b57b142f
server_1 | connect to [172.18.0.2] from faultycat_client_1.faultycat_faulty-cat [172.18.0.4] 45308
server_1 | 2016-04-24 20:10:25 hello from 6ad1b57b142f
server_1 | sent 0, rcvd 44
server_1 | listening on [any] 4242 ...
The server reports that it received hello from 6ad1b57b142f, which is the client’s Docker-generated hostname. Super — everything is working correctly!
Network Failure Testing
Now let’s test how the application behaves when injecting faults into the network layer with the following gremlins fault profile:
from gremlins import faults, metafaults, triggers, tc

clear_network_faults = faults.clear_network_faults()
introduce_partition = faults.introduce_network_partition()
introduce_latency = faults.introduce_network_latency()

FIVE_SECONDS = 5

profile = [
  # clear any existing configurations
  triggers.OneShot(clear_network_faults),

  # every 5 seconds, either clear faults, introduce a latency, or a partition
  # other faults are available, but let's start simply
  triggers.Periodic(
      FIVE_SECONDS,
      metafaults.pick_fault([
        (30, clear_network_faults),
        (10, introduce_latency),
        (10, introduce_partition),
      ])),
]
This fault-injection profile will:
- clear any existing network faults on startup
- every 5 seconds, decide whether to:
  - clear network faults, allowing the client and server to communicate normally, with selection weight 30
  - introduce additional network latency in communication, with selection weight 10
  - introduce a total network partition preventing communication, with selection weight 10

Since the weights sum to 50, the profile should clear faults roughly 60% of the time and introduce latency or a partition roughly 20% of the time each.
These faults are injected into the application when using the ‘faulty’ docker-compose.yml config, which adds the qualimente/gremlins service to the faulty-cat application definition:
version: '2'

networks:
  faulty-cat:

services:
  server:
    build: server
    expose:
      - "4242"
    hostname: server
    networks:
      - faulty-cat

  client:
    build: client
    networks:
      - faulty-cat
    links:
      - server

  gremlins:
    image: qualimente/gremlins
    volumes:
      - ./server/gremlins/profiles:/app/gremlins/profiles
    command: gremlins -m gremlins.profiles.faulty_cat -p faulty_cat.profile
    network_mode: "service:server"
    cap_add:
      - NET_ADMIN
Note that the faulty-cat client and server configurations do not change; however, the gremlins process is:
- configured with a fault profile defined in the application’s source repository by mounting the server’s gremlins profiles directory as a volume:
volumes:
  - ./server/gremlins/profiles:/app/gremlins/profiles
- given access to the server’s network interface with
network_mode: "service:server"
- given the capability to administer the network interface with the NET_ADMIN capability:
cap_add:
  - NET_ADMIN
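Because gremlins shares the server’s network namespace, any netem qdisc it installs should be visible from either container. As an illustrative check while the application is running (container name taken from the compose output below), you could ask tc what is configured:

docker exec faultycat_gremlins_1 tc qdisc show dev eth0

While a partition fault is active, this should report a netem qdisc with loss 100%; when faults are cleared, the default qdisc returns.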
Now when faulty-cat runs, the output is much different, as gremlins uses the traffic control (tc) program and Linux’s network emulator (netem) to inject network faults. The annotated log output below shows:
- the faulty-cat application starting up with server, client, and gremlins processes
- the client sending a message
- the server process being partitioned from the client
- the client experiencing timeouts and an inability to connect on its second message attempt (remember, the client retries in a loop approximately every 1s)
- gremlins picking the partition fault again and the client continuing to fail
- network faults being cleared at 20:30:15,743, after which the client’s next attempts succeed!
faulty-cat(master) $ docker-compose --file faulty.docker-compose.yml up --build --force-recreate
Building server
... snip ...
Successfully built 762c60da0061
Building client
... snip ...
Successfully built e9c560c5a120
Recreating faultycat_server_1
Recreating faultycat_gremlins_1
Recreating faultycat_client_1
Attaching to faultycat_server_1, faultycat_gremlins_1, faultycat_client_1
server_1   | starting server on 4242
server_1   | listening on [any] 4242 ...
gremlins_1 | 2016-04-24 20:30:00,589 tc INFO Clearing network faults
gremlins_1 | 2016-04-24 20:30:00,589 procutils INFO running ['tc', 'qdisc', 'del', 'dev', 'eth0', 'root']
gremlins_1 | RTNETLINK answers: No such file or directory
gremlins_1 | 2016-04-24 20:30:00,642 gremlin INFO Started profile
gremlins_1 | 2016-04-24 20:30:00,643 triggers INFO Periodic trigger starting
gremlins_1 | 2016-04-24 20:30:00,643 triggers INFO Periodic triggering fault <function do at 0x7fd5777d2320>
gremlins_1 | 2016-04-24 20:30:00,643 metafaults INFO pick_fault triggered

# client sends a message
client_1   | 2016-04-24 20:30:00 hello from acff8eea877c

# note: existing network faults are cleared before configuring a new one to avoid them becoming additive
gremlins_1 | 2016-04-24 20:30:00,643 tc INFO Clearing network faults
gremlins_1 | 2016-04-24 20:30:00,643 procutils INFO running ['tc', 'qdisc', 'del', 'dev', 'eth0', 'root']
gremlins_1 | RTNETLINK answers: No such file or directory

# introduce a network partition wherein the server will no longer be reachable
gremlins_1 | 2016-04-24 20:30:00,692 tc INFO Adding network fault: netem loss 100%
gremlins_1 | 2016-04-24 20:30:00,692 procutils INFO running ['tc', 'qdisc', 'add', 'dev', 'eth0', 'root', 'netem', 'loss', '100%']

# client experiences timeout from first message and inability to connect to server on second message attempt
client_1   | server [172.18.0.2] 4242 (?) : Connection timed out
client_1   | 2016-04-24 20:30:02 hello from acff8eea877c
client_1   | server [172.18.0.2] 4242 (?) : No route to host
client_1   | 2016-04-24 20:30:04 hello from acff8eea877c

# gremlins picks network partition again
gremlins_1 | 2016-04-24 20:30:05,706 triggers INFO Periodic triggering fault <function do at 0x7fd5777d2320>
gremlins_1 | 2016-04-24 20:30:05,707 metafaults INFO pick_fault triggered
gremlins_1 | 2016-04-24 20:30:05,707 tc INFO Clearing network faults
gremlins_1 | 2016-04-24 20:30:05,707 procutils INFO running ['tc', 'qdisc', 'del', 'dev', 'eth0', 'root']
gremlins_1 | 2016-04-24 20:30:05,712 tc INFO Adding network fault: netem loss 100%
gremlins_1 | 2016-04-24 20:30:05,712 procutils INFO running ['tc', 'qdisc', 'add', 'dev', 'eth0', 'root', 'netem', 'loss', '100%']

# client continues failing to connect to the server
client_1   | server [172.18.0.2] 4242 (?) : Connection timed out
client_1   | 2016-04-24 20:30:06 hello from acff8eea877c
client_1   | server [172.18.0.2] 4242 (?) : No route to host
client_1   | 2016-04-24 20:30:08 hello from acff8eea877c
client_1   | server [172.18.0.2] 4242 (?) : Connection timed out
... snip ~12 seconds of failure ...

# gremlins clears network faults, allowing communication to proceed
gremlins_1 | 2016-04-24 20:30:15,742 triggers INFO Periodic triggering fault
gremlins_1 | 2016-04-24 20:30:15,742 metafaults INFO pick_fault triggered
gremlins_1 | 2016-04-24 20:30:15,742 tc INFO Clearing network faults
gremlins_1 | 2016-04-24 20:30:15,743 procutils INFO running ['tc', 'qdisc', 'del', 'dev', 'eth0', 'root']
client_1   | server [172.18.0.2] 4242 (?) : No route to host

# successful message delivery!
client_1   | 2016-04-24 20:30:16 hello from acff8eea877c
server_1   | connect to [172.18.0.2] from faultycat_client_1.faultycat_faulty-cat [172.18.0.4] 45456
client_1   | 2016-04-24 20:30:16 hello from acff8eea877c
server_1   | sent 0, rcvd 44
server_1   | listening on [any] 4242 ...
client_1   | 2016-04-24 20:30:17 hello from acff8eea877c
server_1   | connect to [172.18.0.2] from faultycat_client_1.faultycat_faulty-cat [172.18.0.4] 45458
server_1   | 2016-04-24 20:30:17 hello from acff8eea877c
server_1   | sent 0, rcvd 44
server_1   | listening on [any] 4242 ...
client_1   | 2016-04-24 20:30:18 hello from acff8eea877c
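When you’re done exploring, tear the application down as usual:

docker-compose --file faulty.docker-compose.yml down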
Conclusion
Now that faulty-cat has a precise and repeatable way to inject network and other faults into the system, the delivery team can:
- determine whether the application uses fault tolerance and resilience techniques such as timeouts, retries, and restarts properly to meet availability objectives
- verify that the monitoring system detects the sort of problems the team cares about
- ensure that, once changes are made to improve the application’s resilience, the application stays resilient by integrating failure testing into CI and functional testing processes (a sketch follows below)
All of this can be done simply and without changing the application under test, so you can easily perform failure testing of arbitrary applications!
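For example, a CI smoke step could run the faulty composition for a bounded time and assert on the client’s behavior. This is only a sketch, assuming the compose files from this article and a deliberately crude log check; a real pipeline would assert against your actual availability objectives:

# run the application with fault injection for ~60 seconds, detached
docker-compose --file faulty.docker-compose.yml up --build -d
sleep 60

# capture the client's logs and make a crude assertion that it kept working
docker-compose --file faulty.docker-compose.yml logs client > client.log
grep -q "hello from" client.log

# clean up
docker-compose --file faulty.docker-compose.yml down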
If you would like to learn more about failure testing with Docker, how to integrate this kind of approach with your application and development processes, or to build a failure testing system that can operate at scale, please contact us. We are looking for a partner to push this space forward!
Resources
The following resources were used or referenced in this article:
- QualiMente faulty-cat application: https://github.com/qualimente/faulty-cat
- QualiMente gremlins fork: https://github.com/qualimente/gremlins
- The original gremlins application by Todd Lipcon:
- source: https://github.com/toddlipcon/gremlins
- Thank you Todd and Cloudera for publishing a great starting point to explore failure testing.
- Fastly’s Avalanche network failure testing program: https://github.com/fastly/Avalanche