Overview

This article introduces the basics of Failure Testing and shows you how to add failure testing to any software system that can be run with Docker.

What is Failure Testing?

Failure testing is the process of exploring how a system behaves when subjected to the application, host, and network problems it will experience when running in production, and verifying that the system behaves as expected.

Why Failure Testing?

Failure testing is important because it helps developers and operators understand how a system behaves when conditions range from “not ideal” to “on fire.”  Traditional unit, integration, functional, and load testing are important processes, but are typically performed under ideal environmental conditions.  Most delivery teams don’t say:

Hey — we’re about to deploy a new build to the test environment for a demo, can you please shut down a few random servers? -Practically No One

As a result, most testing is done under ideal conditions and effectively determines how well the system can operate at its best, ignoring that hosts, networks, and application software will have problems.  For example:

  • a virtual machine reboots because the cloud provider needs to update the hypervisor on the physical host
  • the optics in a network switch’s uplink path start to go bad resulting in delayed, retransmitted, or dropped packets
  • an application pauses for 1-5 seconds for garbage collection under heavy load

How does your software system react when it encounters problems?

Unfortunately, if the system is not tested under typical or worst-case failure conditions in pre-production, your customers will end up doing it for you in production.  Your customers will definitely remember when the system behaves poorly because they are unable to complete their tasks or have lost work.  Additionally, public shaming of Internet services has practically turned into a sport on social media, so the impact of an incident on the brand’s reputation may last long after the system recovers.

[Image: tweet from an unhappy customer]

Learning how your system handles failure in production and then fixing it on the fly is painful because:

  1. the team may not have seen this kind of failure before, so it’s difficult to recognize
  2. instrumentation may not be in the right place to even detect it, so Mean Time to Detection (MTTD) will be high
  3. the resolution may require a code or system change, so Mean Time to Repair (MTTR) will be high
  4. customers are experiencing problems, so the pressure is on to Fix It Now

You can learn a tremendous amount about a system when it fails:

  • does the webpage just go blank or does it fail with an error message?
  • how long does it take for a failure message to come back when key processes fail?
  • is the monitoring system able to detect this failure condition and make it visible?  Were the proper alerts sent?

Fortunately, failure testing can be performed in non-production environments under controlled circumstances, so the delivery team can explore in depth how the system fails, safely and efficiently.  Once the team is confident the system is resilient, they can even move on to performing failure testing in production!

How to do Failure Testing

Netflix has made Failure Testing famous with the Simian Army and legendary reports of improved resilience.  But what if:

  • your system does not run on AWS and so cannot use the Chaos Monkey?
  • you would like to test the resilience of individual application processes rather than entire systems or datacenters?
  • you would like to test during the integration test phase of your continuous delivery pipeline prior to deployment?

Container technology, and Docker in particular, enables you to manipulate the application’s environment in a way that:

  • is application and deployment-architecture agnostic – use it with existing applications in any datacenter
  • is written once, deployed many times
  • leaves no trace and can be added and removed easily in a number of places and processes:
    • Continuous Integration
    • functional and load test environments
    • production (business hours)
  • enables testing in a precise and repeatable manner
    • fault-injection process is codified and resource-efficient, e.g. you don’t need to overload a database to determine if the application uses timeouts and backpressure properly
    • injects faults into the network interfaces or processes for just the applications under test!
    • doesn’t require ‘hands’ to unplug a switch or introduce and then remove a soft failure such as packet loss or delay

Let’s get testing!

A fault is something that may go wrong within the system; a sketch of injecting these faults by hand follows this list:

  • network faults in the IP layer
    • delayed packets
    • packet loss
    • sever the connection with a clean partition (100% packet loss)
  • process faults via signals
    • SIGTERM (nice) – ask process to clean-up and exit
    • SIGKILL (buh-bye!) – kill process without opportunity for cleanup
    • SIGSTOP (pause) – stop executing, i.e. simulate a GC pause!
    • SIGCONT (continue) – resume executing, i.e. after a GC pause
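
If you want a feel for what these faults look like outside of any orchestration tool, here is a rough sketch of injecting them by hand.  The interface name (eth0), container name (app), and specific parameters are illustrative assumptions, not values from the faulty-cat repo:

# network faults via tc and the netem queueing discipline (requires NET_ADMIN)
tc qdisc add dev eth0 root netem delay 100ms 20ms   # add ~100ms of latency with 20ms of jitter
tc qdisc change dev eth0 root netem loss 100%       # change it to a full partition (100% loss)
tc qdisc del dev eth0 root                          # clear the injected network faults

# process faults by signalling a container's main process
docker kill --signal=SIGTERM app    # ask the process to clean up and exit
docker kill --signal=SIGKILL app    # kill without an opportunity for cleanup
docker kill --signal=SIGSTOP app    # pause execution, e.g. simulate a GC pause
docker kill --signal=SIGCONT app    # resume execution after the pause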

Gremlins is an open source project that orchestrates failure testing by creating lists of faults and recoveries that are triggered periodically and with specifiable probability.  The network fault injection techniques from fast.ly’s avalanche tool have been incorporated into the QualiMente fork of gremlins.

If you would like to follow-along, please:

  1. have docker-compose installed (included with Docker Toolbox)
  2. clone the QualiMente faulty-cat repo on GitHub, which will be used for the examples in this article (its approximate layout is sketched after this list):
    git clone https://github.com/qualimente/faulty-cat.git && cd faulty-cat
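
After cloning, the pieces used below are laid out roughly like this (approximate; check the repo for the exact structure):

faulty-cat/
  client/                      # client Dockerfile and message-sending script
  server/                      # server Dockerfile and listener script
    gremlins/profiles/         # fault-injection profile mounted into the gremlins container
  normal.docker-compose.yml    # client + server only
  faulty.docker-compose.yml    # client + server + the gremlins fault injector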


Introducing Faulty Cat

Faulty Cat is a super-simple client-server application built with netcat; we’ll use it here to demonstrate testing a clean failure of the network between the client and server.

The server listens for and prints messages sent to port 4242:

#!/bin/sh

port="4242"
echo "starting server on ${port}"

while true; do 
  nc -l -p ${port} -vv -k -q 60
done;

The client sends the server a message once per second:

#!/bin/sh

while true; do
  msg="$(date '+%Y-%m-%d %H:%M:%S') hello from ${HOSTNAME}"
  echo ${msg}
  echo ${msg} | nc -w 1 server 4242
  sleep 1
done;
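
If you would like to sanity-check the scripts outside of Docker, a minimal manual exchange looks roughly like this (assuming a netcat build that supports the flags above, and substituting localhost for the Docker-provided server hostname; the script path is illustrative):

# terminal 1: run the server script shown above
sh server/server.sh

# terminal 2: send a single message by hand
echo "hello from my laptop" | nc -w 1 localhost 4242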

The client and server are composed together via the ‘normal’ docker-compose.yml:

version: '2'

networks:
  faulty-cat:

services:
    server:
        build: server
        expose:
            - "4242"
        hostname: server
        networks:
            - faulty-cat

    client:
        build: client
        networks:
            - faulty-cat
        links:
            - server

When started, the application will generate output showing that the application is healthy:

faulty-cat(master) $ docker-compose --file normal.docker-compose.yml up --build --force-recreate
Building server
... snip ...
Successfully built 762c60da0061
Building client
... snip ...
Successfully built e9c560c5a120
Recreating faultycat_server_1
Recreating faultycat_client_1
Attaching to faultycat_server_1, faultycat_client_1
server_1  | listening on [any] 4242 ...
client_1  | 2016-04-24 20:10:24 hello from 6ad1b57b142f
server_1  | connect to [172.18.0.2] from faultycat_client_1.faultycat_faulty-cat [172.18.0.4] 45306
server_1  | 2016-04-24 20:10:24 hello from 6ad1b57b142f
server_1  |  sent 0, rcvd 44
server_1  | listening on [any] 4242 ...
client_1  | 2016-04-24 20:10:25 hello from 6ad1b57b142f
server_1  | connect to [172.18.0.2] from faultycat_client_1.faultycat_faulty-cat [172.18.0.4] 45308
server_1  | 2016-04-24 20:10:25 hello from 6ad1b57b142f
server_1  |  sent 0, rcvd 44
server_1  | listening on [any] 4242 ...

The server reports that it received hello from 6ad1b57b142f, which is the client’s Docker-generated hostname. Super — everything is working correctly!
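
When you have seen enough healthy traffic, stop the stack with Ctrl-C; you can also remove the containers and network before moving on:

docker-compose --file normal.docker-compose.yml down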

Network Failure Testing

Now let’s test how the application behaves when injecting faults into the network layer with the following gremlins fault profile:

from gremlins import faults, metafaults, triggers, tc

clear_network_faults = faults.clear_network_faults()
introduce_partition = faults.introduce_network_partition()
introduce_latency = faults.introduce_network_latency()

FIVE_SECONDS=5

profile = [
    # clear any existing configurations
    triggers.OneShot(clear_network_faults),
    # every 5 seconds, either clear faults, introduce latency, or introduce a partition
    # other faults are available, but let's start simply
    triggers.Periodic(
        FIVE_SECONDS, metafaults.pick_fault([
            (30, clear_network_faults),
            (10, introduce_latency),
            (10, introduce_partition),
        ])),
]

This fault-injection profile will:

  1. clear any existing network faults on startup
  2. every 5 seconds, decide whether to:
    1. clear network faults, allowing the client and server to communicate normally, with selection weight 30
    2. introduce additional network latency in communication, with selection weight 10
    3. introduce a total network partition preventing communication, with selection weight 10

Since the weights are relative, each 5-second decision should clear faults roughly 60% of the time and pick each of the two faults roughly 20% of the time.

These faults are injected into the application when using the ‘faulty’ compose config (faulty.docker-compose.yml), which adds the qualimente/gremlins service to the faulty-cat application definition:

version: '2'

networks:
  faulty-cat:

services:
    server:
        build: server
        expose:
            - "4242"
        hostname: server
        networks:
            - faulty-cat

    client:
        build: client
        networks:
            - faulty-cat
        links:
            - server

    gremlins:
        image: qualimente/gremlins
        volumes:
            - ./server/gremlins/profiles:/app/gremlins/profiles
        command: gremlins -m gremlins.profiles.faulty_cat -p faulty_cat.profile
        network_mode: "service:server"
        cap_add:
            - NET_ADMIN

Note that the faulty-cat client and server configurations do not change; however, the gremlins process is:

  1. configured with a fault profile defined in the application’s source repository by mounting the server’s gremlins profiles directory as a volume:
            volumes:
                - ./server/gremlins/profiles:/app/gremlins/profiles
  2. given access to the server’s network interface with network_mode: "service:server" (you can inspect this shared interface with the sketch after this list)
  3. given the capability to administer the network interface with the NET_ADMIN capability:
            cap_add:
                - NET_ADMIN
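
Because gremlins shares the server’s network namespace, you can watch the injected queueing discipline come and go from outside the container.  For example (the container name assumes the default compose project name shown in the logs below):

docker exec faultycat_gremlins_1 tc qdisc show dev eth0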

Now when faulty-cat runs, the output is quite different, as Gremlins uses the traffic control (tc) program and Linux’s network emulator (netem) to inject network faults.  The annotated log output below shows:

  1. the faulty-cat application starting up with server, client, and gremlins processes
  2. the client sending a message
  3. the server process being partitioned from the client
  4. the client experiencing a timeout and an inability to connect on its next attempts (remember, the client retries in a loop approximately every 1s)
  5. gremlins picking the partition fault again, so the client continues to fail
  6. at 20:30:15,743, network faults being cleared, after which the client’s next attempts succeed!

faulty-cat(master) $ docker-compose --file faulty.docker-compose.yml up --build --force-recreate
Building server
... snip ...
Successfully built 762c60da0061
Building client
... snip ...
Successfully built e9c560c5a120
Recreating faultycat_server_1
Recreating faultycat_gremlins_1
Recreating faultycat_client_1
Attaching to faultycat_server_1, faultycat_gremlins_1, faultycat_client_1
server_1    | starting server on 4242
server_1    | listening on [any] 4242 ...
gremlins_1  | 2016-04-24 20:30:00,589 tc           INFO     Clearing network faults
gremlins_1  | 2016-04-24 20:30:00,589 procutils    INFO     running ['tc', 'qdisc', 'del', 'dev', 'eth0', 'root']
gremlins_1  | RTNETLINK answers: No such file or directory
gremlins_1  | 2016-04-24 20:30:00,642 gremlin      INFO     Started profile
gremlins_1  | 2016-04-24 20:30:00,643 triggers     INFO     Periodic trigger starting
gremlins_1  | 2016-04-24 20:30:00,643 triggers     INFO     Periodic triggering fault <function do at 0x7fd5777d2320>
gremlins_1  | 2016-04-24 20:30:00,643 metafaults   INFO     pick_fault triggered

# client sends a message
client_1    | 2016-04-24 20:30:00 hello from acff8eea877c

# note: existing network faults are cleared before configuring a new one to avoid them becoming additive
gremlins_1  | 2016-04-24 20:30:00,643 tc           INFO     Clearing network faults
gremlins_1  | 2016-04-24 20:30:00,643 procutils    INFO     running ['tc', 'qdisc', 'del', 'dev', 'eth0', 'root']
gremlins_1  | RTNETLINK answers: No such file or directory

# introduce a network partition wherein the server will no longer be reachable
gremlins_1  | 2016-04-24 20:30:00,692 tc           INFO     Adding network fault: netem loss 100%
gremlins_1  | 2016-04-24 20:30:00,692 procutils    INFO     running ['tc', 'qdisc', 'add', 'dev', 'eth0', 'root', 'netem', 'loss', '100%']

# client experiences timeout from first message and inability to connect to server on second message attempt
client_1    | server [172.18.0.2] 4242 (?) : Connection timed out
client_1    | 2016-04-24 20:30:02 hello from acff8eea877c
client_1    | server [172.18.0.2] 4242 (?) : No route to host
client_1    | 2016-04-24 20:30:04 hello from acff8eea877c
# gremlins picks network partition again 
gremlins_1  | 2016-04-24 20:30:05,706 triggers     INFO     Periodic triggering fault <function do at 0x7fd5777d2320>
gremlins_1  | 2016-04-24 20:30:05,707 metafaults   INFO     pick_fault triggered
gremlins_1  | 2016-04-24 20:30:05,707 tc           INFO     Clearing network faults
gremlins_1  | 2016-04-24 20:30:05,707 procutils    INFO     running ['tc', 'qdisc', 'del', 'dev', 'eth0', 'root']
gremlins_1  | 2016-04-24 20:30:05,712 tc           INFO     Adding network fault: netem loss 100%
gremlins_1  | 2016-04-24 20:30:05,712 procutils    INFO     running ['tc', 'qdisc', 'add', 'dev', 'eth0', 'root', 'netem', 'loss', '100%']

# client continues failing to connect to the server
client_1    | server [172.18.0.2] 4242 (?) : Connection timed out
client_1    | 2016-04-24 20:30:06 hello from acff8eea877c
client_1    | server [172.18.0.2] 4242 (?) : No route to host
client_1    | 2016-04-24 20:30:08 hello from acff8eea877c
client_1    | server [172.18.0.2] 4242 (?) : Connection timed out

... snip ~12 seconds of failure ...

# gremlins clears network faults, allowing communication to proceed
gremlins_1  | 2016-04-24 20:30:15,742 triggers     INFO     Periodic triggering fault <function do at 0x7fd5777d2320>
gremlins_1  | 2016-04-24 20:30:15,742 metafaults   INFO     pick_fault triggered
gremlins_1  | 2016-04-24 20:30:15,742 tc           INFO     Clearing network faults
gremlins_1  | 2016-04-24 20:30:15,743 procutils    INFO     running ['tc', 'qdisc', 'del', 'dev', 'eth0', 'root']
client_1    | server [172.18.0.2] 4242 (?) : No route to host

# successful message delivery!
client_1    | 2016-04-24 20:30:16 hello from acff8eea877c
server_1    | connect to [172.18.0.2] from faultycat_client_1.faultycat_faulty-cat [172.18.0.4] 45456
client_1    | 2016-04-24 20:30:16 hello from acff8eea877c
server_1    |  sent 0, rcvd 44
server_1    | listening on [any] 4242 ...
client_1    | 2016-04-24 20:30:17 hello from acff8eea877c
server_1    | connect to [172.18.0.2] from faultycat_client_1.faultycat_faulty-cat [172.18.0.4] 45458
server_1    | 2016-04-24 20:30:17 hello from acff8eea877c
server_1    |  sent 0, rcvd 44
server_1    | listening on [any] 4242 ...
client_1    | 2016-04-24 20:30:18 hello from acff8eea877c

Conclusion

Now that faulty-cat has a precise and repeatable way to inject network and other faults into the system, the delivery team can:

  1. determine whether the application uses fault tolerance and resilience techniques such as timeouts, retries, and restarts properly to meet availability objectives
  2. verify that the monitoring system detects the sort of problems the team cares about
  3. ensure that, once changes are made to improve resilience, the application stays resilient by integrating failure testing into the CI and functional testing processes (a sketch of one way to wire this into CI follows this list)
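
For example, a minimal (hypothetical) CI step might run the faulty stack for a bounded window and then assert that the server still received messages despite the injected faults:

# run the faulty stack in the background for a fixed window
docker-compose --file faulty.docker-compose.yml up --build -d
sleep 60

# assert that at least one message made it through to the server despite the faults
docker logs faultycat_server_1 2>&1 | grep -q "hello from" || exit 1

# clean up
docker-compose --file faulty.docker-compose.yml down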

All of this can be done simply and without changing the application under test, so you can easily perform failure testing of arbitrary applications!

If you would like to learn more about failure testing with Docker, how to integrate this kind of approach with your application and development processes, or to build a failure testing system that can operate at scale, please contact us.  We are looking for a partner to push this space forward!

Resources

The following resources were used or referenced in this article: