Security Chaos Monkey

Netflix implemented a wonderfully draconian tool it calls the chaos monkey, meant to drive systems, software and architectures to be robust and resilient, gracefully handling many different kinds of faults. The systems HAVE to be designed to be fault tolerant, because the chaos monkey cometh and everyone knows it.

I believe there is something here for information security. Such a concept translated to security would change the incentive structure for system architects, engineers and developers. Currently, much design and development is based around “best case” operations, knowingly or unknowingly incorporating sub-optimal security constructs.

For lack of a better name, I will call this thing the Security Chaos Monkey (SCM). The workings of a SCM are harder to conceive of than Netflix’s version: it’s somewhat straight forward to randomly kill processes, corrupt files, or shut off entire systems or networks, but it’s another thing to automate stealing data, attempting to plant malware, and so on. In concept, the SCM is similar to a vulnerability scanning system, except SCM’s function is exploitation, destruction, exfiltration, infection, drive wiping, credential stealing, defacement and so on.

One of the challenges with the SCM is the extreme diversity in potential avenues for initial exploitation and subsequent post exploitation activities, many of which will be highly specific to a given system.

Here are some possible attributes of a security chaos monkey:

• Agent attempts to copy random files using random user IDs to some location off the server

• Agent randomly disables local firewall

• Agent randomly sets up reverse shells

• Agent randomly starts listening services for command and control

• Agent randomly attempts to alter files

• Agent randomly connects a login session under an administrative ID to a penetration tester

An obvious limitation is that SCM likely would have a limited set of activities relative to the set of all possible malicious activities, and so system designers may simply tailor their security resilience to address the set of activities performed by SCM. This may still be a significant improvement over current activities.

The net idea of SCM is to impress upon architects, developers and administrators that the systems they build will be actively attacked on a continual basis, stripping away the hope that the system is protected by forces external to it. SCM would also have the effect of forcing our IT staff to develop a deeper understanding of, and appreciation for, attack methods and methods to defend against them.