Failure-testing system resilience as a service, from a chaos merchant
Gremlin Scenarios are templates for simulating major failures
Chaos engineering company Gremlin has launched Scenarios, "templates of interruptions in the real world" that make it easier to break your applications.
Gremlin announced the product at the 2019 Chaos Conf in San Francisco. The scenarios include traffic spikes, to test what happens under severe load; unreliable networks, for when your microservice API calls begin to take ages to respond; and region evacuation, for when a cloud region is no longer available.
The idea of chaos engineering is to cause a deliberate failure in order to investigate whether your application or system is resilient. Chaos engineering tools can consume 100% of the CPU, shut down a percentage of your hosts, make DNS calls stop responding, or introduce severe network latency, so you can find out whether planned resilience measures, such as failover systems, really work as designed, in the same way that you validate a backup by doing a test restore.
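The inject-then-verify loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Backend`, `InjectedFault`, `call_with_failover` and `run_experiment` are hypothetical names invented for the example, not part of Gremlin or any other product, and the "failure" is a flag on an in-memory object rather than a real outage.

```python
class InjectedFault(Exception):
    """Raised by a backend while a fault is being injected into it."""


class Backend:
    """Hypothetical stand-in for a service instance."""

    def __init__(self, name):
        self.name = name
        self.faulty = False  # flipped by the experiment, not by a real failure

    def handle(self, request):
        if self.faulty:
            raise InjectedFault(f"{self.name} is down (injected)")
        return f"{self.name} served {request}"


def call_with_failover(primary, secondary, request):
    """The planned resilience measure: fall back to the secondary."""
    try:
        return primary.handle(request)
    except InjectedFault:
        return secondary.handle(request)


def run_experiment(primary, secondary):
    """Fail the primary on purpose, check the failover, always roll back."""
    primary.faulty = True
    try:
        result = call_with_failover(primary, secondary, "GET /health")
        return result.startswith(secondary.name)
    finally:
        primary.faulty = False  # stop the injected fault whatever happened
```

Calling `run_experiment(Backend("primary"), Backend("secondary"))` returns `True` only if the request was actually served by the secondary while the primary was failing, which is the same kind of evidence a test restore gives you about a backup.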
We spoke with Gremlin's senior site reliability engineer (SRE), Tammy Butow, at the QCon conference in London. "The story begins with Netflix when they moved to AWS," she told us. "They thought, how do we make sure this works? They started by creating Chaos Monkey, which was later open-sourced. That was, if we shut down a server, is everything all right? That helped them provide feedback to AWS." Chaos Monkey is free, but it can be complex to implement.
"We are trying to avoid downtime and we are trying to avoid data loss," Butow added. "Before, when I worked at the National Australia Bank, we did disaster recovery tests. You have to do them to get your bank license. But if you are at a technology startup, there is no one to hold you accountable, to prove that your system is resilient and that you are taking care of your customers' data."
Faults injected by Gremlin are not simulated, except in the sense that they can be paused or eliminated. "If you do it the wrong way, it can be dangerous," Butow said.
The key is to start small. The "blast radius" of a test determines how wide its impact is. "I like to do a CPU attack first. It's the Hello World of chaos engineering," Butow said.
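To give a feel for what a CPU attack amounts to, here is a minimal sketch that keeps a single core busy for a fixed time and then stops by itself. It is an assumption-laden toy, not how Gremlin implements its attack: `burn_cpu` is a hypothetical helper, it loads only one core, and a real tool would spread the load across cores and hosts and offer a way to halt it remotely.

```python
import time


def burn_cpu(seconds):
    """Busy-loop on one core until the deadline, then stop by itself."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless work whose only job is to keep the core busy
    return iterations


if __name__ == "__main__":
    spins = burn_cpu(1.0)
    print(f"burned one core for 1 second ({spins} loop iterations)")
```

The built-in deadline is the important design choice: the experiment has a bounded duration, so a forgotten test cannot keep starving the machine.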
You can start by taking down only one or two servers, then expand to take out entire services or an entire region. A service like Gremlin provides an API and a control plane, so you can automate and schedule tests.
As in the world of security, many failures occur because people use services in unexpected ways. A common example is APIs. "When people create APIs, they don't believe anyone is going to abuse the API," Butow said. "As an SRE, I am always looking for how things can be broken."
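The start-small-then-expand approach lends itself to automation. The sketch below shows one way it might look, assuming a non-empty list of host names plus two callables you supply yourself: `attack`, which injects the fault into the chosen hosts, and `is_healthy`, which reports whether the system is still coping. All names here are hypothetical and none of them come from Gremlin's API.

```python
import math
import random


def pick_targets(hosts, percent, seed=0):
    """Pick a random `percent` of hosts, but always at least one.

    Assumes `hosts` is a non-empty list.
    """
    count = max(1, math.floor(len(hosts) * percent / 100))
    return random.Random(seed).sample(hosts, count)


def staged_experiment(hosts, stages, attack, is_healthy):
    """Run `attack` on a growing share of hosts; stop at the first unhealthy stage.

    Returns the percentages that completed with the system still healthy.
    """
    passed = []
    for percent in stages:
        attack(pick_targets(hosts, percent))
        if not is_healthy():
            break  # do not widen the impact past a failure
        passed.append(percent)
    return passed
```

A call such as `staged_experiment(hosts, [5, 25, 100], attack, is_healthy)` would hit roughly 5% of hosts first and only move on to 25% and then everything if each earlier stage left the system healthy.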
That a system cannot be called resilient until you've seen it survive massive failures is common sense, but as with backups, many organizations still end up learning the hard way. ®