One of the most remarkable aspects of the virtual DevOps Enterprise Summit in Las Vegas this year was seeing how large companies are radically transforming (or have already transformed) themselves through DevOps practices. A lack of change often characterizes large organizations, so seeing government organizations and Fortune 500 companies remodel themselves into DevOps-driven entities shows how much value there is in applying DevOps. There were also war stories from companies, large and small, that didn’t do DevOps right and paid for it. These cautionary tales generally had happy endings, but the willingness to share these insights with a public audience was amazing. While the large companies and the stories of mayhem stood out, the event’s greatest value came from the companies that have been doing DevOps for a while and could share how they’ve figured out how to do it even better.

So what about the presentation content? It is often said that DevOps is about culture as much as technology or process. It is perhaps not surprising, then, that many of the talks, even when focusing on technology, were at least in part about people. While there are certainly tech tools that mesh well with methodologies (e.g., Lean/Agile) and that many people might point to and say “this is DevOps,” in the end there was a desire to make the people embedded in a DevOps machine happy, fulfilled, and successful. Whether an organization does DevOps or not often seems to drive executives, developers, and operations engineers in opposite directions.

Thomas Limoncelli (Stack Overflow) presented “Low Context DevOps: 3 Ways to End Knowledge Frustration.” The talk centered on the importance of realizing that “high context” situations are difficult for people to understand immediately: think of standing confused as a group of close friends laughs at an inside joke. “Low context” situations provide a way to eliminate that newbie confusion from the outset. In DevOps, the low context approach puts information where it is needed (e.g., error messages, CI/CD control panel messages, alert messages). Part of the overall message was that documentation is important and, as we all know, often ignored or overlooked, but putting the documentation in the right place is just as critical. Dropping an actual link to additional troubleshooting information into an error message is one example. Tom provided some implementation suggestions in the slide below.

The low context concept applies across user groups. Appropriate “signage” can turn high-touch onboarding and customer support into something much more self-service. For SREs and developers addressing issues, low context processes can provide the necessary troubleshooting references at the appropriate time, rather than requiring the typical search through code, wikis, and online resources like Stack Overflow.
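To make the “link in the error message” idea concrete, here is a minimal Python sketch. It is not taken from Tom’s talk: the `LowContextError` class, the `check_disk_space` function, and the `docs.example.com` URL are all hypothetical names invented for illustration.

```python
# Hypothetical documentation location; replace with a real wiki or runbook URL.
DOCS_BASE = "https://docs.example.com/troubleshooting"


class LowContextError(Exception):
    """An error that carries a pointer to its own troubleshooting page."""

    def __init__(self, message: str, doc_url: str) -> None:
        # Put the link directly in the text the operator will see.
        super().__init__(f"{message} (troubleshooting: {doc_url})")
        self.doc_url = doc_url


def check_disk_space(free_gb: float, required_gb: float) -> None:
    """Raise an error that says what went wrong and where to read more."""
    if free_gb < required_gb:
        raise LowContextError(
            f"Deploy needs {required_gb} GB free but only {free_gb} GB is available",
            f"{DOCS_BASE}/disk-space",
        )
```

Whoever hits this error, whether a new hire or an on-call engineer, gets the next step along with the failure instead of having to go hunting for it.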

Two other speakers ended up illustrating the same idea. Both looked to runbooks to provide the information needed to create low context environments in which customers and engineers alike can resolve incidents. Both argued that the standard approach to incident management is not automation-ready because of the non-deterministic nature of production systems in a modern CI/CD environment. The slide below from Damon Edwards’ (Rundeck) talk “Runbook Automation: Old News or a Key to Unlock Performance?” contrasts the conceptual, deterministic architecture of an application with the reality of running in production environments.

Both Edwards’ talk and Tina Huang’s (Transposit) “Speeding to Resolution with Human-in-the-Loop Automation” treated customer service as an extension of operations and considered the difficulty that operating modern cloud-native applications, combined with customer interactions, presents to SREs. Both touched on the idea that the ops mantra of “Automate all the Things” may not entirely be the way to go for incident response. In particular, incidents in the chaotic world of production operations are likely to be challenging to automate. Edwards called out that the understanding, adapting, and learning aspects are currently most effectively managed by human operators. Huang made the point that, between incident creation and resolution, a human operator’s ability to weigh options in a chaotic environment exceeds that of current automation approaches.

Both speakers championed the idea of runbooks for human operators that would either enable self-service incident resolution or help systems operators access necessary data about the system more efficiently. This is clearly, though not explicitly, a call for low context environments for customer service. They also see a role for automation in speeding up responses by providing the necessary context, options, and guardrails for triggering an automation that resolves the issue. In short, the runbook becomes a sort of guided checklist of options, with potential solutions (e.g., scripts) available as push-button automation. Actual results from applying this approach, provided by Damon, included 60% shorter incidents, 50% fewer escalations, and 99% faster turnaround times.
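The checklist-plus-push-button idea can be sketched in a few lines of Python. This is my own illustration and not how Rundeck or Transposit implement it; `RunbookStep`, `resolve_incident`, and the `choose` and `confirm` callbacks are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunbookStep:
    label: str                  # what the operator sees in the checklist
    context: str                # when and why to pick this option
    action: Callable[[], str]   # the automation to run; returns a result message


def resolve_incident(
    steps: List[RunbookStep],
    choose: Callable[[List[RunbookStep]], int],
    confirm: Callable[[RunbookStep], bool],
) -> str:
    """Let a human pick an option, then run it only if they confirm."""
    step = steps[choose(steps)]   # the human weighs the options
    if not confirm(step):         # guardrail before any automation fires
        return f"skipped: {step.label}"
    return step.action()          # the push-button part
```

In practice, `choose` and `confirm` might be wired to a chat prompt or a web form. The decision stays with the operator, and the automation only handles execution.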

If you haven’t figured it out, I lean towards low context environments (though I did not know what to call them before Tom Limoncelli’s talk). If I find one or two “putting the pieces together for you” talks, I generally consider the event worth the price of admission. While I do miss the ability to attend IRL events for the moment, in this well-run virtual event the opportunity to listen to and interact with thought leaders was probably enhanced by the expectation of virtual interaction. Look for more of the Opsani Team’s DOES 2020 takeaways here on the Opsani blog.