Have you ever wondered if incident reviews are more about learning or tracking actions? This question has been a topic of heated discussion in the incident management community, sparking debates at events like SEV0 and in articles like this one by Airbnb engineer Lorin Hochstein. Should incident reviews primarily focus on learning from the situation, or should they prioritize tracking actionable improvements? When is the right time to discuss actions, and are they merely a form of consolation?
In my experience, learning from incidents and identifying actionable steps go hand in hand. Attempting to separate them into distinct discussions or meetings, under the guise of “focusing on learning,” often proves counterproductive.
To delve deeper into this topic, let’s dissect what a well-rounded post-incident review should entail.
Running an incident review
Here’s a blueprint for conducting a productive post-incident review. While the specifics may vary based on the organization’s dynamics, this framework has proven effective for most teams.
Incident reviews are typically convened shortly after resolving the initial issue, with key stakeholders and subject matter experts in attendance. These meetings are cross-functional, involving representatives from engineering, customer support, and sometimes leadership.
In my view, it’s best to exclude individuals not directly involved in the incident from these meetings, as their presence can inhibit open dialogue among the team members. Walking through a mishap is stressful enough without feeling like you’re presenting to half of your organization.
The goals of an incident review
The primary aim of an incident review is simple: to align the team on what transpired, deepen understanding, and pinpoint areas for improvement. Learning and action discussions should be seen not as conflicting objectives but as complementary facets of the same conversation.
Discussing action items
When discussing action items, the focus isn’t on delving into minute details or setting aside dedicated time for this purpose. It’s about exploring potential improvements organically as they arise during the review process—whether it involves preventing recurrences, bolstering reliability, or minimizing future impacts.
Instances where such discussions might emerge include:
- “After deleting the pod, I noticed a missing step in the runbook. We should update it for clarity.”
- “The information seems restricted to the DevEx team group. We should disseminate it more widely in the #engineering channel.”
- “Considering the issue is specific to our version of Postgres, prioritizing an upgrade seems imperative.”
These statements naturally lead to further learning and refinement, enriching the review process rather than derailing it.
Learning from action item discussions
Action items identified during the review often trigger follow-up questions that deepen the learning. For instance:
- “After deleting the pod, I noticed a missing step in the runbook. We should update it for clarity.”
  - “Should individuals follow the runbook without a comprehensive understanding of each step?”
- “The information seems restricted to the DevEx team group. We should disseminate it more widely in the #engineering channel.”
  - “How is critical information typically shared with the broader team?”
- “Considering the issue is specific to our version of Postgres, prioritizing an upgrade seems imperative.”
  - “What’s our protocol for updating such software? Is there a rationale for sticking to an outdated version?”
By addressing action items at this level, participants not only suggest improvements but also gain a deeper understanding of the system.
Addressing common concerns
Having laid out the case, let’s address some common criticisms of this approach:
- “We should prioritize learning over action items.”
I believe these concepts aren’t mutually exclusive. Discussing action items in the context of incident reviews is integral to the learning process. If a conversation naturally leads to a suggestion for improvement, it should be shared rather than suppressed. Similarly, a blameless review can involve naming individuals involved, serving as a valuable learning tool.
- “Taking action leads to more incidents, not safety.”
I disagree with this notion. Evolving systems necessitate continuous improvements, and while not all changes post-incident may yield positive results, thoughtful adjustments based on insights are essential. Balancing actions against other priorities is crucial to prevent recency bias from skewing decision-making.
- “Tracking actions is merely a form of self-assurance.”
While some actions may be driven by external pressures or for appearances, most are well-intentioned endeavors aimed at enhancing systems. Refining the process to filter out ineffective actions is key, rather than discarding the entire concept of taking action.
Building on Lorin’s insights…