BLOG

6 Steps to Create Actionable Postmortems

June 17, 2024

Share this article:

Table of Contents:

In DevOps and IT operations, conducting a thorough postmortem after an incident is crucial for continuous improvement. This article explores best practices for creating effective postmortems, ensuring that your incident analysis won't be forgotten as soon as the danger has passed but will be comprehensive and actionable.

What is a Postmortem?

A postmortem in DevOps is a structured process conducted after an incident or failure to analyze what happened, identify the root cause, and implement corrective actions to prevent future occurrences. It involves a detailed examination of the timeline, impact assessment, and lessons learned, fostering a culture of continuous improvement and transparency without assigning blame. The postmortem document is the final output of this process, encapsulating all the gathered information, analyses, and planned actions to be shared with relevant stakeholders.

Benefits of Conducting Postmortems

By fostering a culture focused on learning and improvement through postmortems, organizations can strengthen their infrastructure and incident response processes, making them better prepared for future incidents. The benefits of conduction postmortem include:

Improved recovery times.
Enhanced team learning and knowledge sharing.
Prevention of future incidents.
Building a culture of continuous improvement.

ilert feature showcase: create postmortem from an incident — ilert interface: create postmortem right from the incident

Postmortem Key Steps

As it's recommended in ilert's Incident Management Guide, once a major incident is resolved, the incident response lead quickly designates one of the responders to manage the postmortem process.

Step 1: Assigner a Postmortem Owner

While creating the postmortem is a collaborative task, assigning a specific owner is essential for ensuring it is completed effectively. The postmortem owner is entrusted with several responsibilities, including:

‍

Scheduling the postmortem meeting
Investigating the incident (drawing in the necessary expertise from other teams as required)
Updating the postmortem document
Creating follow-up action items to prevent a similar occurrence in the future.

Step 2: Schedule a Meeting

It's crucial to invite people with relevant experience and expertise, so we highly recommend checking that you have the following specialists:

The Incident Response Lead
Owners of the services involved in the incident
Key engineers/responders who were involved in resolving the incident
Engineering and Product Managers for the impacted systems

Step 3: Build a Timeline

Document the sequence of events objectively, without interpreting or judging the causes of the incident. The timeline should begin before the incident starts and continue until it is resolved, noting significant changes in status or impact and key actions taken by responders.

‍

Examine the incident log in Slack or Microsoft Teams for critical decisions and actions. Also, include information that the team lacked during the incident but would have been helpful in hindsight. This information can be found in the monitoring data, logs, and deployments of the affected services.

Step 4: Documenting the Impact

Capture the incident impact from various angles. Note the duration of the observable impact, the total number of affected customers, how many reported the issue, and the severity of the functional disruption. Measure the impact using a business metric relevant to your product, such as the increase in API errors, performance slowdowns, or delays in notification delivery. If applicable, compile a list of all affected customers and share it with your support team for follow-up actions. Including any customer feedback or complaints received during the incident would also be helpful and provide context on user experience.

Step 5: Root Cause Analysis

After thoroughly understanding the incident's timeline and impact, proceed to the Root Cause Analysis (RCA) to explore the contributing factors, recognizing that complex systems often fail due to a combination of interacting elements rather than a single cause. Begin by reviewing the monitoring data of affected services, looking for irregularities such as sudden spikes or flatlining around the time of the incident. Include relevant queries, commands, graphs, or links from monitoring tools to illustrate the data collection process. If monitoring for this service is lacking, list the development of such monitoring as an action item in your postmortem. Next, identify the underlying causes by examining why the system's design allowed the incident, investigating past design decisions, and determining if they were part of a larger trend or a specific issue. Evaluate the processes, considering if collaboration, communication, and work reviews contributed to the incident, and use this stage to improve the incident response process. Summarize your findings in the postmortem, ensuring thorough documentation for a productive discussion during the postmortem meeting while remaining open to additional insights that may emerge.

ilert feature showcase: create postmortem with the help of AI — Generating postmortem using ilert AI

Step 6: Prepare Action Items

‍

Now, it's crucial to determine steps to prevent similar issues in the future. While it may not always be feasible to completely eliminate the possibility of such incidents, focus on improving detection and mitigation measures for future events. This involves enhancing monitoring and alerting systems and developing strategies to reduce the severity or duration of incidents.

‍

Create tickets for all proposed actions in your task management tool, ensuring each ticket includes sufficient context and a proposed direction. This will help the product owner prioritize the task and enable the assignee to carry it out efficiently. Each action item should be specific and actionable.

‍

If any proposed actions require further discussion, add them to the postmortem meeting agenda. These could be proposals needing team validation or clarification. Discussing these items in the meeting will help determine the best course of action.

‍