Post-Incident Reviews

The end of an incident should be the beginning of learning. ilert's post-incident analysis and reporting tools enable your team to learn from every incident. Comprehensive timelines, response details gathered from chat channels, and resolution times facilitate a deep understanding of areas for improvement. Utilize templated post-mortem reports to share key findings and transform every challenge into an opportunity for growth.

Why conduct Post-Incident Reviews (Post-Mortems)

What are Post-mortems?

A postmortem, or post-incident review, is a blameless analysis conducted after an incident to gain a thorough understanding of what went wrong, why it occurred, and how to prevent its recurrence.

During an incident, the team focuses entirely on restoring service; postmortems provide a platform to evaluate actions and strategies after service has been restored.

They allow us to identify strengths, areas of improvement, and strategies to avoid repeated mistakes in the future.

Conducting a postmortem is not a penalty; it's a collaborative process that involves all responders. While the tech team may lead the analysis, the process's ownership lies with a designated individual, ensuring accountability and driving the postmortem to completion.

A postmortem should be conducted after every significant incident, even if the issue was quickly resolved without intervention. The ideal time for a postmortem is soon after the incident while the event's details are still fresh. It serves as the final step of the incident response process, and any delay can hinder critical learning.

By championing a culture of learning and improvement through postmortems, organizations can enhance their infrastructure and incident response process, ensuring they're better equipped for future incidents.

Postmortem Preparation Steps

1. Assign a Responder Owner and set up a meeting

After the resolution of a major incident, the Incident Response Lead promptly assigns one of the responders to oversee the postmortem process. Although the task of writing the postmortem is a collective effort, having a designated owner is crucial for its effective completion.

The postmortem owner is entrusted with several responsibilities, including:

  • Scheduling the postmortem meeting
  • Investigating the incident (drawing in the necessary expertise from other teams as required)
  • Updating the postmortem document
  • Creating follow-up action items for preventing a similar occurrence in the future.

To facilitate comprehensive analysis and ensure all perspectives are considered, the postmortem meeting should include the following participants:

  • The Incident Response Lead
  • Owners of the services involved in the incident
  • Key engineers/responders who were involved in resolving the incident
  • Engineering and Product Managers for the impacted systems.

Including these stakeholders encourages a holistic examination of the incident and leads to more robust preventive measures.

2. What happened? Incident Timeline and Impact

After preparing for the postmortem, the next step is to construct a comprehensive timeline of the incident and document its impact.

3. Building the Timeline

Focus on documenting the sequence of events, avoiding any interpretation or judgment about the incident's causes. The timeline should start before the incident's onset and continue through to its resolution, and include significant changes in status or impact, as well as key actions taken by responders.

Review the incident log in your communication tool (e.g., Slack or Microsoft Teams) for crucial decisions and actions. Also include information the team lacked during the incident that, in hindsight, would have been helpful. You can find this information in the monitoring data, logs, and deployment history of the affected services.
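As an illustration only, the merging step can be sketched in Python. The sketch assumes you have already exported events from your chat tool, monitoring, and deployment history into simple records; the `TimelineEvent` type, the function names, and the sample events are all hypothetical and not part of any ilert API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    timestamp: datetime  # when the event happened (UTC)
    source: str          # hypothetical labels: "chat", "monitoring", "deploy"
    description: str     # factual statement, no interpretation


def build_timeline(*sources):
    """Merge events from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


def format_timeline(events):
    """Render each event as 'HH:MM UTC [source] description'."""
    return [
        f"{e.timestamp:%H:%M} UTC [{e.source}] {e.description}"
        for e in events
    ]


def utc(hour, minute):
    # Hypothetical incident date, used only for this example.
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)


# Sample data (invented for illustration).
chat = [
    TimelineEvent(utc(10, 12), "chat", "Responder acknowledges the alert"),
    TimelineEvent(utc(10, 40), "chat", "Rollback decided"),
]
monitoring = [
    TimelineEvent(utc(10, 5), "monitoring", "API error rate rises above 5%"),
]
deploys = [
    TimelineEvent(utc(9, 58), "deploy", "Release deployed to production"),
    TimelineEvent(utc(10, 47), "deploy", "Rollback completed"),
]

timeline = build_timeline(chat, monitoring, deploys)
for line in format_timeline(timeline):
    print(line)
```

Because the deployment event precedes the first alert, the merged timeline starts before the incident's onset, as recommended above. Keeping the `source` label on every entry makes it easy to trace each line back to the place it was recorded.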

4. Documenting the Impact

Record the impact from multiple perspectives. Detail the duration of the visible impact, the number of customers affected, the number of customers that reported the incident, and the severity of the functional impact.

Quantify the impact using a business metric specific to your product, for instance the increase in API errors, degraded performance, or delayed notification delivery. If necessary, provide a list of all impacted customers to your support team for further action.
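A minimal sketch of how these figures could be captured in one record follows. The `ImpactSummary` type, its field names, and the sample numbers are assumptions made for illustration; adapt them to the metrics your product actually tracks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ImpactSummary:
    impact_start: datetime      # first moment customers could see the impact
    impact_end: datetime        # moment the visible impact ended
    customers_affected: int
    customers_reporting: int    # customers who reported the incident
    total_customers: int
    severity: str               # hypothetical label, e.g. "degraded"

    @property
    def duration(self) -> timedelta:
        """Duration of the visible impact."""
        return self.impact_end - self.impact_start

    @property
    def affected_share(self) -> float:
        """Fraction of all customers who were affected."""
        if self.total_customers == 0:
            return 0.0
        return self.customers_affected / self.total_customers

    @property
    def reporting_share(self) -> float:
        """Fraction of affected customers who reported the incident."""
        if self.customers_affected == 0:
            return 0.0
        return self.customers_reporting / self.customers_affected


# Sample values (invented for illustration).
impact = ImpactSummary(
    impact_start=datetime(2024, 1, 15, 10, 5, tzinfo=timezone.utc),
    impact_end=datetime(2024, 1, 15, 10, 47, tzinfo=timezone.utc),
    customers_affected=120,
    customers_reporting=6,
    total_customers=2400,
    severity="degraded",
)

minutes = impact.duration.total_seconds() / 60
print(f"Visible impact: {minutes:.0f} minutes")
print(f"Customers affected: {impact.affected_share:.1%}")
print(f"Affected customers who reported: {impact.reporting_share:.1%}")
```

Comparing the number of customers affected with the number who reported the incident shows how much of the impact would have gone unnoticed if you relied on customer reports alone.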

Remember, the goal here is to create an objective, factual record of the incident and its impact. Avoid jumping to conclusions or assigning blame; these steps are purely observational and informational.

5. Root Cause Analysis

Once you have a thorough understanding of the incident's timeline and impact, you'll move on to the Root Cause Analysis (RCA). The purpose of this stage is to explore the contributing factors that led to the incident, bearing in mind that complex systems typically fail not because of a single root cause but because of a combination of interacting factors.

Monitoring Review

  • Begin the analysis by examining the monitoring of the affected services. Look for irregularities, such as sudden spikes or flatlining, both at the onset of the incident and in the period leading up to it.
  • Include relevant queries, commands, graph images, or links from monitoring tools to demonstrate how the data was gathered.
  • If monitoring for this service or behavior is absent, include the development of such monitoring as an action item in your postmortem.
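If your monitoring tool lets you export a metric as a plain series of samples, a rough first pass for spikes and flatlines could look like the sketch below. It is not a feature of any particular monitoring product. It uses a simple z-score against a pre-incident baseline, and the function names, the threshold of 3, and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def find_spikes(values, baseline_size, threshold=3.0):
    """Return indexes after the baseline whose value lies more than
    `threshold` standard deviations above the baseline mean.

    `baseline_size` must be at least 1.
    """
    baseline = values[:baseline_size]
    mu = mean(baseline)
    sigma = pstdev(baseline)
    later = range(baseline_size, len(values))
    if sigma == 0:
        # A perfectly flat baseline: any value above it counts as a spike.
        return [i for i in later if values[i] > mu]
    return [i for i in later if (values[i] - mu) / sigma > threshold]


def find_flatlines(values, min_length=3):
    """Return (start, end) index pairs of runs where the value stays
    identical for at least `min_length` consecutive samples."""
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = i
    return runs


# Sample per-minute series (invented for illustration).
error_rate = [1.0, 1.2, 0.9, 1.1, 1.0, 6.5, 7.2, 1.1]
requests_per_minute = [40, 42, 41, 0, 0, 0, 0, 39]

spikes = find_spikes(error_rate, baseline_size=5)
flatlines = find_flatlines(requests_per_minute)
print("Spike at sample indexes:", spikes)
print("Flatline runs (start, end):", flatlines)
```

A check like this only points you at the time ranges worth inspecting. The graphs, queries, and links from your monitoring tool remain the evidence to include in the postmortem.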

Identifying Underlying Causes:

  • Once the immediate causes are understood, examine why the system's design allowed such an incident to happen.
  • Investigate past design decisions, and examine whether they were part of a larger trend or a specific bug or issue.

Evaluation of Process:

  • Consider if the way people collaborated, communicated, and reviewed work contributed to the incident.

This stage is also an opportunity to evaluate and improve the incident response process itself.

Summary of Findings:

  • Write a summary of your findings in the postmortem.

Pre-work and documentation are essential to ensure a productive discussion during the postmortem meeting, although additional insights may emerge during the conversation.

Remember, the ultimate goal of the RCA is to uncover the multiple interacting elements that led to the failure and to inform preventative measures for the future.

6. Create Action Items

After determining the causes of the incident, you need to decide what steps should be taken to prevent similar issues from recurring. Although it may not always be feasible or worthwhile to entirely eliminate the possibility of such incidents, it's essential to consider improving detection and mitigation measures for future events. This includes better monitoring and alerting systems and strategies to reduce the severity or duration of incidents.

Create tickets for all proposed actions in your task management tool. Make sure to provide sufficient context and proposed direction for each ticket, so the product owner can prioritize the task and the assignee can carry it out efficiently. Each action item should be actionable and specific.
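One way to check that an action item carries enough context before it becomes a ticket is sketched below. The `ActionItem` fields, the helper functions, and the incident reference are hypothetical and do not correspond to the schema of any specific task management tool.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    title: str                # specific, actionable summary
    context: str              # what happened and why this item matters
    proposed_direction: str   # suggested approach for the assignee
    assignee: str = ""        # may be filled in later by the product owner


def missing_fields(item):
    """Return the names of required fields that are empty or blank."""
    required = ("title", "context", "proposed_direction")
    return [name for name in required if not getattr(item, name).strip()]


def ticket_body(item, incident_ref):
    """Build a plain-text ticket description from an action item."""
    return (
        f"Follow-up from postmortem: {incident_ref}\n\n"
        f"Context:\n{item.context}\n\n"
        f"Proposed direction:\n{item.proposed_direction}"
    )


# Sample items (invented for illustration).
complete = ActionItem(
    title="Add alert for queue depth on the notification service",
    context="Queue depth grew for 20 minutes before any alert fired.",
    proposed_direction="Alert when depth stays above the agreed limit for 5 minutes.",
)
vague = ActionItem(title="Improve monitoring", context="", proposed_direction=" ")

print(missing_fields(complete))
print(missing_fields(vague))
print(ticket_body(complete, "INC-EXAMPLE"))
```

An item such as "Improve monitoring" with no context or direction fails the check, which is a prompt to rewrite it or to raise it in the postmortem meeting before a ticket is created.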

If any proposed actions require further discussion before creating tickets, add these items to the postmortem meeting agenda. These could be proposals needing team validation or clarification. Discussing these in the meeting will help decide the best course of action.
