All-in-one Incident Management Platform

Manage on-call, respond to incidents and communicate them via status pages using a single application.

Trusted by leading companies

Highlights

The features you need to operate always-on services

Every feature in ilert is built to help you respond to incidents faster and increase uptime.

Harness the power of generative AI

Enhance incident communication and streamline post-mortem creation with ilert AI. ilert AI helps your business respond faster to incidents.

Integrations

Deploy in minutes with 100+ ready-to-use integrations

ilert seamlessly connects with your tools using our pre-built integrations or via email. ilert integrates with monitoring, ticketing, chat, and collaboration tools.

Customers

See how industry leaders achieve 99.9% uptime with ilert

Organizations worldwide trust ilert to streamline incident management, enhance reliability, and minimize downtime. Read what our customers have to say about their experience with our platform.

Stay up to date

Expert insights from our blog

Product

Postmortem Template to Optimize Your Incident Response

Discover key elements of a postmortem template and get a free download to improve incident response—even without an incident management platform.

Marko Simon
Apr 01, 2025 • 5 min read
Download postmortem template

A postmortem template is a structured tool for documenting incidents, understanding their causes, and learning how to prevent them in the future. This article explains the essential elements of an effective postmortem and how ilert can streamline this process, making your incident response more efficient. It also offers a downloadable postmortem template that you can use if your organization does not yet use an incident management platform.

Key takeaways

  • Postmortem templates turn incidents into valuable learning opportunities, helping teams identify vulnerabilities and improve future responses.
  • Postmortems serve two purposes: driving improvements within the team and communicating with external stakeholders.
  • Key elements of an effective postmortem include an incident timeline, impact and mitigation details, and a root cause analysis for continuous improvement.
  • ilert streamlines the postmortem process by automating data collection and promoting a blameless culture that focuses on learning rather than assigning fault.

The importance of an incident postmortem in incident management

Postmortems are more than just documents; they’re blueprints for turning incidents into invaluable learning opportunities. Documenting incidents in a structured manner helps pinpoint system vulnerabilities and enhance your team’s future responses. This method not only resolves current issues but also serves as a crucial reference for managing future incidents effectively.

Consider the chaos of an incident: systems failing, users affected, and the clock ticking. When the dust settles, a well-crafted postmortem template helps you make sense of the madness. It provides a clear, step-by-step account of what happened, why it happened, and how to prevent it from happening again. Such a structured approach transforms a negative event into a positive learning experience.

Moreover, having a consistent incident postmortem process ensures that every incident is analyzed comprehensively. This consistency helps teams identify patterns and recurring issues, leading to more effective and proactive incident management.

Key elements of an effective postmortem template

Creating an effective postmortem template starts with a clear title and introduction that summarizes the incident. This sets the stage for anyone reading the document, providing immediate context.

Following this is the incident timeline—a chronological account of events leading up to and during the incident, complete with timestamps. This section is crucial for understanding the sequence of events and identifying contributing factors and potential triggers.

The impact and mitigation section is another critical component. Here, you detail the effects of the incident on users and describe the immediate corrective actions taken. This section helps teams understand the real-world implications of the incident and the effectiveness of their initial response.

Root cause analysis and lessons learned are the heart of any postmortem template. By identifying the root cause, teams can implement measures to prevent similar incidents in the future. Lessons learned provide valuable insights into what worked well and what didn’t, fostering a culture of continuous improvement.

Using a consistent format in postmortem documentation facilitates thorough analysis and more effective incident management. Regularly updating the template based on feedback and outcomes from previous postmortems further enhances its effectiveness. Ultimately, an effective postmortem template is not just a document; it’s a dynamic tool for continuous learning and improvement.
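The elements described above can be sketched as a small data structure. This is only an illustrative outline, not ilert's data model: the class names (`Postmortem`, `TimelineEntry`) and the `missing_sections` helper are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TimelineEntry:
    timestamp: str  # zero-padded "HH:MM:SS", so string order is chronological
    event: str


@dataclass
class Postmortem:
    """Hypothetical container for the template sections discussed above."""
    title: str
    summary: str
    timeline: List[TimelineEntry] = field(default_factory=list)
    impact: str = ""
    mitigation: str = ""
    root_cause: str = ""
    lessons_learned: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "timeline": self.timeline,
            "impact": self.impact,
            "mitigation": self.mitigation,
            "root_cause": self.root_cause,
            "lessons_learned": self.lessons_learned,
        }
        return [name for name, value in sections.items() if not value]

    def to_markdown(self) -> str:
        """Render the postmortem as a simple markdown document."""
        lines = [f"# {self.title}", "", self.summary, "", "## Incident Timeline"]
        # Sort entries so the timeline always reads chronologically.
        for entry in sorted(self.timeline, key=lambda e: e.timestamp):
            lines.append(f"- **{entry.timestamp}**: {entry.event}")
        lines += ["", "## Impact and Mitigation", self.impact, self.mitigation]
        lines += ["", "## Root Cause Analysis", self.root_cause]
        lines += ["", "## Lessons Learned"]
        lines += [f"- {lesson}" for lesson in self.lessons_learned]
        return "\n".join(lines)


pm = Postmortem(title="Partial outage", summary="Some websites were unreachable.")
pm.timeline.append(TimelineEntry("14:27:06", "Alert accepted"))
pm.timeline.append(TimelineEntry("14:26:24", "Alert received"))
print(pm.missing_sections())
print(pm.to_markdown())
```

A check like `missing_sections` is one way to enforce a consistent format: a postmortem is not considered complete until every section has content.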

ilert's built-in postmortem feature

ilert takes the hassle out of creating postmortem documents. It automatically gathers data from various incident-related communications and status updates, making the documentation process seamless. This feature is a lifesaver when you’re dealing with the aftermath of an incident and need to focus on analysis rather than data collection.

Integration with chat tools like Slack and Microsoft Teams further streamlines the process. ilert can automatically compile alerts triggered during incidents and include relevant messages from linked channels. This means you don’t have to manually sift through endless chat logs to find pertinent information.

Once the document is generated, its status transitions to “created,” and users can view a simplified markdown version or access the raw text file for further adjustments. This flexibility allows teams to fine-tune the document before sharing it with stakeholders, ensuring that it meets all requirements and provides valuable insights into the development process.

Moreover, ilert allows you to link postmortems to specific incidents and publish them on all relevant status pages. This ensures everyone is aligned and has access to the postmortem report. By making the postmortem process more efficient, ilert helps teams concentrate on identifying root causes and areas for improvement.

Example incident and postmortem document creation with ilert

Let's imagine the following incident scenario to show you ilert in action and help you better understand the structure of the postmortem process.

Incident scenario

Company XY is a website hosting service that uses a cloud provider to host and deliver its customers' websites. The company is notified about any incidents on the cloud provider's side.

In the late afternoon, several alerts were created in ilert signaling unreachable customer websites. About half of the customers were impacted. The responder, Gregory, escalated the issue by creating an incident and setting its status to "Investigating." This was immediately reflected on the status page. After the cause of the problem was identified, the status was changed to "Identified" to keep users informed. Later, Francesca joined, got information from the provider, and set the status to "Monitoring." After about 1.5 hours, the incident was resolved, and Francesca set the status to "Resolved."

(By the way, if you are unsure about the difference between alerts and incidents, we have a dedicated article. In short, alerts are technical signals from monitoring tools, while incidents are disruptions that impact users and must be communicated.)

The illustrations below show the whole process vividly.

[Figure: The team receives alerts and communicates via the ilert incident management platform]
[Figure: An incident is created in ilert]
[Figure: The incident is resolved]
[Figure: Automatic postmortem generation with ilert AI]
[Figure: A preview of the postmortem document created with ilert AI]

Automatic postmortem creation

After the dust had settled, engineers created a postmortem report. ilert reviewed all available information, including alert details, logs, messages, and status updates, and prepared a clear, structured postmortem document.

All postmortems are saved in ilert. However, users can also download them or save them as plain text.

# [00000 Partial data center outage causing some websites to be down.](https://test.ilert.com/incidents/view?id=000)
Generated by Francesca Sala on 18.03.2025 17:40.
All timestamps are local to Europe/Berlin.

# Post-Mortem Document

## Incident Timeline

### March 18, 2025
- **14:26:24.109Z**: Received event from alert source indicating website thernos.com is down.
- **14:26:25.426Z**: Francesca Sala notified via email.
- **14:26:25.437Z**: Gregory George notified via email.
- **14:26:24.129Z**: Assigned to Gregory George.
- **14:27:06.664Z**: Accepted by Gregory George.
- **14:33:52.317Z**: Gregory George linked incident 'Partial data center outage causing some websites to be down' to this alert.
- **14:36:46.682Z**: Gregory George changed linked incident status to Identified.
- **14:59:00.145Z**: Gregory George added a comment regarding an email from Thernos asking for an estimate on website restoration.
- **15:00:28.502Z**: Francesca Sala added a comment indicating the provider is restarting affected regions.
- **15:09:21.785Z**: Francesca Sala changed linked incident status to Monitoring.
- **16:03:51.741Z**: Francesca Sala changed linked incident status to Resolved.
- **16:06:36.737Z**: Francesca Sala added a comment indicating the incident is resolved and the website is online again.
- **16:06:36.737Z**: Incident resolved by Francesca Sala.

### March 18, 2025 (Additional Alerts)
- **14:26:30.692Z**: Received event from alert source indicating website akisp.com is down.
- **14:26:31.884Z**: Francesca Sala notified via email.
- **14:26:31.887Z**: Gregory George notified via email.
- **14:26:30.705Z**: Assigned to Gregory George.
- **14:27:06.640Z**: Accepted by Gregory George.
- **14:33:48.699Z**: Gregory George linked incident 'Partial data center outage causing some websites to be down' to this alert.
- **14:36:46.699Z**: Gregory George changed linked incident status to Identified.
- **15:09:21.813Z**: Francesca Sala changed linked incident status to Monitoring.
- **16:03:51.770Z**: Francesca Sala changed linked incident status to Resolved.
- **16:06:36.524Z**: Francesca Sala added a comment indicating the incident is resolved and the website is online again.
- **16:06:36.524Z**: Incident resolved by Francesca Sala.

### March 18, 2025 (Additional Alerts)
- **14:26:36.713Z**: Received event from alert source indicating website kontore.com is down.
- **14:26:37.916Z**: Gregory George notified via email.
- **14:26:37.923Z**: Francesca Sala notified via email.
- **14:26:36.737Z**: Assigned to Gregory George.
- **14:27:06.602Z**: Accepted by Gregory George.
- **14:33:08.523Z**: Gregory George linked incident 'Partial data center outage causing some websites to be down' to this alert.
- **14:36:46.716Z**: Gregory George changed linked incident status to Identified.
- **15:09:21.837Z**: Francesca Sala changed linked incident status to Monitoring.
- **16:03:51.802Z**: Francesca Sala changed linked incident status to Resolved.
- **16:06:36.209Z**: Francesca Sala added a comment indicating the incident is resolved and the website is online again.
- **16:06:36.209Z**: Incident resolved by Francesca Sala.

## Impact

The incident caused a partial outage in one of our data centers, affecting the availability of several customer websites, including Thernos, Akisp, and Kontore. Approximately half of our hosted sites were down, leading to customer inquiries and potential business disruptions. The affected websites experienced degraded performance and were unreachable for a period of time, causing inconvenience to users and potentially impacting business operations for the affected customers.

## Root Cause Analysis

The root cause of the incident was identified as an issue with our data center provider. The provider experienced an outage in one of their data centers, which led to the unavailability of several hosted websites. The provider worked on resolving the issue by restarting the affected regions, which eventually restored the services.

## Action Items

1. **Monitoring Provider Status**: Francesca Sala will continue to monitor the cloud provider's status page for updates during incidents.
2. **Customer Communication**: Gregory George will draft and update the status page to keep customers informed during incidents.
3. **Incident Documentation**: Francesca Sala will create and share a post-mortem document after the incident is resolved.

This post-mortem document provides a detailed account of the incident, its impact, root cause, and the actions taken to prevent recurrence.
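As a quick cross-check, the timestamps in the timeline above can be turned into durations. The sketch below copies three timestamps from the first alert's timeline and computes how long it took to open the incident and how long the incident lasted. The `elapsed` helper is an assumption for this example, and it only works for events on the same day.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M:%S.%f"


def elapsed(start: str, end: str) -> timedelta:
    """Difference between two same-day timestamps in HH:MM:SS.mmm form."""
    return datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)


# Values copied from the timeline above (trailing "Z" dropped).
first_alert = "14:26:24.109"      # first event received
incident_linked = "14:33:52.317"  # incident linked to the alert
resolved = "16:03:51.741"         # linked incident set to Resolved

time_to_incident = elapsed(first_alert, incident_linked)
incident_duration = elapsed(incident_linked, resolved)

print("Alert to incident:", time_to_incident)
print("Incident duration:", incident_duration)
```

The incident duration comes out just under an hour and a half, which matches the roughly 1.5 hours mentioned in the scenario.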

Use ilert or download a postmortem template and fill in manually

Based on this example, we prepared a Google Docs template that you can use if you are not yet using the ilert incident management platform. Assembling and writing all the information manually is more time-consuming, but it is still a first step toward organizing post-incident learnings and preparing for the next challenge.

Download a postmortem template.

A few words on blameless postmortems and blameless culture

A blameless postmortem focuses on collective learning and improvement rather than assigning fault to individuals. This approach fosters a supportive work environment and encourages team members to be honest and open during the postmortem process. Instead of pointing fingers, the focus is on understanding what happened and how to prevent it in the future.

Asking "what" and "how" questions instead of "who" during postmortem meetings helps analyze incidents without attributing blame. This promotes a growth mindset and fosters a culture of continuous improvement. A "no argument" policy during discussions ensures the focus remains on process improvement rather than assigning blame.

Utilizing data-driven insights, ilert AI provides unbiased evaluations of incidents, eliminating personal biases in reporting. This also helps create a blameless culture where the ultimate goal is to learn from incidents and improve future responses rather than playing the blame game.

Common pitfalls to avoid in postmortem document creation

To maximize the value of your postmortems, avoid these key pitfalls—ranked by their impact on long-term learning and operational resilience:

Not analyzing patterns across incidents

  • Treating each incident in isolation can hide recurring issues.
  • Regularly review multiple postmortems to detect patterns, systemic weaknesses, or process gaps.
  • Use this insight to inform broader improvements and prevent similar incidents in the future.
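One lightweight way to look for patterns is to tag each postmortem with its contributing factors and count how often each tag recurs. The sketch below assumes postmortems are available as simple records with a `tags` list. The record layout, the IDs, and the tag names are hypothetical and not an ilert export format.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical postmortem records; in practice these would come from your own archive.
postmortems: List[Dict] = [
    {"id": "PM-1", "tags": ["provider-outage", "missing-runbook"]},
    {"id": "PM-2", "tags": ["bad-deploy"]},
    {"id": "PM-3", "tags": ["provider-outage"]},
    {"id": "PM-4", "tags": ["provider-outage", "bad-deploy"]},
]


def recurring_factors(records: List[Dict], min_count: int = 2) -> List[Tuple[str, int]]:
    """Return (tag, count) pairs for tags seen in at least min_count postmortems."""
    # set() ensures a tag is counted once per postmortem even if listed twice.
    counts = Counter(tag for record in records for tag in set(record["tags"]))
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


for tag, count in recurring_factors(postmortems):
    print(f"{tag}: appeared in {count} postmortems")
```

Tags that keep reappearing are candidates for broader fixes rather than another one-off action item.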

Failure to follow up on action items

  • Insight is meaningless without execution. If postmortem action items aren’t completed, incidents are likely to repeat.
  • Always assign owners and due dates, and track completion progress.
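Tracking owners and due dates does not require special tooling. A minimal sketch, assuming action items are kept as simple records, might look like the following. The `ActionItem` class, the team names, and the dates are illustrative only.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


def completion_rate(items: List[ActionItem]) -> float:
    """Share of completed items; an empty list counts as fully complete."""
    if not items:
        return 1.0
    return sum(item.done for item in items) / len(items)


# Hypothetical action items from a postmortem.
items = [
    ActionItem("Subscribe to provider status feed", "Team A", date(2025, 3, 25), done=True),
    ActionItem("Draft status page message templates", "Team B", date(2025, 3, 28)),
    ActionItem("Review failover runbook", "Team A", date(2025, 4, 15)),
]

for item in overdue(items, today=date(2025, 4, 1)):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.due})")
print(f"Completion rate: {completion_rate(items):.0%}")
```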

Using a generic template

  • A one-size-fits-all postmortem template may omit crucial incident-specific details.
  • Customize templates to include everything relevant—like timeline, impact, contributing factors, and remediation steps.

Lack of a blameless culture

  • If people feel blamed, they’re less likely to share honestly.
  • Promote a culture of psychological safety and learning, not punishment.

Vague or unconstructive feedback

  • Feedback that lacks clarity or actionability won’t lead to meaningful change.
  • Encourage specific, constructive feedback that points to clear improvements.

Poor stakeholder communication

  • Not sharing postmortems with key stakeholders reduces organizational learning.
  • Proactively circulate findings to relevant teams, leadership, and other affected parties to keep everyone aligned.

Summary

Postmortem templates are essential tools for transforming incidents into learning opportunities. By documenting incidents in a structured manner, teams can identify system vulnerabilities, improve future responses, and foster a culture of continuous improvement. ilert’s built-in features and AI enhancements make the postmortem process seamless and efficient, allowing teams to focus on what really matters.

Implementing a formal postmortem process and avoiding common pitfalls ensures that every incident becomes a stepping stone toward success. By embracing a blameless culture, teams can learn from their experiences and drive better outcomes. Remember, the ultimate goal is to turn every failure into an opportunity for growth and improvement.

Frequently Asked Questions

What is the purpose of using ilert AI in postmortem creation?

Using ilert AI for postmortem creation speeds up the final stage of incident response, letting you focus on evaluating the incident instead of spending ages on paperwork. It's all about getting to the good stuff quicker!

What happens after an incident reaches the "Resolved" state?

Once an incident hits the "Resolved" state, the team collects all the relevant details and documents everything discussed to ensure everyone is on the same page. ilert users skip the manual work and jump straight to the discussion and to executing action items.

What information does ilert AI consider when generating a postmortem document?

ilert AI generates a postmortem document by considering the incident's context, including history updates, Slack or Microsoft Teams messages, subscribers, services, involved users, and any linked alert details.

How can users include relevant messages from communication channels in their postmortem document?

You can easily add relevant messages to your postmortem by linking your Slack or Microsoft Teams channels, which the ilert bot will scan for you. Alternatively, copy and paste chat transcripts manually from anywhere you need.

Insights

Incident Response Management: A Category of Its Own

As Atlassian phases out Opsgenie, teams are rethinking incident response. Is IRM just a feature or a category of its own? This article explores that question, with insights from Opsgenie users migrating to ilert and a look at ilert’s vendor-neutral philosophy.

Birol Yildiz
Mar 28, 2025 • 5 min read

In recent weeks, I’ve spoken with several Opsgenie customers who are evaluating a migration to ilert after Atlassian’s decision to phase out Opsgenie and fold its functionality into other products. Atlassian is giving Opsgenie users “two options: move to Jira Service Management for robust end-to-end incident management, or move to Compass for alerting and on-call management.” This has raised a broader question in our industry: 

Is Incident Response Management (IRM) a standalone category or just a feature within larger platforms?

I want to reflect on that question and share why I firmly believe IRM remains a distinct, essential category—not merely a feature. I’ll highlight insights from those customer conversations and explain ilert’s vendor-neutral approach to integrations, which even led us to sunset our own uptime monitoring feature for the greater good of our ecosystem.

What Opsgenie’s transition taught us

First, let’s consider the insight from Opsgenie’s end-of-life. Along with PagerDuty, Opsgenie was a pioneer that helped build the incident response management category, so seeing it put on the shelf is bittersweet. Many of its users have expressed frustration that development stagnated as Atlassian integrated Opsgenie’s features into Jira Service Management (JSM). In fact, we have had customers switching to ilert way before Atlassian’s EOL announcement of Opsgenie, “citing Opsgenie’s stagnation as Atlassian folded its features into Jira Service Management.” 

This sentiment captures the crux of the issue: the all-in-one solution offered in JSM may include incident response features, but it can be cumbersome for teams that primarily need a nimble, real-time alerting and on-call management tool. Opsgenie’s fate illustrates the dilemma. Atlassian’s strategy treats incident management as a component of a broader suite (ITSM or a developer portal like Compass) rather than a product in itself. 

Opsgenie users I spoke with are weighing these Atlassian-provided paths, but many are also looking at dedicated IRM platforms because they feel something would be lost in translation if incident response became just another module inside a larger tool. Their intuition aligns with what we’ve long believed in the industry.

IRM: Feature or Standalone Platform?

It’s a fair question to ask: As adjacent software categories mature, could incident response simply become a feature of monitoring, observability, or ITSM platforms? After all, many monitoring tools now have alerting capabilities, and IT service platforms have incident modules. Atlassian’s move with Opsgenie is one prominent example of viewing IRM as a feature within a bigger product.

However, there’s a reason dedicated IRM platforms like ilert, PagerDuty and xMatters exist (and continue to thrive). The nature of incident response—bridging humans and complex systems under pressure—calls for a specialized focus. Treating IRM as just a checkbox feature risks oversimplifying what it does. The core value of an IRM platform is to act as the central dispatcher between people and systems during critical moments. This goes far beyond what a typical add-on feature can accomplish.

Let’s unpack that with an analogy: You wouldn’t consider “customer support” just a feature of your email service, even though you can technically manage support via email. Companies still invest in dedicated support platforms because specialization matters. Similarly, incident response has its own workflows and urgency that warrant a purpose-built solution.

Why Incident Response Management remains a distinct category

In my view, IRM stands as a distinct category for several key reasons:

  • Centralized alert dispatching: A true IRM platform serves as a hub for all critical alerts, regardless of source. It funnels signals from various monitoring, observability, and automation tools into one stream and ensures they reach the right people at the right time. This “single pane of glass” for incidents is difficult to achieve when incident management is scattered across different modules in different systems. Neither JSM nor Compass alone covers the need for a centralized alert dispatcher and incident management. By contrast, a dedicated IRM tool is built from the ground up to be that centralized dispatcher.
  • Specialized on-call and escalation workflows: IRM platforms provide rich capabilities like on-call scheduling, rotation management, multi-step escalations, automated stakeholder notifications, and postmortem tracking. These aren’t side features; they are the heart of the product. When incident response is a mere feature elsewhere, these capabilities often end up less flexible or buried behind other priorities. A distinct IRM system keeps the focus on minimizing response times and coordinating people efficiently during high-stress incidents—its entire roadmap revolves around these outcomes, not around broader IT processes or monitoring features.
  • Vendor-neutral integration hub: Perhaps one of the strongest arguments for IRM as its own category is integration breadth. Modern organizations typically use a heterogeneous set of tools: different monitoring systems (cloud provider monitors, application performance tools, etc.), logging and observability platforms, ITSM for ticketing, chat apps for collaboration, CI/CD pipelines, and more. An incident response platform needs to play nicely with all of them. If you rely on an incident feature inside one vendor’s platform, you might be limited in connecting to external tools. A standalone IRM platform is vendor-neutral by design, acting as a Switzerland that connects everything. For example, ilert deliberately does not compete with monitoring vendors; we focus on integrating with them. We even decided to discontinue our own built-in uptime monitoring feature so we could “maintain our vendor-neutral position” and avoid conflicts of interest with our monitoring partners. Being neutral ensures that the IRM system’s only goal is to reliably route alerts between all your systems and your people without bias toward where the data comes from.
  • Lightweight layer over existing tools: A dedicated IRM solution adds a thin but crucial layer on top of your existing infrastructure. It doesn’t replace your monitoring or your ticketing system. Instead, it makes them more effective by ensuring that alerts from the former get actionable response and by avoiding overload of the latter. In practice, many companies pair an IRM platform with their ITSM. For instance, you might continue managing incident records and compliance in ServiceNow but use ilert to handle the real-time paging and human coordination. The two systems complement each other: ServiceNow is excellent for structured ITIL workflows, while ilert serves as a dispatcher for critical alerts, integrating with over 100 monitoring, observability, ITSM and chat tools to trigger immediate action before a formal ticket is even filed. This kind of flexible orchestration is only possible when IRM is a separate, integrative layer rather than locked inside one of the tools.
  • Focus and innovation: Finally, keeping IRM as its own category fosters innovation. When a product’s sole mission is incident response, its team can iterate and improve on that problem faster than if incident features are just one item on a long list of priorities in a larger suite. The result is often more user-friendly on-call experiences, smarter alert routing (even leveraging AI for noise reduction or auto-remediation), and features like status pages and analytics that are deeply tuned to incident management needs. We’ve seen a wave of innovation from specialized IRM startups and platforms precisely because they are tackling this as a primary challenge, not a secondary feature.

Integration over competition: ilert’s vendor-neutral stance

One concrete example of treating IRM as a category is how we at ilert approach our product strategy. We believe an incident response platform should complement the rest of your toolchain, not compete with it. This philosophy is why we made the conscious choice to sunset our uptime monitoring offering. By stepping back from providing our own monitoring, we can fully embrace integrations with best-of-breed monitoring and observability tools used by our customers.

In our announcement about this change, we explained that discontinuing the feature allows us to maintain our vendor-neutral position for monitoring and avoid any potential conflicts of interest when engaging in partnerships with vendors of uptime monitoring software.

In other words, we never want ilert to favor one data source over another. Our job is to reliably route alerts from any source to the people who need to see them.

This vendor-neutral, integration-first approach has a big payoff for users: it means you can plug ilert into whatever systems you already have and trust that we’re focused solely on improving your incident response process. It’s the opposite of a walled garden. We’ve built 100+ integrations and even tailored our features to work hand-in-hand with systems like Jira, ServiceNow, Datadog, Amazon CloudWatch, Slack, Microsoft Teams, and so on. The feedback from former Opsgenie customers moving to ilert is that this openness and focus are exactly what they were looking for. They want their incident response platform to be an unbiased orchestrator, not pushing them to replace tools that already work well for them.

The IRM Platform as the Central Dispatcher

At its heart, an Incident Response Management platform is the central dispatcher between people and systems during an outage or critical event. Companies often have monitoring tools that detect issues and ticketing systems that record and assign work. But it’s the IRM platform that bridges the gap in real time, ensuring that when something breaks at 2 AM, the right on-call engineer’s phone rings, and the team can mobilize immediately. It coordinates humans (through alerts, escalations, and collaboration) in response to machine signals. 

This role is unique. If you try to handle it purely within a monitoring tool, you might get alerts out, but you miss the human workflow aspects (like escalations or communications across teams). If you try to handle it purely within an ITSM tool, you often sacrifice speed and simplicity (turning emergencies into tickets can introduce delay or bureaucracy).

The true measure of an IRM platform’s value is in how effectively it connects and accelerates your existing investments: your monitoring becomes more actionable, your on-call staff more effective, and your incident process more transparent. All of this happens without forcing you to change the tools you use for observability or ITSM. That’s why I see IRM as its own pillar in the tech stack—a mission control that sits alongside observability and ITSM, not inside them.
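To make the dispatcher idea more concrete, here is a minimal sketch of source-agnostic routing with a multi-step escalation policy. It is a simplified illustration and not how ilert is implemented. The `Dispatcher`, `EscalationStep`, and `Alert` names, the service names, and the timeouts are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EscalationStep:
    responder: str
    timeout_minutes: int  # how long to wait for an acknowledgement before escalating


@dataclass
class Alert:
    source: str   # e.g. a monitoring tool; deliberately not used for routing
    service: str
    summary: str


class Dispatcher:
    """Routes alerts by affected service, regardless of which tool raised them."""

    def __init__(self, policies: Dict[str, List[EscalationStep]]):
        self.policies = policies

    def responder_at(self, alert: Alert, minutes_unacknowledged: int) -> Optional[str]:
        """Who should be notified once the alert has gone unacknowledged this long."""
        steps = self.policies.get(alert.service, [])
        elapsed = 0
        for step in steps:
            elapsed += step.timeout_minutes
            if minutes_unacknowledged < elapsed:
                return step.responder
        # All steps exhausted: keep notifying the last responder, if any.
        return steps[-1].responder if steps else None


dispatcher = Dispatcher({
    "web-hosting": [
        EscalationStep("primary on-call", 5),
        EscalationStep("secondary on-call", 10),
    ],
})

alert = Alert(source="monitoring-tool-a", service="web-hosting", summary="Website down")
for minutes in (0, 5, 20):
    print(minutes, "min ->", dispatcher.responder_at(alert, minutes))
```

Because routing depends only on the affected service and not on the alert source, the same policy applies whichever monitoring tool raised the alert.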

Closing thoughts

The question of “category or feature?” is a healthy one to revisit as platforms evolve. In the case of incident response management, my experience and recent customer discussions reinforce that it remains a category in its own right. 

The stakes during incidents are too high, the integrations needed are too many, and the workflows are too specialized for IRM to be an afterthought or merely a line-item feature. Instead, we should view IRM platforms as complementary partners to our monitoring, DevOps, and ITSM tools, each doing what they do best.

For ilert, this means continuing a calm and focused pursuit of being the best dispatcher of trust between all the systems that detect problems and all the people who solve them. We’ll integrate, orchestrate, and stay vendor-neutral, so our users can confidently rely on a platform that puts incident response first.

In a world where everything from cloud services to ticketing systems is expanding in scope, there’s real value in something that deliberately stays specialized. Incident Response Management is this something—a standalone discipline and platform that ensures when things go wrong, they get fixed as fast as humanly (and technologically) possible.

Engineering

An ultimate step-by-step guide on Zabbix Cloud Monitoring

Learn how to set up Zabbix Cloud for AWS Auto-Discovery and receive critical alerts via SMS, phone calls, or push notifications.

Tim Nguyen Van
Mar 26, 2025 • 5 min read

During the last Zabbix Summit, the company presented a cloud version of its well-known monitoring platform. We at ilert see Zabbix growing steadily in popularity as more and more teams across the globe use it for their monitoring needs. To help users quickly adopt the new cloud version, we put together this guide.

Why Zabbix 

Maintaining the functionality and health of cloud infrastructure, such as servers, virtual machines, databases, containers, and apps, is essential for companies of different sizes. Zabbix Cloud Monitoring is an effective instrument for keeping an eye on all these resources across well-known cloud providers like Google Cloud Platform (GCP), Microsoft Azure, and Amazon Web Services (AWS).

Zabbix Cloud Monitoring gives businesses proactive alerting, automatic anomaly detection, and real-time insight into their cloud infrastructures. In contrast to conventional monitoring solutions, Zabbix combines agent-based and agentless techniques to monitor important performance indicators, identify problems before they become more serious, and guarantee peak system performance.

What this guide covers 

This step-by-step guide will help you:

  • Set up and configure Zabbix Cloud Monitoring for AWS Auto-Discovery;
  • Integrate cloud services using API-based monitoring for full visibility;
  • Create dashboards to proactively manage your infrastructure;
  • Receive critical Zabbix alerts via multiple channels, like SMS, phone calls, messenger, or push notifications with the help of ilert.

Prerequisites: What you will need to follow this guide

  • A registered account on Zabbix Cloud;
  • A Zabbix Cloud instance deployed and accessible via a web browser;
  • AWS Account with API Access;
  • IAM (Identity and Access Management) Policy: you need to create an IAM policy for the Zabbix role in your AWS account with the necessary permissions;
  • CloudWatch Metrics: Ensure that CloudWatch metrics are enabled for your AWS resources, such as EC2 instances, RDS databases, and S3 buckets, to provide monitoring data.

Stage 1: Creating an IAM Policy for Zabbix

1. In AWS, open the IAM service and click “Policies.”

2. On the top right corner, click “Create policy.”

3. Select “JSON” and add the following configuration to the policy editor.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "cloudwatch:Describe*",
                "cloudwatch:Get*",
                "cloudwatch:List*",
                "ec2:Describe*",
                "rds:Describe*",
                "s3:ListAllMyBuckets",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}
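Before pasting the policy into the editor, you can sanity-check it locally. The following is a minimal sketch in Python using only the standard library: it parses the JSON and lists any allowed action that does not look read-only. The helper name `non_read_only_actions` and the verb list in `READ_ONLY_VERBS` are assumptions made for this illustration and are not part of AWS or Zabbix.

```python
import json

# The policy from step 3, as a JSON string.
POLICY_JSON = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "cloudwatch:Describe*",
                "cloudwatch:Get*",
                "cloudwatch:List*",
                "ec2:Describe*",
                "rds:Describe*",
                "s3:ListAllMyBuckets",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}
"""

# Assumption for this sketch: operations starting with these verbs only read data.
READ_ONLY_VERBS = ("Describe", "Get", "List")


def non_read_only_actions(policy: dict) -> list[str]:
    """Return every allowed action whose operation name lacks a read-only verb prefix."""
    flagged = []
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):  # IAM also accepts a single string here
            actions = [actions]
        for action in actions:
            operation = action.split(":", 1)[-1]  # part after "service:"
            if not operation.startswith(READ_ONLY_VERBS):
                flagged.append(action)
    return flagged


policy = json.loads(POLICY_JSON)  # raises json.JSONDecodeError on a syntax error
print(non_read_only_actions(policy))
```

An empty list means every action in the policy starts with Describe, Get, or List, which matches the monitoring-only intent of this setup.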

4. Enter a new name for the policy and click “Create policy.”

5. Now, navigate to Users and click “Create user.”

6. Enter a user name and click “Next.”

7. Choose “Attach policies directly” in the Permission options and select the Zabbix policy.

8. Navigate to the created user and create a new access key.

9. Choose “Third-party service” and proceed to the next step.

10. An access key and a secret access key are created. You will need both in your Zabbix configuration.

Stage 2: Creating an AWS Discovery host in Zabbix Cloud

1. On the sidebar, navigate to “Data Collection” and select “Hosts.”

2. Click “Create host,” enter a name for your Host, select AWS by HTTP as a template, add a Host group, and click “Add.”

3. Now, reopen the newly created Host and navigate to “Macros.” Add the following Macros: {$AWS.ACCESS.KEY.ID}, {$AWS.REGION}, and {$AWS.SECRET.ACCESS.KEY}. Set their values to the Access key, the Region, and the Secret access key, respectively.
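If you prefer to script this step, the same three macros can be expressed as a Zabbix API request. The sketch below only builds the JSON-RPC body for `usermacro.create` and prints it; nothing is sent. The host ID, key values, and the helper name `build_macro_request` are placeholders for illustration. Whether your Zabbix Cloud instance exposes the API, and under which URL and authentication scheme, is something to verify in your own environment. The `type` value of 1 is what the Zabbix API documentation lists for secret text; check it against your version.

```python
import json


def build_macro_request(host_id: str, access_key_id: str, region: str,
                        secret_key: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 body for the Zabbix API method usermacro.create."""
    return {
        "jsonrpc": "2.0",
        "method": "usermacro.create",
        "params": [
            {"hostid": host_id, "macro": "{$AWS.ACCESS.KEY.ID}", "value": access_key_id},
            {"hostid": host_id, "macro": "{$AWS.REGION}", "value": region},
            # type 1 stores the value as secret text so it is masked in the UI
            {"hostid": host_id, "macro": "{$AWS.SECRET.ACCESS.KEY}",
             "value": secret_key, "type": 1},
        ],
        "id": request_id,
    }


# Placeholder values; replace with your host ID and the keys from Stage 1.
body = build_macro_request("10001", "AKIAEXAMPLE", "eu-central-1", "example-secret")
print(json.dumps(body, indent=2))
```

Keeping the secret access key in a secret-type macro avoids showing it in plain text to other Zabbix users who can open the host configuration.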

4. Open the “Hosts” section in the “Monitoring” tab; your new host now appears in the list.

5. By clicking “Latest Data,” you can now see all the latest data received from your AWS EC2 instance.

Zabbix Dashboards

Zabbix Dashboards provide an easy-to-use interface for monitoring your infrastructure, including cloud environments. They give you an extensive overview of key metrics in one location, including database performance, storage usage, server health, and cloud resources.

Using Zabbix Dashboards for infrastructure and cloud monitoring, you can keep track of your resources more effectively. Key features of the dashboards are:

  • Customizable layout
  • Real-time monitoring
  • Various widget types (graphs, availability, status, maps, etc.)

Configuring Monitoring Dashboards

After setting up auto-discovery for AWS resources and integrating your AWS environment with Zabbix Cloud Monitoring, you may create monitoring dashboards to get complete insight into your cloud architecture.

1. Navigate to Dashboards and click “Create dashboard.”

2. Add a name and choose the owner of the new dashboard.

3. You can now add various widgets like graphs, maps, charts, availability status, and more to your dashboard.

Triggers and media types in Zabbix Cloud

Triggers and media types are essential for proactive monitoring. They enable you to automatically identify problems with your Cloud infrastructure, such as high CPU usage, low disk space, or service outages, and promptly alert you when it's crucial.

What are Triggers?

Triggers in Zabbix are expressions that evaluate the data gathered from monitored items (such as CPU usage, memory usage, disk space, etc.). When a predefined threshold is reached or exceeded, a trigger is activated.

Trigger examples:

  • CPU Utilization: A trigger could be set up to alert if an EC2 instance’s CPU usage exceeds 85% for more than 5 minutes.
  • Disk Space Usage: A trigger could be set to notify if an EC2 instance’s disk usage exceeds 90%.
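To make the two examples concrete, here is a rough sketch of how they could look in Zabbix trigger expression syntax (the `function(/host/key,period)` form used since Zabbix 5.4). The host name and item keys are hypothetical; use the keys that your template actually provides. For the "more than 5 minutes" condition the sketch uses `min`, because the minimum over a window exceeds the threshold only if every value in that window does.

```python
def threshold_expression(host: str, item_key: str, function: str,
                         operator: str, threshold: float, period: str = "") -> str:
    """Format a Zabbix trigger expression such as min(/host/key,5m)>85."""
    args = f"/{host}/{item_key}"
    if period:
        args += f",{period}"
    return f"{function}({args}){operator}{threshold:g}"


# Hypothetical host and item keys, for illustration only.
cpu = threshold_expression("aws-ec2-host", "cpu.utilization", "min", ">", 85, "5m")
disk = threshold_expression("aws-ec2-host", "disk.used.percent", "last", ">", 90)

print(cpu)   # min(/aws-ec2-host/cpu.utilization,5m)>85
print(disk)  # last(/aws-ec2-host/disk.used.percent)>90
```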

Configuring Triggers

1. Navigate to “Monitoring,” then “Hosts.”

2. Select the host for which you want to create a trigger and click “Triggers” under the “Configuration” section.

3. Now click “Create trigger.”

4. In this example, we’ll configure a CPU usage trigger.

5. After entering the trigger name, add the expression. In this example, the trigger fires with severity “High” when CPU usage rises above 85% and recovers once CPU usage falls below 80%.
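Using a lower recovery threshold than the problem threshold prevents the trigger from flapping when CPU usage hovers around 85%. The following simplified simulation in Python illustrates the idea. It looks only at the latest sample and is not how Zabbix evaluates expressions internally; the function name and sample values are made up for this example.

```python
def trigger_states(samples, problem_above=85.0, recover_below=80.0):
    """Yield "PROBLEM" or "OK" per sample, with separate fire and recovery thresholds."""
    in_problem = False
    for value in samples:
        if not in_problem and value > problem_above:
            in_problem = True    # problem condition met
        elif in_problem and value < recover_below:
            in_problem = False   # recovery condition met
        yield "PROBLEM" if in_problem else "OK"


cpu_samples = [70, 86, 83, 81, 79, 84, 90]
print(list(trigger_states(cpu_samples)))
# ['OK', 'PROBLEM', 'PROBLEM', 'PROBLEM', 'OK', 'OK', 'PROBLEM']
```

Note that the samples 83 and 81 keep the trigger in the problem state even though they are below 85, and 84 does not re-fire it after recovery.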

What are Media types?

In Zabbix, Media types are the delivery options for notifications that are sent when a trigger fires. A Media type specifies how and through which channel Zabbix sends notifications to users.

Zabbix supports a variety of media types, allowing you to customize alerting according to your preferences or requirements. Some common Media types include:

  • Email: Send notifications via email to alert users of any issues.
  • SMS: Send text messages (SMS) for mobile alerts.
  • Webhook: Trigger a custom action or integrate with third-party systems via webhooks.
  • Third-party integrations: Use external services or platforms, such as ilert, to route alerts to specific teams or applications, ensuring a smooth integration into your existing incident management processes.

Stage 3: Connect Zabbix with ilert using the ilert Media type

To connect Zabbix with ilert, create a new User in Zabbix and add ilert as a Media type.

Add the Integration key of your Zabbix alert source into the Send to field.

For further information, please refer to ilert's Zabbix Integration Guide.
