AI-first technology for modern teams with fast response times
ilert is the AI-first incident management platform, designed from the ground up as a single application that covers the entire incident response lifecycle.
Share your scheduling needs in a simple, chat-like interface. Add team members, rotation rules, and timeframes — and get a ready-to-use on-call calendar everyone can access.
Let AI take the call
Introducing the ilert AI Voice Agent—your first responder for calls, gathering key details and informing your on-call engineers.
Status updates in no time
ilert AI analyzes your system and incidents, offering quick updates and managing communications for efficient issue resolution.
ilert Responder – your real-time incident advisor
ilert Responder is an intelligent agent that analyzes incidents in real time. It connects to your observability stack, investigates alerts across systems, and surfaces actionable insights, without taking control away from your team.
Features
Analyze logs, metrics, and recent changes autonomously
Identify root causes and similar past incidents
Suggest responders, rollback paths, or related services
Ask questions in natural language and get direct, evidence-backed answers
Integrations
Get started immediately using our integrations
ilert seamlessly connects with your tools using our pre-built integrations or via email. ilert integrates with monitoring, ticketing, chat, and collaboration tools.
See how industry leaders achieve 99.9% uptime with ilert
Organizations worldwide trust ilert to streamline incident management, enhance reliability, and minimize downtime. Read what our customers have to say about their experience with our platform.
As you know, we've introduced a major update in recent months – ilert Responder – the AI Agent that helps you run root cause analysis during incidents and provides recommendations toward faster resolution. That's not all: there are many more powerful features to share with you. Feel free to reach out to us via chat or at support@ilert.com if you have questions or if you want to propose a feature or improvement.
New Alert view: Built for real-time collaboration and AI assistance
To better support real-time collaboration and prepare for the next round of AI features, we introduced a revamped alert view. It is divided into collapsible sections, so you can open only those that matter to you at the moment. By default, the platform automatically opens the ones most likely to be relevant. Apart from the long-familiar ‘Alert details,’ ‘Deployment events,’ and ‘Incident communications,’ you will notice the ‘Actions’ section with the list of recommendations from the ilert Responder and ‘Logs and data’ relevant to the received alert.
On the right side, you will see that the timeline now shares space with the chat, whose capabilities have also been significantly enhanced. You can use threads to keep communication clean, tag colleagues, and leave emojis. And, most importantly, you can communicate with ilert AI in the same environment by simply mentioning it via @. Moreover, ilert chat mirrors the communication happening in the war room in Microsoft Teams. This new view brings alerts, context, and collaboration into one place, helping teams make faster, better-informed decisions in the heat of an incident.
Event Flows: Smarter routing for incoming events
With Event Flows, ilert introduces a powerful and flexible way to process incoming events before they are converted into alerts. The feature allows you to build dynamic, rule-based workflows that determine how events are handled, routed, or filtered – all through a simple visual interface.
This makes Event Flows perfect for organizations that deal with a large volume of alerts or operate across multiple teams. Instead of manually managing routing rules across alert sources, you can centralize your logic in one reusable flow. Whether you want to send database-related events to your DB ops team, ignore low-severity alerts outside of business hours, or escalate critical alerts directly to on-call responders, Event Flows give you the tools to do just that.
At the core of every Event Flow is the Incoming Event block. You can connect it to one or multiple integrations or custom event sources using ilert's Event API. Once connected, you gain full control over how these events should behave. For example, you can add conditional branches that inspect event content, such as custom fields, labels, or summaries, and direct them down different paths depending on the logic you define.
You can also integrate Support hours checks into your workflows, ensuring that notifications respect team availability. If no conditions match, a default "else" path ensures that the event still continues downstream without being lost.
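The branching logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ilert's implementation or API: the `Event` class, the `route` function, the branch names, the `component` label, and the 09:00–17:00 weekday support hours are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, time


@dataclass
class Event:
    """Hypothetical incoming event; field names are illustrative."""
    summary: str
    severity: str = "low"
    labels: dict = field(default_factory=dict)


def in_support_hours(now: datetime, start: time = time(9, 0), end: time = time(17, 0)) -> bool:
    """Assumed support hours: Monday to Friday, start <= time < end."""
    return now.weekday() < 5 and start <= now.time() < end


def route(event: Event, now: datetime) -> str:
    """Return the name of the branch the event takes."""
    # Branch 1: database-related events go to the DB ops team.
    if event.labels.get("component") == "database":
        return "db-ops"
    # Branch 2: critical events escalate straight to on-call responders.
    if event.severity == "critical":
        return "on-call"
    # Branch 3: low-severity events outside support hours are ignored.
    if event.severity == "low" and not in_support_hours(now):
        return "ignore"
    # Else path: nothing matched, so the event still continues downstream.
    return "default"
```

Branches are evaluated in order, so a critical database event lands with the DB ops team in this sketch; in a real flow the order of the conditions is a design decision in its own right.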
Built with teams in mind, Event Flows can be assigned to one or more teams in ilert, making them easily reusable and manageable across larger organizations.
If you have suggestions for other nodes, don't hesitate to contact our support team or submit your idea in the ilert Roadmap.
Smarter insights with Reports 2.0
Check out the refreshed experience for all Reports, including Notifications and On-call reports. With a sleek design and enhanced filtering options, you can now quickly break down notifications and on-call activities by user, team, or custom time periods – helping you detect patterns and gain clarity.
The updated On-call reports show detailed logs of shifts, including time spent on each alert. Here, you also have more filtering options to fine-tune reports to various needs and audiences. This update enables better compensation tracking and fairness across teams. With Reports 2.0, ilert gives you deeper visibility into alert fatigue, delivery success, and overall incident response performance.
Overlay public holidays directly in your on-call schedules
Creating one-time schedule overrides just got easier. With the new holiday calendar overlay, ilert now displays relevant national holidays directly within the on-call schedule detail view. This removes the need to check external calendars and reduces setup errors. Simply spot holiday conflicts at a glance and create overrides with fewer clicks, improving coverage and reducing time spent managing schedules. You will probably also notice an overall elevated view of on-call schedules, as we overhauled its design.
‘Undo’ and ‘Regenerate’ options in AI-assisted incident communication
Managing incidents with AI just got more flexible. The latest ilert update enhances the AI-assisted incident comms workflow by giving users more control over the generated content. Now, when you press ‘Generate,’ ilert creates the incident summary and message based on your input and automatically displays a preview. Once generation completes, the Generate button transforms into a menu with two new actions:
Undo: Reverts back to your previously entered summary and message.
Regenerate: Creates a new version of the incident text based on your latest changes.
This allows for fast iteration without losing your original input, saving time and reducing errors in high-pressure moments. Additionally, the notification preview box at the bottom of the screen now clearly shows which status pages the incident will be posted on and how many subscribers will be notified. This ensures full visibility before you click ‘Create new incident’.
A few more improvements
Bulk-link alerts to incidents from the alert list. Managing multiple alerts just became more efficient. The alert list page now supports bulk actions, allowing you to select multiple alerts and link them to a single incident in one go. This speeds up incident management, especially during larger outages or correlated alert storms, reducing manual work and ensuring better alert-to-incident traceability.
ilert now supports labels. Labels are key-value pairs that add structured context to alerts and events. Labels make it easier to filter, route, and analyze incidents based on relevant information. They’re fully integrated with ICL and ITL, allowing dynamic routing, filtering, and automation based on runtime context. While we started with the event API and alerts, we are looking forward to bringing new filter options to all entities across the board.
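To make the key-value idea concrete, here is a minimal sketch of label-based filtering using plain Python dictionaries. It does not use ICL, ITL, or any ilert API; the `matches` helper, the label keys (`env`, `team`), and the sample alerts are invented for illustration.

```python
def matches(labels: dict, selector: dict) -> bool:
    """True if every key-value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Sample alerts with hypothetical labels.
alerts = [
    {"summary": "High CPU on api-1", "labels": {"env": "prod", "team": "platform"}},
    {"summary": "Slow query", "labels": {"env": "prod", "team": "db"}},
    {"summary": "Disk warning", "labels": {"env": "staging", "team": "db"}},
]

# Keep only production alerts owned by the db team.
prod_db = [a["summary"] for a in alerts if matches(a["labels"], {"env": "prod", "team": "db"})]
print(prod_db)
```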
Even better heartbeats. To prevent misconfigurations, ilert now prompts you if you try to save a heartbeat without selecting an alert source, ensuring you don’t accidentally create silent monitors. Additionally, you can customize the message for heartbeat pings. You’ll also now see your current heartbeat monitor usage directly in the ‘Usage & limits’ section (top right corner of the screen, under a cog icon), giving you better visibility and control.
Alert actions are displayed in the ilert Event Explorer. The Event Explorer is a real-time view into alert activity, showing detailed logs for every event sent to ilert. With the latest update, alert actions are now fully visible within the Event Explorer.
Markdown support for maintenance windows. You can now use Markdown in maintenance window descriptions. Whether editing in the management UI or displaying on status pages, your formatting – like bullet points, links, or code snippets – is now fully supported, helping you communicate planned downtime more clearly and professionally.
Auto-accept alerts for connected calls in Call flows. ilert’s Create Alert node in call routing now supports auto-accepting alerts on successful call connections. When the “Accept alert on answer” option is enabled, the first responder who picks up the call automatically accepts the alert, speeding up ownership assignment and eliminating manual steps. This feature improves clarity and reduces lag during voice-based incident acknowledgement. It also lets you replicate legacy call routing behaviour when migrating to call flows.
Integrations
ConnectWise. Automatically turn ConnectWise service tickets into ilert alerts. Keep your operations team in sync with real-time updates and streamline incident workflows between ITSM and on-call responders.
Alibaba CloudMonitor. Forward alerts from Alibaba Cloud CloudMonitor directly into ilert. Ensure critical metrics and events from your cloud infrastructure trigger the right on-call actions without delay.
TeamCity. Receive build and deployment failure alerts from JetBrains TeamCity in ilert. Stay on top of CI/CD issues and route incidents to the right developers instantly.
LibreNMS. Send network monitoring alerts from LibreNMS to ilert. Enhance your incident response by bringing SNMP and performance data into your centralized alerting and on-call system.
AI already transforms how we detect, respond to, and resolve outages. Traditional workflows often force responders to switch between dashboards, sift through logs, and coordinate across fragmented channels under stress. This reactive, manual approach leads to slower resolution, higher operational costs, and burnout, especially as IT systems grow more complex.
At ilert, we are not just discussing the future of incident management – we are actively building it. We have brought agentic incident response into production, enabling operational excellence while reducing manual toil and cognitive load for on-call teams. Here is how we made this vision a reality.
Building the foundation: Hive and the ilert AI voice agent
Our journey into agentic incident response began with architectural decisions prioritising flexibility, scalability, and intelligent action across all stages of the incident lifecycle.
Hive: Our LLM orchestration layer
Hive is our proprietary proxy and orchestration layer for large language models (LLMs). It powers intelligent incident summaries, contextual recommendations, and advanced workflows across ilert, enabling us to manage multiple model providers, optimise workload routing, and ensure a secure, consistent, and high-performance AI backbone for all use cases.
Hive allows us to seamlessly integrate new LLMs as they emerge, control cost efficiency by routing tasks to the best-fit model, and maintain data privacy while delivering highly contextual intelligence in real time.
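Routing a task to the "best-fit" model can mean many things. One simple reading is "the cheapest model that can handle the task", sketched below. This is not how Hive is implemented; the model names, costs, and capability tags are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float   # illustrative numbers only
    capabilities: frozenset     # task types this model is allowed to handle


# Hypothetical model catalogue.
MODELS = [
    Model("small-fast", 0.1, frozenset({"summarize"})),
    Model("large-reasoning", 1.0, frozenset({"summarize", "root_cause"})),
]


def pick_model(task: str, models=MODELS) -> Model:
    """Pick the cheapest model whose capabilities include the task."""
    candidates = [m for m in models if task in m.capabilities]
    if not candidates:
        raise ValueError(f"no model supports task {task!r}")
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)
```

With this catalogue, `pick_model("summarize")` selects `small-fast`, while `pick_model("root_cause")` falls through to `large-reasoning`. Adding a newly released model then only means appending an entry to the catalogue.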
AI voice agent for seamless responder interaction
Communication is critical during incidents, especially when responders need to act without being tethered to keyboards. Our AI voice agent enables responders to gather updates or report incidents verbally, integrating into existing call flows as a natural part of the process. It transforms voice interactions into structured, actionable alerts while synthesising updates from diverse data sources, bridging human intuition with automated data-driven action.
What is MCP (Model Context Protocol)?
The Model Context Protocol (MCP) is a dynamic, real-time protocol built by Anthropic that connects your data to the ilert Responder, providing the rich, structured context our agents need to act intelligently during incidents.
Why did we adopt MCP?
Traditional integrations often leave systems disconnected, requiring manual correlation across telemetry, logs, and infrastructure state during incidents. MCP was designed to eliminate these silos by automatically aggregating, structuring, and transmitting incident-relevant context in real time.
How does MCP work?
MCP gathers data from your monitoring systems, log aggregators, deployment pipelines, and infrastructure environments, processes it within a secure, EU-compliant, multi-tenant architecture, and delivers only the necessary data to our agentic responders. By doing so, MCP:
Ensures your agent has real-time, granular awareness of incidents;
Maintains strict data security, isolation, and compliance;
Reduces manual correlation and cognitive load during critical moments;
Enables low-latency, context-rich interactions with the ilert Responder.
Think of MCP as the neural network that links your observability stack, code repositories, and infrastructure directly to our AI systems, ensuring that decisions and suggestions are always contextually accurate, actionable, and relevant.
The ilert Responder pipeline: From alert to agent-proposed actions
We designed an end-to-end pipeline that transforms monitoring signals into intelligent, actionable workflows to accelerate incident resolution.
Event Flow → Alert
ilert Event Flow ingests monitoring signals and applies your rules and thresholds to trigger alerts when specific conditions are met. This ensures the right teams are notified the moment an incident requires attention, without unnecessary noise.
MCP (Model Context Protocol) comes into play
Immediately upon alert generation, MCP retrieves and structures relevant telemetry data, logs, recent deployment changes, and infrastructure status, delivering it securely to the ilert Responder. This ensures the Responder has comprehensive situational awareness, eliminating the manual task of gathering context during incidents. This is possible through context-aware integrations with:
Observability tools: To pull telemetry and time-series data from Prometheus and Grafana;
Code repositories: To access commit history and deployment metadata from GitHub;
Infrastructure environments: To gain real-time status and configurations from Kubernetes.
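The context-gathering step can be pictured as collecting one structured bundle per alert from several independent sources. The sketch below shows that pattern only. It is not the MCP wire protocol and makes no real calls to Prometheus, GitHub, or Kubernetes; `build_context` and the two stub fetchers are hypothetical names.

```python
from typing import Any, Callable, Dict

Fetcher = Callable[[Dict[str, Any]], Any]


def build_context(alert: Dict[str, Any], sources: Dict[str, Fetcher]) -> Dict[str, Any]:
    """Collect context for one alert from every configured source."""
    context: Dict[str, Any] = {"alert": alert, "sources": {}, "errors": {}}
    for name, fetch in sources.items():
        try:
            context["sources"][name] = fetch(alert)
        except Exception as exc:  # one failing source should not block the rest
            context["errors"][name] = str(exc)
    return context


# Stub fetchers standing in for real integrations (all hypothetical).
def recent_commits(alert):
    return ["abc123 bump connection pool size"]


def pod_status(alert):
    raise TimeoutError("cluster API timed out")


bundle = build_context(
    {"id": 42, "service": "checkout"},
    {"commits": recent_commits, "pods": pod_status},
)
```

Recording failures under `errors` instead of aborting means the agent still receives partial context, and can see which sources were unavailable when it reasons about the incident.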
ilert Responder proposes actions
The ilert Responder ingests and correlates data in real time, becoming an intelligent participant in incident response rather than a passive notification system. Leveraging its deep, contextual understanding, the ilert Responder formulates actionable recommendations such as:
Root-cause suggestions,
Step-by-step remediation instructions,
Escalation paths and dependency insights.
These are presented within the ilert chat interface, allowing responders to review, approve, or modify actions for safe execution during live incidents. The interactive chat UI evolves into a command centre, enabling responders to:
Request deeper insights dynamically,
Perform direct actions like scaling Kubernetes pods,
Drill down into suggested root causes and metrics seamlessly.
Operational improvements
Agentic incident response at ilert is delivering tangible results for engineering and operations teams:
Real-time log correlation and root cause inference to pinpoint likely causes within moments;
Diagnostic summaries providing human-readable, actionable overviews of incidents;
Interactive natural language Q&A with the agent for fast data retrieval and contextual clarity;
Actionable remediation proposals with direct, safe execution workflows;
Automated post-mortems and timelines to reduce manual documentation effort post-incident.
By reducing manual toil and accelerating clarity, teams are spending less time managing incidents and more time focusing on delivering reliable services.
Key learnings and best practices
Building and operating agentic systems for mission-critical incident management at ilert has taught us:
Trust through transparency: Autonomous data collection, correlation, and safe, pre-approved actions happen without manual steps, ensuring speed and reducing cognitive load for responders. For actions with higher risk or business impact, teams can choose to add approval steps if desired. Full transparency into what the agent is doing and why builds trust, enabling responders to understand and oversee agentic actions without slowing down resolution.
Guarding against hallucinations: Rich, structured, and verified context via MCP ensures the agent works with coherent, reliable information, significantly reducing the risk of inaccurate suggestions or actions.
Performance tuning for low latency: Incident response is time-critical. Through speculative tool calls and optimised data pathways, we ensure that insights and actions are generated in near real-time, reducing MTTR when every second counts.
Continuous learning: Feedback loops integrated into workflows help our agent refine its recommendations and actions over time, improving accuracy and effectiveness with every incident.
Safe autonomous execution: By defining safe, controlled scopes for automated remediation, the agent can execute corrective actions independently where appropriate, accelerating resolution while retaining operational safety and rollback capabilities.
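The combination of pre-approved scopes, optional approval steps, and transparency can be sketched as a small gate in front of every proposed action. This is an illustrative assumption, not ilert's implementation: the action names, the `SAFE_ACTIONS` set, and the `decide` function are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical pre-approved scope: actions the agent may run on its own.
SAFE_ACTIONS = {"restart_pod", "scale_up"}


@dataclass
class Proposal:
    action: str
    target: str
    reason: str  # why the agent proposes this, kept for transparency


audit_log: List[str] = []


def decide(proposal: Proposal, approved_by: Optional[str] = None) -> str:
    """Return 'execute' or 'needs_approval' and record the decision."""
    if proposal.action in SAFE_ACTIONS:
        outcome = "execute"          # inside the safe scope
    elif approved_by is not None:
        outcome = "execute"          # higher-risk, but a human signed off
    else:
        outcome = "needs_approval"   # higher-risk and not yet approved
    audit_log.append(f"{proposal.action} on {proposal.target}: {outcome} ({proposal.reason})")
    return outcome
```

For example, `decide(Proposal("scale_up", "checkout", "cpu saturation"))` returns `"execute"`, while a `rollback` proposal returns `"needs_approval"` until it is called again with `approved_by` set. Every decision lands in `audit_log` together with the agent's stated reason.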
Conclusion: Agentic incident response is already here
At ilert, we believe that the era of manual, reactive incident management is ending, and the benefits of agentic automation are too significant to delay. We are proud to bring these advanced capabilities into production, reducing toil, cutting MTTR, and empowering teams to focus on what matters most: reliability and innovation.
While ilert Responder already automates data gathering, analysis, and remediation suggestions, this release is just the first milestone. Our next goal is to let ilert Responder resolve well-understood, low-risk incidents – like flaky health checks or transient latency spikes – entirely on its own. Human responders stay in control, but much of the routine toil will fade away.
Imagine it’s 2 AM and a critical system flatlines without warning. A bleary-eyed on-call engineer scrambles to restore service, shielding customers from a major outage that could torpedo your next Service Level Objective (SLO) review. Yet when daylight returns, debates over fair on-call compensation start all over again: What’s “just” pay for sleepless nights, unpredictable pings, and rapid-fire incident responses?
What counts as on-call?
On-call is a special working hour arrangement under employment law. It comes into effect when the employee is obliged to be contactable at least by phone, so they can start work in an emergency. On-call duty is generally counted as time specifically meant for work purposes.
In practice, this means that employees are normally not allowed to work while on call. However, there may be exceptions. For example, on-call employees may also work from home if they can be reached through their work device.
What's the difference between on-call and stand-by service?
There’s a time-and-location gap between the two models:
On-call – employees stay reachable (phone, pager, or on-call management app) and can log in from anywhere when an alert fires.
Stand-by – staff must be physically present on site and ready to act immediately. German labour law labels this Bereitschaftsdienst as working time and treats it accordingly.
In IT operations, remote on-call service is usually preferred because most incidents (code rollbacks, config tweaks) can be resolved over VPN. Stand-by still matters for latency-critical environments, for example, trading platforms or industrial control systems, where a technician must monitor hardware and intervene within seconds to meet strict service-level agreements.
Are on-call hours the same as work hours?
Whether on-call duty counts as working hours isn’t as clear-cut as it looks. Under most labour-law frameworks – including Occupational Safety and Health guidance and the U.S. FLSA Fact Sheet #22 – passive on-call time is treated as rest time as long as no alert comes in. The moment you’re paged and start troubleshooting, those minutes flip to active working time. In borderline cases, courts (e.g., Germany’s BAG, Oct 2023 ruling 6 AZR 210/22) decide which periods qualify, so definitions often vary by jurisdiction and company policy.
There’s also no universal rule on pay. Many employers treat on-call duty as billable work and compensate it accordingly; others classify passive standby as unpaid availability. If your firm uses the latter model, remember you won’t be reimbursed for simply being reachable.
Bottom line: on-call time isn’t always the same as working time – it hinges on the organisation’s compensation policy. Some U.S. big-tech companies (Airbnb, Apple, Netflix) don’t pay for passive standby, while many European tech firms do.
On-call duty times
On-call scheduling is usually confined to specific nights or weekends agreed in advance and written into the employment contract. Because fewer staff are on site during those hours, reliable night and weekend coverage is essential.
In Germany, the ICT trade group Bitkom, in its guideline on Rufbereitschaft im IT-Betrieb, recommends capping on-call assignments at 56 days per calendar year and guaranteeing at least 8 consecutive hours of rest per shift. On-call duty is generally classified as non-working time, so the usual 11-hour rest period required by §5 (1) of the Arbeitszeitgesetz does not apply until the engineer has actively worked on an incident.
Need an easy way to keep those limits visible? ilert’s on-call scheduling shows every planned rotation and actual shifts at a glance, so teams stay compliant without spreadsheets.
How is payment settled for on-call service in IT companies?
In IT companies, on-call hours are usually considered working time and are paid as such. As mentioned above, be sure to clarify this with your employer in advance to check what is stated in your contract.
For large corporations like Airbnb or Apple, which do not pay for on-call time, the argument is that their employees are already among the top earners. This means that their employees still earn much more than they would at most companies that pay on-call time in addition to their salary.
In Germany, there is no specific law regarding how on-call hours should be paid. This is, therefore, left up to the employer’s discretion. Most commonly, however, on-call duty is generally paid working time, i.e., the employee receives payment for the time he or she is on-call. This can be structured in different ways.
In practice, on-call time is often compensated either on top of the standard hourly wage or with time off. In many companies, on-call time is also counted as working time and is paid for accordingly. However, this is only possible if the employee is working rather than being only available by phone. As already mentioned, this would be the case while working from home.
In most tech organisations, hours spent on-call count as paid working time, yet the formula changes from company to company. Before you join a rota, double-check your contract or the internal on-call compensation policy.
In practice, you’ll see two common models:
Hourly uplift
A percentage on top of the standard rate for every scheduled standby hour.
Time-off swap
Eight hours on-call earn four hours of paid leave.
Remember, only the minutes you actively work are universally classed as working time; simply being reachable may stay unpaid unless your company’s policy says otherwise.
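As a worked example, the two models can be compared with a few lines of arithmetic. The 2:1 swap ratio comes from the example above (eight hours on-call for four hours of leave); the 25% uplift, the €40 hourly rate, and the 48 standby hours are assumptions chosen purely for illustration.

```python
def hourly_uplift(standby_hours: float, hourly_rate: float, uplift: float) -> float:
    """Extra pay: a percentage of the hourly rate for each scheduled standby hour."""
    return standby_hours * hourly_rate * uplift


def time_off_swap(standby_hours: float, ratio: float = 0.5) -> float:
    """Paid leave earned: 0.5 means eight on-call hours earn four hours off."""
    return standby_hours * ratio


# Assumed scenario: a 48-hour weekend rota at EUR 40/hour with a 25% uplift.
extra_pay = hourly_uplift(48, 40.0, 0.25)   # 480.0 EUR on top of base salary
leave_hours = time_off_swap(48)             # 24.0 hours of paid leave
print(extra_pay, leave_hours)
```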
How are on-call services paid in IT companies?
Pay still varies by company size, sector, and risk profile. The federal collective agreement for public employees (TVöD) specifies the following allowances in § 8 Abs. 3:
Stand-by shifts of 12 hours or longer
Weekdays (Mon–Fri): a flat daily allowance of 2× the hourly rate.
Weekends and public holidays: a flat daily allowance of 4× the hourly rate.
Shorter stand-by windows (under 12 h)
Earn an additional 12.5 % of your hourly rate for each hour on call.
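The allowance rules summarised above translate into a short calculation. The sketch reads "for the entire day" as a flat daily allowance rather than a per-hour multiplier, which is an interpretation, and it ignores everything else in the agreement (pay for time actually worked, pay groups, rounding). The function name and the €30 hourly rate are assumptions.

```python
def tvoed_allowance(hours: float, hourly_rate: float, weekend_or_holiday: bool) -> float:
    """On-call allowance for one shift under the simplified rules above."""
    if hours >= 12:
        # Flat daily allowance: 2x the hourly rate on weekdays, 4x otherwise.
        multiplier = 4 if weekend_or_holiday else 2
        return multiplier * hourly_rate
    # Shorter windows: 12.5% of the hourly rate per on-call hour.
    return hours * hourly_rate * 0.125


rate = 30.0  # assumed hourly rate in EUR
print(tvoed_allowance(24, rate, weekend_or_holiday=False))  # 60.0
print(tvoed_allowance(24, rate, weekend_or_holiday=True))   # 120.0
print(tvoed_allowance(6, rate, weekend_or_holiday=False))   # 22.5
```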
At a large corporation or a successful start-up, you can expect on-call compensation of about €1,000 per week. At Zalando, the on-call compensation is roughly €1,050; at the start-up HelloFresh, €1,000; and at Amazon Germany, about €800. Several companies in the financial sector offer comparable rates, although exact amounts vary. Here are the figures reported by the Pragmatic Engineer blog:
SumUp (Germany): €1,050 per week
N26 (Germany): €880 per week
Klarna (Europe): €500 per week
Mastercard (UK): £470 per week
PayPal (Germany): $350 per week
Wise (UK): £300 per week
Recent engineer forums and community posts add further reference points:
Google – Tier-1 SRE rota (five-minute response): paid for 40 minutes of every on-call hour outside office hours (66% of the base hourly rate). Tier-2 (30-minute response): 20 minutes per hour (33%).
AWS (EU Tier-0 services) – 25% of base pay for each out-of-hours on-call hour, plus a half-day of paid time off for every Saturday or night-time page.
Beyond payment: safeguarding on-call well-being
Pay isn’t the only lever that matters. On-call duty disrupts normal sleep patterns and life outside work, so protecting responders’ well-being is critical. Your team will cope far better if you follow these five practices:
Set crystal-clear expectations for response windows and escalation paths.
Rotate shifts fairly with primary + secondary roles, and use an automated on-call schedule so the rota is transparent.
Watch the workload: track pages per engineer and cap consecutive overnights with on-call reports.
Leverage tooling: alert deduplication and smart escalations in ilert’s on-call management cut noise and shorten time-to-sleep.
Provide regular training and support: run quarterly fire-drills or gamedays so responders stay confident under fire.
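The workload practice above (tracking pages per engineer and capping consecutive overnights) comes down to two small calculations. The sketch below works on plain Python data rather than on ilert's reports; the function names and sample data are hypothetical, and what counts as "too many" nights in a row is left to your own policy.

```python
from collections import Counter
from datetime import date, timedelta


def pages_per_engineer(pages):
    """Count pages per engineer from (engineer, day) tuples."""
    return Counter(engineer for engineer, _ in pages)


def longest_overnight_streak(nights):
    """Length of the longest run of consecutive calendar nights on call."""
    longest = current = 0
    previous = None
    for night in sorted(set(nights)):
        if previous is not None and night - previous == timedelta(days=1):
            current += 1   # the run continues
        else:
            current = 1    # a gap starts a new run
        longest = max(longest, current)
        previous = night
    return longest


pages = [("ana", date(2024, 3, 1)), ("ana", date(2024, 3, 2)), ("ben", date(2024, 3, 2))]
nights = [date(2024, 3, 1), date(2024, 3, 2), date(2024, 3, 3), date(2024, 3, 5)]
print(pages_per_engineer(pages))         # Counter({'ana': 2, 'ben': 1})
print(longest_overnight_streak(nights))  # 3
```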
Quick summary
On-call duty in IT means being reachable outside normal hours to respond to incidents, usually remotely. It differs from standby service, which requires physical presence and is always counted as working time. Legally, on-call time isn’t always paid; typically, only active incident response counts as working time. Compensation varies: some companies offer hourly uplifts or time-off swaps, while others, like Apple or Airbnb, don’t pay for passive standby. In Germany, Bitkom recommends no more than 56 on-call days per year and at least 8 consecutive hours of rest per shift. Weekly stipends range from €800 to €1,050 at firms like Zalando, HelloFresh, and SumUp. To protect engineers, best practices include fair rotations, clear escalation paths, tooling to reduce alert noise, and regular training.