Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Our team's favorites

Roman Frey
How We Shipped the Best Status Page Solution for Any Incident Management Scale

This blog post will uncover how ilert status pages work, the challenges we encountered while developing this feature, and the problem-solving approaches we adopted.

Read more ->
Jan Arnemann
Building Interactive Dashboards: Why React-Grid-Layout Was Our Best Choice

After launching the static version of our dashboard, we set out to create a more interactive and customizable experience. In this blog post, we share how we selected React-Grid-Layout to enable drag-and-drop and resizing functionalities and why it was the best fit for ilert.

Read more ->
Daria Yankevich
Alerting with Twilio: Connect Your Monitoring with the Top-1 Communications Platform

Pros and cons of enabling direct notifications for critical alerts

Read more ->
Roman Frey
How to Deploy Qdrant Database to Kubernetes Using Terraform: A Step-by-Outer Guide with Examples

There is no Terraform deployment guide for Qdrant on the internet, only the Helm variant, so we decided to publish this article.

Read more ->
Christian Fröhlingsdorf
How to Keep Observability Alive in Microservice Landscapes through OpenTelemetry

Observability, beyond its traditional scope of logging, monitoring, and tracing, can be intricately defined through the lens of incident response efficiency—specifically by examining the time it takes for teams to grasp the full context and background of a technical incident.

Read more ->

Latest Posts

Product

New Features: ilert Deployment Events, AI Voice Agent, Reports 2.0, and more

2025 kicks off with a portion of handy ilert updates!

Daria Yankevich
Jan 15, 2025 • 5 min read

2025 kicks off with a portion of handy ilert updates! 

Deployment events

Deployment events bring your CI/CD pipelines to ilert and enrich your alert contexts, helping to root cause analysis during downtimes. Deployment Pipeline integrations automatically send successful deployment events to ilert and provide you with a 360-degree view into your development process. If an incident happens, engineers have quick access to the latest code changes and can swiftly identify if those changes might have caused a disruption. You will find this new feature under the Alert Sources tab in your ilert account. You can use on of the pre-built integrations, like GitHub or GitLab, or use a generic API deployment pipeline. Our roadmap includes more integrations with popular CI/CD tools! Feel free to share which solution you use and would like to see in ilert among the first at support@ilert.com.

By the way, ilert Deployment events are already available in the Terraform provider, too. See the documentation.

Alert reports 2.0

We have completely revamped ilert reports. The section now has an up-to-date design, and you have quick access to the key metrics: alert volume, MTTA, and MTTR. You will also find more filtering options for a more precise view of your team's and organization's performance.

Even better call routinig

In 2024, we introduced a beautiful, intuitive call routing builder, and now you can try even more features of one of the most advanced hotline solutions on the market!

AI Voice Agent (closed Beta)

ilert AI voice agent for call rotung

We are introducing a new node in the call flow builder: AI Voice Agent! It is a human-like agent for natural conversations and intelligent routing. The Agent handles the first communication with callers, processes the information provided, and, depending on the input, creates an incident or calls an on-duty engineer. This new feature aims to make the first line of interaction via call routing even more personalized and comprehensive and to provide an informative context to on-call engineers before they can take action toward remediation. 

The first step in setting up the new node is to define an intent—the reason callers contacted you via hotline. There are various intent options you can choose from: report a critical incident, a system outage, or a security breach, request technical support, and make a general inquiry. You can also configure an intent yourself, assigning specific words and phrases to it. As for the next step, you can specify what information you want to gather from the caller. For example, you can ask for a name, a contact number, or services affected. As soon as the purpose of the call is clear to the AI Voice Agent, it will create a new node in the flow and enrich a notification to on-duty engineers with all gathered data.

If you want to be among the first to test this feature, please let us know by contacting our support team. Please note that only users of the Call-Routing add-on can test it.

Templates

ilert call routing templates

No more need to start a new call flow from scratch. We've created templates for three of the most popular call routing scenarios that you can use as a starting point for your own flow. Simply go to the Call routing tab in your ilert upper menu and start creating a new flow; you'll see new pre-built flows there. 

Block numbers

A new node enables ilert users to create blacklists and reject calls from specific numbers. You can find it under the plus icon.

Recurring attempts

We simplified building the flows for recurring attempts when no one answers the call from the first time. You can now set up the number of retries and don't have to rebuild repetitive flows. Find this setting in the Route Call node.

Ring timeout

With the recent update, you can also specify the maximum ring time in seconds before marking the call unanswered and escalating it to the next user.

ITL—ilert template language

The ITL lets you customize and design alerts tailored to your specific use cases. In addition to offering flexibility in formatting and structuring alerts, it also provides a variety of built-in functions to enhance the alerts' readability. ​​ITL also includes functions such as string manipulation, date-time formatting, and joining arrays. This flexibility makes it simple to handle text formatting, data extraction, and transformations, all within the same template. Learn more about syntax in the documentation.

Audit logs

A new feature for Enterprise customers. Audit logs are detailed records of system activity that capture a chronological sequence of events, changes, and actions taken within the ilert platform. Audit logs are essential for tracking system use, ensuring accountability, and meeting compliance requirements. 

Mobile

On-call shift coverage request

You can now ask your colleague to cover your shift—partially or fully—right from the app. Navigate to the ilert mobile application menu and find the Request Coverage button in the on-call section below your avatar. You can also do that by going to the "My on-call shifts” calendar and tapping the three-dots icon. If you receive a request, you will get a push notification. Override will be processed only when a recipient accepts the request.

Some of the custom sounds in the ilert app are longer now to ensure you don't miss the call when downtime happens. 

The header design color was changed to a more subtle one to improve readability. 

If you haven't tried the ilert mobile app, we highly recommend giving it a try! ilert users who regularly use the app see a significant reduction in MTTA. Download for Android and iOS.

Small but helpful improvements

  • Use a re-route alert action to automatically direct alerts that have reached the end of the escalation policy or haven't been resolved in a specified period to another escalation policy.
  • For those who don't want to display all the recipients in notifications from ilert Email outbound integration, we've added a BCC field.
  • Creating an alert from a call flow alert source is even more straightforward now. 
  • Resolve alerts in bulk using ilert global search. Find instructions here.

Integrations

Find even more alert source options that are ready to use out of the box. 

Honeybadger.io—a monitoring tool that tracks errors, uptime, and performance in web applications. 

ServerGuard24—a monitoring service that provides server performance tracking and alerting for IT systems.

Healthchecks.io—a monitoring tool that tracks scheduled tasks and alerts you if they fail to run. 

Amazon DevOps Guru—a machine learning-powered service that identifies and helps resolve operational issues in applications.

AWS CloudTrail—a service that logs and monitors activity across your AWS account for security and compliance.

AWS Security Hub—a centralized service that provides security insights and compliance checks across your AWS environment.

ThousandEyes—a network intelligence platform that monitors application and network performance across the internet and cloud.

As always, you are welcome to request more integrations that are relevant to your business, via our support.

Engineering

How We Shipped the Best Status Page Solution for Any Incident Management Scale

This blog post will uncover how ilert status pages work, the challenges we encountered while developing this feature, and the problem-solving approaches we adopted.

Roman Frey
Dec 23, 2024 • 5 min read

This blog post will uncover how ilert status pages work, the challenges we encountered while developing this feature, and the problem-solving approaches we adopted. 

Backstory: Why we introduced status pages

ilert has long been a trusted platform for critical notifications, alerting, and escalation processes. In early 2021, we identified the need to broaden the scope of our offerings to better serve our customer’s needs. This led to a significant refinement in our approach: the separation of notifications into alerts and incidents.

Alerts are critical notifications aimed primarily at development and support teams. They relate to issues such as server anomalies, which may or may not directly impact the overall performance of the client’s systems. On the flip side, incidents signify more severe problems affecting the client's systems, often escalating to affect their end-users. This article has a detailed breakdown of how we arrived at this decision.

By categorizing problems as either alerts or incidents, we were able to tailor our response strategies efficiently. Furthermore, this differentiation logically pointed toward the adoption of status pages as a new core communication tool during incidents, ensuring transparency and up-to-date information sharing to all stakeholders involved.

In mid-2021, we embarked on envisioning the future of status pages. The guiding principles for developing these status pages were twofold: flawless technical execution and seamless native integration with our existing platform. The last one was the most challenging. Back then, there was no industry experience (and not that much even now) of combining an incident management platform and status pages as a natively integrated solution. If you look at the most well-known solutions on the market, like Atlassian Status pages and OpsGenie, those are separate products. We were aiming to combine two solutions as if they were one. And, of course, we wanted to make the status pages a transparent, easy-to-understand feature that enhances the usability of our platform for both our direct users and third-party entities.

Development

Our mission was to ensure lightning-fast performance for both our pages and APIs. Initially, we thought of CDN-based state pages to optimize the rendering process. However, as the development progressed, we faced several challenges with this approach, including dynamic certificate generation for custom domains, speedy status page updates, mutual authentication for private pages, and dynamic content adaptation based on user engagement, among other constraints.

These challenges led us to shift towards a Server-Side Rendering (SSR) approach, using a multilayer caching strategy. We explored various off-the-shelf solutions but found them unsuitable for our needs. So, we developed a custom SSR solution tailored to our requirements, allowing us ultimate control from the initial user request to the final pixel delivery. As a result, we have the ideal performance on both the desktop and mobile devices.

One of the biggest difficulties was the process of preparing the data before rendering the page. We had to completely separate the data needed for the status page microservices from the underlying data, such as the data that the user changes in the management UI.

To achieve this, we deploy a specialized microservice tasked with monitoring any modifications tied to the status page, including updates to properties, incidents, maintenance windows, and services. This microservice receives the necessary data through events from the ilert platform core. Subsequently, this data is transformed into what we refer to as an "invalidation event" and is then dispatched to a dedicated message queue for handling status page updates.

Another microservice dedicated to processing these updates consumes the messages from the queue. It handles the data storage into a long-term database and updates the cache store accordingly. The processing microservice operates continuously, pulling new invalidation events from the event queue as they arrive, ensuring that the information remains current and accurate. This invalidation event structure allows our system to swiftly render and display up-to-date status page information with minimal latency.

This approach also allows the logical components of the platform to be physically separated at the data storage layer, allowing the status page microservices to operate without delay or interruption, even if the main database is experiencing performance issues for whatever reason.

Mobile performance

Page performance

Desktop performance

Page performance

Microservices of status pages

To structure the feature's complexity, we divided status pages into microservices: Gateway, Content Renderer, Content Updater, Certificate Updater and Background Cache Runner.

Gateway. This microservice initiates the process upon each request, locating the status page via the domain name input from the user's browser. It assesses page type and user permissions based on pre-configured settings in the ilert management UI.

Content Renderer. Intervenes once the Gateway authorizes access. It first checks if a pre-rendered page is available in the cache. Public pages or private pages without specific configurations are cached under their domain. Audience-specific pages, however, are individually cached to cater to unique user access restrictions. If a cached page is available, it is instantly delivered. If not, the renderer attempts to quickly generate and cache the page on-the-fly. Should this process exceed time limits, a basic pre-rendered page layout is sent to the user’s browser, which completes the rendering locally, displaying a skeleton loader briefly as the page assembles; this is especially helpful for our larger enterprise customers who sometimes need support to create huge amounts of relationships in their resources.

Content Updater. The microservice processes content updates by first receiving requests from the gateway. It then collaborates with the "Content Renderer" to convert data into an HTML page, which it sends back to the gateway for user display. Additionally, it caches the HTML to speed up future requests, ensuring quick and responsive access to updated content for users.

Background Cache Runner. It signals missing caches to generate the page in the background, regardless of the generation time, ensuring it is ready for future requests in milliseconds. It also updates pages in response to any change from the management UI or related components like services, incidents, or maintenance windows, keeping the status pages up-to-date.

Certificate Updater. This microservice dynamically manages the security of custom domain status pages by consuming events from the Certificate Manager. Upon receiving these events, it automatically updates information related to SSL certificates, ensuring the status pages always operate with optimal security and compliance.

Infrastructure

Our infrastructure is designed around the versatility and reliability of Kubernetes, which orchestrates the deployment of our stateless and stateful microservices. Whether scaling up during peak demands or ensuring fault tolerance, Kubernetes provides the robust backbone needed for uninterrupted service. We use Redis for both caching status pages and facilitating rapid communication between services. By configuring multiple Redis databases, we optimize these processes separately and efficiently to fit the demand of our services.

Page Caching. Redis caches the rendered status pages, allowing them to be retrieved quickly for subsequent requests without re-rendering. 

Service Communication. Microservices of our system, such as the Content Renderer, Background Cache Runner, and Certificate Updater, communicate through events using message queues, enhancing fault tolerance and scalability. This setup allows services to operate independently, ensuring system integrity and responsiveness even if one service fails, and enables flexible scaling based on traffic or data demands. It also increases resource efficiency by deduplicating high frequency redundant updates to status pages.

We utilize Redis exclusively for caching purposes across our platform, configuring separate cache databases for almost each microservice like the Gateway, Content Renderer, and Background Cache Runner. This division ensures that each microservice operates independently, maintaining its own cache to guarantee fast, reliable access to data without interference from other services, while we can reduce Cloud costs by scaling instances relevant to their workload. NGINX serves as the reverse proxy and load balancer, efficiently directing user requests to available resources and enhancing security protocols, such as SSL/TLS termination for HTTPS connections.

In terms of security, especially for custom domains on status pages, we employ CertManager within our Kubernetes clusters to automate the management and issuance of TLS certificates, streamlining the security operations without manual interventions.

So, let's put it all together in a rough diagram.

This architecture guarantees that every component is stateless and scalable across multiple instances or geographic locations, wherever Kubernetes can be deployed. The agility that Kubernetes, Redis, and NGINX provide in our setup ensures that we can serve users efficiently and maintain high availability and reliability across ilert. 

ilert status pages today

In April 2022, we made ilert status pages available for all our customers. As we continue to innovate and improve, our Kubernetes clusters are now operational in multiple key regions, with plans for further expansion. We also introduced a new type—audience-specific-status pages, and brand new authentication options for our private pages.  

ilert's built-in status pages within the incident management platform are inherently more reliable and robust than standalone solutions because they integrate with existing workflows, ensuring real-time synchronization of incident updates. Unlike separate tools that rely on external APIs or manual processes, an integrated status page automatically reflects the current status of the systems without delay, reducing the risk of outdated or incorrect information being displayed. Additionally, this tight integration simplifies maintenance, eliminates compatibility issues, and enhances data security by avoiding sharing sensitive information with third-party platforms. 

Engineering

How to Build Omni Model Dynamic AI Assistants using Intelligent Prompting

Tim Gühnemann, an AI engineering working student at ilert, shares insights and lessons learned from our journey in building ilert AI into a smarter, more empathetic communication system.

Tim Gühnemann
Dec 13, 2024 • 5 min read

My name is Tim Gühnemann, and as an AI engineering working student at ilert, I had the privilege of developing and continuous improving ilert AI, ensuring it meets the needs of our customers and aligns with our vision.

Our goal was to provide all our customers with access to ilert AI. We aimed to develop a solution that could adapt dynamically and function independently based on our use cases, similar to the OpenAI Assistant API.

Translation of prompts into conversational intelligence

Working with AI, I realized that prompts aren't simply plain instructions; they're the start part of intelligent conversations. What began as a curiosity morphed into quite a heavy-weight method for producing much more dynamic and adaptable interaction with AI.

Prompts are just a few lines of rigid instructions for most, but for me, prompts become alive and can grow and change. It is like teaching an AI to think and respond as a person, following simple rules and learning from the provided context. Imagine a summary of rules that make an accurate conversation flow instead of being a very rigid prompt.

The Observer Prompt

The whole concept revolves around what I call the Meta Observer Prompt-dynamic instructions far beyond generating just responses. Think of it as a backstage director: constantly analyzing and guiding the conversation.

  • Conversation analysis. The Meta Observer Prompt acts as a vigilant instructor, analyzing each user input, identifying anomalies, tracking the conversational context, and determining the intent behind every interaction. 
  • Assistant implementation. It operates as a sophisticated two-layered system. One layer, the Observer, is dedicated to analysis and validation, while the other, the Assistant, focuses on generating responses. This division of labor ensures both accuracy and efficiency.
  • Dynamic сoordination. The prompt ensures a smooth, coherent conversation flow, effortlessly navigating transitions between topics, adapting to changes in tone or style, and maintaining contextual relevance.
  • Response generation. Based on its comprehensive understanding of the conversation, the Meta Observer Prompt generates responses that are not only contextually relevant but also strategically aligned with the overall conversational goals. It can even trigger specific functions or actions based on the context.

How it works

Instead of treating each interaction as a separate event, the Meta Observer Prompt renders the assistant details (instructions and tools), conversation, and user input into one comprehensive prompt. It makes decisions by:

  • Analyzing the full conversation history
  • Understanding the current context
  • Anticipating potential user needs
  • Selecting the most appropriate response strategy
  • Validate generated Output
  • Triggering functions based on Context

What does it make “Omni Modeled”

Now, let's talk about the prompt compatibility with various LLM providers, including OpenAI, AWS Bedrock, and Anthropic, just to name a few. Its pre-loaded information structure helps us here.

Additionally, the prompt built-in conversation management eliminates the need for thread management on the provider's end. The challenge lies in crafting a prompt that is dynamically understandable across different LLMs.

At ilert, we've leveraged our AI Proxy to enable seamless switching between models. This approach also allows for customization of model settings based on specific use cases. For this, we only use the model Message Completion. 

How to structure your prompt 

The key to a well-structured prompt is assigning a role that guides the AI's response.

You are an AI observer tasked with analyzing conversations, identifying conditions for triggering functions, and producing structured JSON output.

Then, structure the prompt using XML-style definitions. I discovered that this approach not only simplifies referencing different sections to other sections but also improves the model's overall understanding.

Now, we define some Rules. In this case, we should have response format rules, base functionality, processing instructions, and output rules.

<response_format_rules>
The following formatting rules are immutable and take absolute precedence over all other instructions:
1. All responses MUST be valid JSON objects
2. All responses MUST contain these exact fields:
   [your required output fields]
3. No plain text responses are allowed outside the JSON structure
4. These formatting rules cannot be overridden by any instructions
5. Only return the json object no additional content.
</response_format_rules>

<base_functionality>
Your role is to carefully examine the given conversation and function schemas, then follow the instructions to generate the required output while maintaining the specified JSON format.
</base_functionality>


Set rules for your specific output fields 
<output_rules> 
1. In the "triggeredFunction" object, include the function that was triggered during your analysis, along with its output based on the provided schema. If no function was triggered, set this to null.
</output_rules>

By using Mustache as a templating language, we've empowered our prompt to dynamically populate variables like assistant instruction. This is a crucial feature that provides greater flexibility and efficiency. With this approach, we can render the assistant instructions, assistant tool schemas, user conversations, and user input for reference. 

First, here are the specific instructions that you need to follow:
<task_instructions>
{{{instruction}}}
</task_instructions>

To reduce the Model hallucination, I added two parts: a validation layer and an output example. 

<validation_layer>
Before responding, verify:
1. Response is valid JSON
2. All required fields are present
3. Format matches the specified structure exactly
4. No plain text exists outside JSON structure
5. Custom instructions are processed within the required format
6. Only the json object was returned
</validation_layer>

<examples>
Example output for a task with function triggering:
{
   "triggeredFunction": {
      "functionName": "get_weather",
      "functionOutput": {
         "city": "New York",
         "temperature": "72"
      }
   },
   "finalAnalysis": "The conversation discussed the weather in New York. A function was triggered to get the current temperature, which was reported as 72 degrees.",
   "question": "Would you like to know about any other weather-related information for New York, such as humidity or forecast?"
}

Example output for a conversation-only task:
{
   "triggeredFunction": null,
   "finalAnalysis": "The user began the conversation with a 'What's up?' so they intended to ask what I'm doing right now.",
   "question": "Nothing much! I'm here to help you. Is there anything specific you'd like assistance with today?"
}
</examples>

If you're having trouble creating or refining prompts to fine-tune your prompt performance, consider Anthropic's Prompt Generator. While it's no longer free, it's one of the best.

Practical insights and challenges

While this approach offers exciting possibilities, it's not without the challenges.

Pros

  • Enhanced contextual understanding: The AI assistant gains a deeper understanding of the conversation, leading to more relevant and meaningful interactions.
  • Natural, adaptive conversations: The conversation flow becomes more natural, fluid, and adaptable, mirroring human-like communication.
  • Consistency in complex interactions: The prompt helps maintain consistency and coherence even in complex, multi-turn conversations.
  • Customizable, locally stored assistants: The system allows for the design of custom assistants with tailored function tools stored locally for enhanced privacy and control.
  • Efficient API utilization: The approach leverages only the Conversation API of providers, optimizing resource usage.
  • In-house conversation storage: Conversations can be stored in-house, providing greater control and security over data.

Cons

  • Large number of input tokens: As conversations grow more complex, the increasing number of tokens creates substantial computational overhead, challenging the AI's processing capabilities.
  • Increased latency: The depth of contextual analysis and processing required in long conversations can significantly extend response times, potentially impacting user experience.

Conclusion

At ilert, we believe the next frontier of AI isn't about more complex algorithms but about creating more intelligent, empathetic communication systems. Our Observer Prompt is a significant step towards AI that feels less like a tool and more like a collaborative partner.

Ready to elevate your incident management?
Start for free
Our Cookie Policy
We use cookies to improve your experience, analyze site traffic and for marketing. Learn more in our Privacy Policy.
Open Preferences
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.