This blog post will uncover how ilert status pages work, the challenges we encountered while developing this feature, and the problem-solving approaches we adopted.
Backstory: Why we introduced status pages
ilert has long been a trusted platform for critical notifications, alerting, and escalation processes. In early 2021, we identified the need to broaden the scope of our offerings to better serve our customers' needs. This led to a significant refinement in our approach: the separation of notifications into alerts and incidents.
Alerts are critical notifications aimed primarily at development and support teams. They relate to issues such as server anomalies, which may or may not directly impact the overall performance of the client’s systems. On the flip side, incidents signify more severe problems affecting the client's systems, often escalating to affect their end-users. This article has a detailed breakdown of how we arrived at this decision.
By categorizing problems as either alerts or incidents, we were able to tailor our response strategies efficiently. Furthermore, this differentiation logically pointed toward the adoption of status pages as a new core communication tool during incidents, ensuring transparency and up-to-date information sharing to all stakeholders involved.
In mid-2021, we embarked on envisioning the future of status pages. The guiding principles for developing these status pages were twofold: flawless technical execution and seamless native integration with our existing platform. The latter was the most challenging. Back then, there was no industry experience (and there still isn't much) in combining an incident management platform and status pages into a natively integrated solution. If you look at the most well-known solutions on the market, such as Atlassian Statuspage and Opsgenie, those are separate products. We were aiming to combine two solutions as if they were one. And, of course, we wanted to make status pages a transparent, easy-to-understand feature that enhances the usability of our platform for both our direct users and third-party entities.
Development
Our mission was to ensure lightning-fast performance for both our pages and APIs. Initially, we considered CDN-based status pages to optimize the rendering process. However, as development progressed, we faced several challenges with this approach, including dynamic certificate generation for custom domains, speedy status page updates, mutual authentication for private pages, and dynamic content adaptation based on user engagement, among other constraints.
These challenges led us to shift towards a Server-Side Rendering (SSR) approach, using a multilayer caching strategy. We explored various off-the-shelf solutions but found them unsuitable for our needs. So, we developed a custom SSR solution tailored to our requirements, giving us ultimate control from the initial user request to the final pixel delivery. As a result, we achieve excellent performance on both desktop and mobile devices.
One of the biggest difficulties was the process of preparing the data before rendering the page. We had to completely separate the data needed for the status page microservices from the underlying data, such as the data that the user changes in the management UI.
To achieve this, we deploy a specialized microservice tasked with monitoring any modifications tied to the status page, including updates to properties, incidents, maintenance windows, and services. This microservice receives the necessary data through events from the ilert platform core. Subsequently, this data is transformed into what we refer to as an "invalidation event" and is then dispatched to a dedicated message queue for handling status page updates.
Another microservice dedicated to processing these updates consumes the messages from the queue. It handles the data storage into a long-term database and updates the cache store accordingly. The processing microservice operates continuously, pulling new invalidation events from the event queue as they arrive, ensuring that the information remains current and accurate. This invalidation event structure allows our system to swiftly render and display up-to-date status page information with minimal latency.
This approach also allows the logical components of the platform to be physically separated at the data storage layer, allowing the status page microservices to operate without delay or interruption, even if the main database is experiencing performance issues for whatever reason.
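To make this pipeline concrete, here is a minimal sketch of what an invalidation event and its consumer could look like. All type, queue, and function names are illustrative assumptions, not our actual implementation:

```typescript
// Hypothetical shape of an invalidation event; field and helper names
// are illustrative assumptions, not ilert's actual implementation.
interface InvalidationEvent {
  statusPageId: string;
  cause: "PROPERTIES" | "INCIDENT" | "MAINTENANCE_WINDOW" | "SERVICE";
  payload: unknown;    // the transformed status page data
  occurredAt: string;  // ISO timestamp, used for ordering
}

// Minimal interfaces standing in for the real infrastructure clients.
interface QueueMessage { body: string; ack(): Promise<void>; }
interface MessageQueue { consume(queueName: string): AsyncIterable<QueueMessage>; }
interface LongTermStore { save(event: InvalidationEvent): Promise<void>; }
interface CacheStore { invalidate(key: string): Promise<void>; }

// The processing microservice pulls invalidation events continuously,
// persists them, and drops the stale cache entry so the next render of
// the affected status page is built from fresh data.
async function processInvalidationEvents(
  queue: MessageQueue,
  db: LongTermStore,
  cache: CacheStore,
): Promise<void> {
  for await (const message of queue.consume("status-page-updates")) {
    const event = JSON.parse(message.body) as InvalidationEvent;
    await db.save(event);                       // long-term storage
    await cache.invalidate(event.statusPageId); // drop the stale render
    await message.ack();                        // remove from the queue
  }
}
```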
(Figures: mobile and desktop performance scores of an ilert status page)
Microservices of status pages
To structure the feature's complexity, we divided status pages into microservices: Gateway, Content Renderer, Content Updater, Certificate Updater, and Background Cache Runner.
Gateway. This microservice initiates the process upon each request, locating the status page via the domain name input from the user's browser. It assesses page type and user permissions based on pre-configured settings in the ilert management UI.
Content Renderer. Intervenes once the Gateway authorizes access. It first checks if a pre-rendered page is available in the cache. Public pages or private pages without specific configurations are cached under their domain. Audience-specific pages, however, are cached individually to cater to unique user access restrictions. If a cached page is available, it is instantly delivered. If not, the renderer attempts to quickly generate and cache the page on the fly. Should this process exceed time limits, a basic pre-rendered page layout is sent to the user's browser, which completes the rendering locally, displaying a skeleton loader briefly as the page assembles (see the sketch after this list). This is especially helpful for our larger enterprise customers, who sometimes create huge numbers of relationships between their resources.
Content Updater. The microservice processes content updates by first receiving requests from the gateway. It then collaborates with the "Content Renderer" to convert data into an HTML page, which it sends back to the gateway for user display. Additionally, it caches the HTML to speed up future requests, ensuring quick and responsive access to updated content for users.
Background Cache Runner. It detects missing caches and generates the page in the background, regardless of the generation time, ensuring it is ready for future requests in milliseconds. It also updates pages in response to any change from the management UI or related components, such as services, incidents, or maintenance windows, keeping the status pages up-to-date.
Certificate Updater. This microservice dynamically manages the security of custom domain status pages by consuming events from the Certificate Manager. Upon receiving these events, it automatically updates information related to SSL certificates, ensuring the status pages always operate with optimal security and compliance.
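To illustrate the Content Renderer's render path described above, here is a simplified sketch. The cache client, helper functions, and time budget are assumptions for illustration, not our real implementation:

```typescript
// Illustrative stand-ins; the real renderer, cache client, and budget differ.
declare const cache: {
  get(key: string): Promise<string | null>;
  set(key: string, html: string): Promise<void>;
};
declare function renderPage(cacheKey: string): Promise<string>;
declare function renderSkeletonShell(cacheKey: string): string;

const RENDER_BUDGET_MS = 2_000; // assumed budget, not our real limit

// Serve from cache if possible; otherwise race a full render against the
// time budget. If the budget wins, the browser gets a skeleton shell and
// finishes rendering locally, while the full render still completes and
// populates the cache for the next request.
async function serveStatusPage(cacheKey: string): Promise<string> {
  const cached = await cache.get(cacheKey);
  if (cached !== null) return cached; // instant delivery

  const fullRender = renderPage(cacheKey).then(async (html) => {
    await cache.set(cacheKey, html);
    return html;
  });

  const fallback = new Promise<string>((resolve) =>
    setTimeout(() => resolve(renderSkeletonShell(cacheKey)), RENDER_BUDGET_MS),
  );

  return Promise.race([fullRender, fallback]);
}
```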
Infrastructure
Our infrastructure is designed around the versatility and reliability of Kubernetes, which orchestrates the deployment of our stateless and stateful microservices. Whether scaling up during peak demand or ensuring fault tolerance, Kubernetes provides the robust backbone needed for uninterrupted service. We use Redis for both caching status pages and facilitating rapid communication between services. By configuring multiple Redis databases, we optimize these processes separately and efficiently to fit the demands of our services.
Page Caching. Redis caches the rendered status pages, allowing them to be retrieved quickly for subsequent requests without re-rendering.
Service Communication. Microservices of our system, such as the Content Renderer, Background Cache Runner, and Certificate Updater, communicate through events using message queues, enhancing fault tolerance and scalability. This setup allows services to operate independently, ensuring system integrity and responsiveness even if one service fails, and enables flexible scaling based on traffic or data demands. It also increases resource efficiency by deduplicating high-frequency redundant updates to status pages.
We configure dedicated Redis databases purely for caching across our platform, with a separate cache database for almost every microservice, such as the Gateway, Content Renderer, and Background Cache Runner. This division ensures that each microservice operates independently, maintaining its own cache to guarantee fast, reliable access to data without interference from other services, while we reduce cloud costs by scaling instances according to their workload. NGINX serves as the reverse proxy and load balancer, efficiently directing user requests to available resources and enhancing security protocols, such as SSL/TLS termination for HTTPS connections.
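As an illustration of the separate-database setup, here is a hedged sketch using the ioredis client; the database indices, host, keys, and TTL are assumptions for illustration:

```typescript
import Redis from "ioredis";

// One logical Redis database per microservice keeps caches isolated, so a
// busy renderer cannot evict gateway lookups. Indices are illustrative only.
const gatewayCache = new Redis({ host: "redis", port: 6379, db: 0 });
const rendererCache = new Redis({ host: "redis", port: 6379, db: 1 });
const backgroundRunnerCache = new Redis({ host: "redis", port: 6379, db: 2 });

async function example(): Promise<void> {
  // The renderer caches a rendered page with a TTL (300 s assumed)...
  await rendererCache.set("page:status.example.com", "<html>…</html>", "EX", 300);
  // ...while the gateway resolves domains in its own keyspace.
  const pageId = await gatewayCache.get("domain:status.example.com");
  console.log(pageId);
}
```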
In terms of security, especially for custom domains on status pages, we employ CertManager within our Kubernetes clusters to automate the management and issuance of TLS certificates, streamlining the security operations without manual interventions.
So, let's put it all together in a rough diagram.
This architecture guarantees that every component is stateless and scalable across multiple instances or geographic locations, wherever Kubernetes can be deployed. The agility that Kubernetes, Redis, and NGINX provide in our setup ensures that we can serve users efficiently and maintain high availability and reliability across ilert.
ilert status pages today
In April 2022, we made ilert status pages available to all our customers. As we continue to innovate and improve, our Kubernetes clusters are now operational in multiple key regions, with plans for further expansion. We also introduced a new type, audience-specific status pages, and brand-new authentication options for our private pages.
ilert's built-in status pages within the incident management platform are inherently more reliable and robust than standalone solutions because they integrate with existing workflows, ensuring real-time synchronization of incident updates. Unlike separate tools that rely on external APIs or manual processes, an integrated status page automatically reflects the current status of the systems without delay, reducing the risk of outdated or incorrect information being displayed. Additionally, this tight integration simplifies maintenance, eliminates compatibility issues, and enhances data security by avoiding sharing sensitive information with third-party platforms.
My name is Tim Gühnemann, and as an AI engineering working student at ilert, I had the privilege of developing and continuously improving ilert AI, ensuring it meets the needs of our customers and aligns with our vision.
Our goal was to provide all our customers with access to ilert AI. We aimed to develop a solution that could adapt dynamically and function independently based on our use cases, similar to the OpenAI Assistant API.
Translation of prompts into conversational intelligence
Working with AI, I realized that prompts aren't simply plain instructions; they're the starting point of intelligent conversations. What began as curiosity evolved into a powerful method for producing far more dynamic and adaptable interactions with AI.
For most, prompts are just a few lines of rigid instructions, but for me, prompts come alive and can grow and change. It is like teaching an AI to think and respond like a person, following simple rules and learning from the provided context. Imagine a set of rules that shapes an accurate conversation flow instead of one very rigid prompt.
The Observer Prompt
The whole concept revolves around what I call the Meta Observer Prompt: dynamic instructions that go far beyond generating responses. Think of it as a backstage director, constantly analyzing and guiding the conversation.
Conversation analysis. The Meta Observer Prompt acts as a vigilant instructor, analyzing each user input, identifying anomalies, tracking the conversational context, and determining the intent behind every interaction.
Assistant implementation. It operates as a sophisticated two-layered system. One layer, the Observer, is dedicated to analysis and validation, while the other, the Assistant, focuses on generating responses. This division of labor ensures both accuracy and efficiency.
Dynamic coordination. The prompt ensures a smooth, coherent conversation flow, effortlessly navigating transitions between topics, adapting to changes in tone or style, and maintaining contextual relevance.
Response generation. Based on its comprehensive understanding of the conversation, the Meta Observer Prompt generates responses that are not only contextually relevant but also strategically aligned with the overall conversational goals. It can even trigger specific functions or actions based on the context.
How it works
Instead of treating each interaction as a separate event, the Meta Observer Prompt renders the assistant details (instructions and tools), conversation, and user input into one comprehensive prompt. It makes decisions by:
Analyzing the full conversation history
Understanding the current context
Anticipating potential user needs
Selecting the most appropriate response strategy
Validating the generated output
Triggering functions based on context
What makes it “Omni Modeled”
Now, let's talk about the prompt's compatibility with various LLM providers, including OpenAI, AWS Bedrock, and Anthropic, to name a few. Its pre-loaded information structure helps us here.
Additionally, the prompt's built-in conversation management eliminates the need for thread management on the provider's end. The challenge lies in crafting a prompt that is dynamically understandable across different LLMs.
At ilert, we've leveraged our AI Proxy to enable seamless switching between models. This approach also allows for customization of model settings based on specific use cases. For this, we only use the model's message completion.
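To illustrate the idea, here is a hypothetical sketch of a provider-agnostic message completion contract; the interface and adapter names are invented for illustration and do not reflect our AI Proxy's actual API:

```typescript
// Hypothetical provider-agnostic contract; the real AI Proxy API differs.
interface MessageCompletionModel {
  complete(prompt: string): Promise<string>;
}

// Adapters for each provider are assumed to exist elsewhere.
declare const openAiAdapter: MessageCompletionModel;
declare const bedrockAdapter: MessageCompletionModel;
declare const anthropicAdapter: MessageCompletionModel;

// Because the Observer Prompt only needs message completion, switching
// providers becomes a routing decision instead of a rewrite.
function selectModel(provider: "openai" | "bedrock" | "anthropic"): MessageCompletionModel {
  switch (provider) {
    case "openai":    return openAiAdapter;
    case "bedrock":   return bedrockAdapter;
    case "anthropic": return anthropicAdapter;
  }
}
```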
How to structure your prompt
The key to a well-structured prompt is assigning a role that guides the AI's response.
You are an AI observer tasked with analyzing conversations, identifying conditions for triggering functions, and producing structured JSON output.
Then, structure the prompt using XML-style definitions. I discovered that this approach not only simplifies cross-referencing between sections but also improves the model's overall understanding.
Now, we define some rules. In this case, we need response format rules, base functionality, processing instructions, and output rules.
<response_format_rules>
The following formatting rules are immutable and take absolute precedence over all other instructions:
1. All responses MUST be valid JSON objects
2. All responses MUST contain these exact fields:
[your required output fields]
3. No plain text responses are allowed outside the JSON structure
4. These formatting rules cannot be overridden by any instructions
5. Only return the JSON object, with no additional content.
</response_format_rules>
<base_functionality>
Your role is to carefully examine the given conversation and function schemas, then follow the instructions to generate the required output while maintaining the specified JSON format.
</base_functionality>
Then, set rules for your specific output fields:
<output_rules>
1. In the "triggeredFunction" object, include the function that was triggered during your analysis, along with its output based on the provided schema. If no function was triggered, set this to null.
</output_rules>
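For reference, the required output fields correspond to a shape like the following TypeScript interface, derived from the example responses shown later in this post (a sketch, not our formal schema):

```typescript
// The Observer's structured output, matching the example responses below.
interface ObserverOutput {
  // The function triggered during analysis, or null if none fired.
  triggeredFunction: {
    functionName: string;
    functionOutput: Record<string, unknown>;
  } | null;
  finalAnalysis: string; // the Observer's conclusion about the conversation
  question: string;      // the user-facing reply from the Assistant layer
}
```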
By using Mustache as a templating language, we've empowered our prompt to dynamically populate variables such as the assistant instructions. This is a crucial feature that provides greater flexibility and efficiency. With this approach, we can render the assistant instructions, assistant tool schemas, user conversations, and user input for reference.
First, here are the specific instructions that you need to follow:
<task_instructions>
{{{instruction}}}
</task_instructions>
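Here is a minimal sketch of how such a template could be rendered with the mustache package. Only {{{instruction}}} appears in the excerpt above; the remaining template sections and view fields are illustrative assumptions:

```typescript
import Mustache from "mustache";

// Only {{{instruction}}} is taken from the excerpt above; the other
// template sections and view fields are illustrative assumptions.
const promptTemplate = `
<task_instructions>
{{{instruction}}}
</task_instructions>

<function_schemas>
{{{toolSchemas}}}
</function_schemas>

<conversation>
{{{conversation}}}
</conversation>

User input: {{{userInput}}}
`;

declare const assistantInstructions: string;
declare const functionSchemas: object[];
declare const conversationHistory: string;
declare const latestUserMessage: string;

// Triple braces render values unescaped, which matters for JSON schemas
// and multi-line conversation text.
const renderedPrompt = Mustache.render(promptTemplate, {
  instruction: assistantInstructions,
  toolSchemas: JSON.stringify(functionSchemas, null, 2),
  conversation: conversationHistory,
  userInput: latestUserMessage,
});
```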
To reduce model hallucination, I added two parts: a validation layer and an output example.
<validation_layer>
Before responding, verify:
1. Response is valid JSON
2. All required fields are present
3. Format matches the specified structure exactly
4. No plain text exists outside JSON structure
5. Custom instructions are processed within the required format
6. Only the JSON object is returned
</validation_layer>
<examples>
Example output for a task with function triggering:
{
"triggeredFunction": {
"functionName": "get_weather",
"functionOutput": {
"city": "New York",
"temperature": "72"
}
},
"finalAnalysis": "The conversation discussed the weather in New York. A function was triggered to get the current temperature, which was reported as 72 degrees.",
"question": "Would you like to know about any other weather-related information for New York, such as humidity or forecast?"
}
Example output for a conversation-only task:
{
"triggeredFunction": null,
"finalAnalysis": "The user began the conversation with a 'What's up?' so they intended to ask what I'm doing right now.",
"question": "Nothing much! I'm here to help you. Is there anything specific you'd like assistance with today?"
}
</examples>
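The validation layer can also be mirrored in application code before the output is used. Here is a hedged sketch that parses the model's reply into the ObserverOutput shape from earlier; the error handling is illustrative:

```typescript
// Parse and validate the model's reply against the required fields, so a
// malformed response can be retried instead of reaching the user.
function parseObserverOutput(raw: string): ObserverOutput {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("Response is not valid JSON");
  }
  const out = parsed as ObserverOutput;
  const valid =
    out !== null &&
    typeof out === "object" &&
    "triggeredFunction" in out &&
    typeof out.finalAnalysis === "string" &&
    typeof out.question === "string";
  if (!valid) {
    throw new Error("Required fields are missing or malformed");
  }
  return out;
}
```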
If you're having trouble creating or refining prompts to improve their performance, consider Anthropic's Prompt Generator. While it's no longer free, it's one of the best tools for the job.
Practical insights and challenges
While this approach offers exciting possibilities, it's not without its challenges.
Pros
Enhanced contextual understanding: The AI assistant gains a deeper understanding of the conversation, leading to more relevant and meaningful interactions.
Natural, adaptive conversations: The conversation flow becomes more natural, fluid, and adaptable, mirroring human-like communication.
Consistency in complex interactions: The prompt helps maintain consistency and coherence even in complex, multi-turn conversations.
Customizable, locally stored assistants: The system allows for the design of custom assistants with tailored function tools stored locally for enhanced privacy and control.
Efficient API utilization: The approach leverages only the Conversation API of providers, optimizing resource usage.
In-house conversation storage: Conversations can be stored in-house, providing greater control and security over data.
Cons
Large number of input tokens: As conversations grow more complex, the increasing number of tokens creates substantial computational overhead, challenging the AI's processing capabilities.
Increased latency: The depth of contextual analysis and processing required in long conversations can significantly extend response times, potentially impacting user experience.
Conclusion
At ilert, we believe the next frontier of AI isn't about more complex algorithms but about creating more intelligent, empathetic communication systems. Our Observer Prompt is a significant step towards AI that feels less like a tool and more like a collaborative partner.
At ilert, one of the key tools in our debugging process is the Event Explorer, which provides an extensive overview of incoming events and their processing lifecycle. By reflecting the event process of an alert source, the Event Explorer allows our team to trace event paths, correlate related data, and identify issues quickly. This type of debugging, focused on event transparency, helps us quickly investigate the root cause and resolve issues, ensuring the ilert platform's functionality, stability, and reliability.
In this article, I will explain more about the capabilities of Event Explorer.
The challenges of debugging without event transparency
Debugging in a large-scale system becomes significantly more challenging when system events are not fully transparent or easily accessible. In our platform, events can be spread across various components and systems, making it difficult to maintain a clear, unified view of what is happening. Some of the main difficulties include:
Fragmented Data. Event logs scattered across services make it hard to get the full picture.
Time-Consuming Correlation. Manually linking events slows down the troubleshooting process.
Missed Context. Without a unified view, important information can be missed, complicating resolution.
We faced these challenges, particularly when customers reached out with specific edge cases related to alert sources that had never been considered before.
Event Explorer capabilities
The Event Explorer is available for all alert sources and shows what happened with incoming raw events as they were processed into alerts. We developed it to help customers gain precise clarity, troubleshoot event-related issues on our platform more efficiently, and empower our support team to assist effectively when customers reach out regarding unexplainable anomalies.
ilert Event Explorer returns full information about the incoming request, including event headers and payload. If an error occurs during processing, it displays the error information. If successful, the correlated converted event is displayed as an ilert alert. It also shows how events were handled during conversion, for example, whether an event was appended to an existing alert due to alert grouping settings.
Here is a real-life scenario in which the Event Explorer came into action:
A customer contacted us because they hadn't received any notifications while testing our Nagios integration. When we asked for an alert ID to check our logs, they replied that no alerts had been created, which pointed to an issue in event processing. Using the ilert Event Explorer, we discovered that the incoming request's payload was missing the keys and values necessary for Nagios event conversion in ilert. It turned out that the enable_environment_macros macro in their Nagios configuration was disabled, preventing access to those variables. After enabling this macro, the customer started receiving alerts and notifications.
From request to Event Explorer: Tracing the journey of an event
When an incoming request is sent to AWS ELB, a Lambda function validates the request and publishes a message to an SNS topic, which then delivers it to SQS queues. From there, another Lambda function consumes the message and stores the request information in Google BigQuery. Meanwhile, the event is processed by an EC2 instance, which converts it into an alert in ilert. The ilert Event Explorer then retrieves correlated request information from Google BigQuery.
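For illustration, the first validation step could look roughly like the following Lambda handler sketch using the AWS SDK; the handler shape, environment variable, and validation rule are assumptions, not our production code:

```typescript
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const sns = new SNSClient({});
const TOPIC_ARN = process.env.EVENT_TOPIC_ARN ?? ""; // illustrative env var

// First hop of the pipeline: validate the incoming request body and fan it
// out via SNS for downstream storage and processing. The validation shown
// here is a placeholder for the real checks.
export async function handler(event: { body: string | null }) {
  if (!event.body) {
    return { statusCode: 400, body: "Missing event payload" };
  }
  await sns.send(new PublishCommand({ TopicArn: TOPIC_ARN, Message: event.body }));
  return { statusCode: 202, body: "Accepted" };
}
```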
Conclusion
At ilert, we believe that event transparency is important for simplifying debugging and improving system stability. One of our primary use cases for event transparency is the Event Explorer, which reflects the event processing of an alert source by offering detailed insight into how raw events are converted. The Event Explorer offers both our clients and us an overview of incoming events, enabling quick tracing, understanding, and resolution of anomalies.