How We Shipped the Best Status Page Solution for Any Incident Management Scale
This blog post explains how ilert status pages work, the challenges we encountered while developing the feature, and how we solved them.
Backstory: Why we introduced status pages
ilert has long been a trusted platform for critical notifications, alerting, and escalation processes. In early 2021, we identified the need to broaden the scope of our offerings to better serve our customers’ needs. This led to a significant refinement in our approach: the separation of notifications into alerts and incidents.
Alerts are critical notifications aimed primarily at development and support teams. They relate to issues such as server anomalies, which may or may not directly impact the overall performance of the client’s systems. On the flip side, incidents signify more severe problems affecting the client's systems, often escalating to affect their end-users. This article has a detailed breakdown of how we arrived at this decision.
By categorizing problems as either alerts or incidents, we were able to tailor our response strategies to each. This differentiation also pointed logically toward status pages as a new core communication tool during incidents, giving all stakeholders transparent, up-to-date information.
In mid-2021, we began envisioning the future of status pages. Two principles guided their development: flawless technical execution and seamless native integration with our existing platform. The latter was the most challenging. Back then, there was no industry experience (and there is not much even now) in combining an incident management platform and status pages into a natively integrated solution. The most well-known solutions on the market, such as Atlassian Statuspage and Opsgenie, are separate products. We were aiming to combine the two as if they were one. And, of course, we wanted status pages to be a transparent, easy-to-understand feature that enhances the usability of our platform for both our direct users and third parties.
Development
Our mission was to ensure lightning-fast performance for both our pages and APIs. Initially, we considered CDN-based static pages to optimize the rendering process. However, as development progressed, we faced several challenges with this approach, including dynamic certificate generation for custom domains, speedy status page updates, mutual authentication for private pages, and dynamic content adaptation based on user engagement, among other constraints.
These challenges led us to shift towards a Server-Side Rendering (SSR) approach with a multilayer caching strategy. We explored various off-the-shelf solutions but found them unsuitable for our needs. So, we developed a custom SSR solution tailored to our requirements, giving us full control from the initial user request to the final pixel delivered. As a result, the pages perform well on both desktop and mobile devices.
One of the biggest difficulties was the process of preparing the data before rendering the page. We had to completely separate the data needed for the status page microservices from the underlying data, such as the data that the user changes in the management UI.
To achieve this, we deploy a specialized microservice tasked with monitoring any modifications tied to the status page, including updates to properties, incidents, maintenance windows, and services. This microservice receives the necessary data through events from the ilert platform core. Subsequently, this data is transformed into what we refer to as an "invalidation event" and is then dispatched to a dedicated message queue for handling status page updates.
Another microservice dedicated to processing these updates consumes the messages from the queue. It handles the data storage into a long-term database and updates the cache store accordingly. The processing microservice operates continuously, pulling new invalidation events from the event queue as they arrive, ensuring that the information remains current and accurate. This invalidation event structure allows our system to swiftly render and display up-to-date status page information with minimal latency.
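To make this flow more concrete, here is a minimal Python sketch of the two steps described above: turning a core platform event into invalidation events, and a processor that drains the queue into a long-term store and a cache. It is an illustration under assumptions, not our production code: the event fields (`type`, `affected_page_ids`, `data`), the `InvalidationEvent` class, and the plain dictionaries standing in for the database and cache are all hypothetical.

```python
import queue
from dataclasses import dataclass


@dataclass
class InvalidationEvent:
    page_id: str   # hypothetical identifier of the affected status page
    kind: str      # what changed, e.g. "incident" or "maintenance_window"
    payload: dict  # the data the status page needs after the change


def to_invalidation_events(core_event: dict) -> list:
    """Turn one core platform event into one invalidation event per affected page."""
    return [
        InvalidationEvent(page_id, core_event["type"], core_event["data"])
        for page_id in core_event["affected_page_ids"]
    ]


def process_pending(events: queue.Queue, database: dict, cache: dict) -> int:
    """Drain the queue: persist each update, then refresh the cached copy."""
    processed = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return processed
        database[event.page_id] = event.payload  # stand-in for the long-term database
        cache[event.page_id] = event.payload     # stand-in for the cache store
        processed += 1


# Usage: one incident that affects two status pages.
events = queue.Queue()
core_event = {
    "type": "incident",
    "affected_page_ids": ["page-1", "page-2"],
    "data": {"status": "degraded"},
}
for invalidation in to_invalidation_events(core_event):
    events.put(invalidation)

database, cache = {}, {}
handled = process_pending(events, database, cache)
```

In a real deployment the processor would block on the queue instead of returning when it is empty; the drain-and-return form is used here only to keep the sketch easy to run.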
This approach also physically separates the logical components of the platform at the data storage layer, so the status page microservices can operate without delay or interruption even if the main database is experiencing performance issues.
[Figures: mobile performance and desktop performance results]
Microservices of status pages
To manage the feature's complexity, we divided status pages into microservices: Gateway, Content Renderer, Content Updater, Certificate Updater, and Background Cache Runner.
Gateway. This microservice initiates the process upon each request, locating the status page via the domain name input from the user's browser. It assesses page type and user permissions based on pre-configured settings in the ilert management UI.
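As a rough illustration of this lookup-then-authorize step, the following Python sketch resolves a page from the request's host name and returns an access decision. Everything in it is an assumption made for the example: the `PageType` values, the `allowed_users` field, and the decision strings are hypothetical and do not describe the actual Gateway.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class PageType(Enum):
    PUBLIC = "public"
    PRIVATE = "private"


@dataclass
class StatusPage:
    domain: str
    page_type: PageType
    allowed_users: set = field(default_factory=set)  # hypothetical permission model


class Gateway:
    def __init__(self, pages):
        # Index pages by lowercase domain so lookups are case-insensitive.
        self._by_domain = {page.domain.lower(): page for page in pages}

    def resolve(self, host_header: str) -> Optional[StatusPage]:
        """Find the status page for the host name sent by the browser."""
        host = host_header.split(":")[0].strip().lower()  # drop an optional port
        return self._by_domain.get(host)

    def authorize(self, host_header: str, user: Optional[str] = None) -> str:
        """Return an access decision based on page type and user permissions."""
        page = self.resolve(host_header)
        if page is None:
            return "not_found"
        if page.page_type is PageType.PUBLIC:
            return "allow"
        if user is None:
            return "login_required"
        return "allow" if user in page.allowed_users else "deny"


gateway = Gateway([
    StatusPage("status.example.com", PageType.PUBLIC),
    StatusPage("internal.example.com", PageType.PRIVATE, {"alice"}),
])
```

With this setup, `gateway.authorize("internal.example.com", "alice")` allows the request, while the same call without a user asks for a login first.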
Content Renderer. This microservice steps in once the Gateway authorizes access. It first checks whether a pre-rendered page is available in the cache. Public pages and private pages without specific configurations are cached under their domain. Audience-specific pages, however, are cached individually to respect each audience's access restrictions. If a cached page is available, it is delivered instantly. If not, the renderer attempts to generate and cache the page on the fly. Should this exceed the time limit, a basic pre-rendered page layout is sent to the user’s browser, which completes the rendering locally and briefly displays a skeleton loader while the page assembles. This is especially helpful for our larger enterprise customers, who sometimes need to create a huge number of relationships between their resources.
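The cache-first flow with a time-limited fallback can be sketched in Python as follows. This is a simplified illustration, not the real renderer: the cache key format, the in-memory dictionary used as a cache, the thread pool, and the `SKELETON_HTML` placeholder are all assumptions chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as RenderTimeout
from typing import Optional

# Hypothetical minimal layout that the browser would complete on its own.
SKELETON_HTML = "<html><body data-skeleton='true'>Loading...</body></html>"


def cache_key(domain: str, audience_id: Optional[str] = None) -> str:
    """Shared pages are keyed by domain; audience-specific pages get their own key."""
    if audience_id is None:
        return f"page:{domain}"
    return f"page:{domain}:audience:{audience_id}"


class ContentRenderer:
    def __init__(self, render, time_limit_s: float):
        self._render = render              # function (domain, audience_id) -> HTML
        self._time_limit_s = time_limit_s
        self._cache = {}                   # in-memory stand-in for the page cache
        self._pool = ThreadPoolExecutor(max_workers=4)

    def serve(self, domain: str, audience_id: Optional[str] = None) -> str:
        key = cache_key(domain, audience_id)
        cached = self._cache.get(key)
        if cached is not None:
            return cached                  # cache hit: deliver instantly
        future = self._pool.submit(self._render_and_cache, key, domain, audience_id)
        try:
            return future.result(timeout=self._time_limit_s)
        except RenderTimeout:
            # Rendering keeps running in the pool and will fill the cache later.
            return SKELETON_HTML

    def _render_and_cache(self, key, domain, audience_id):
        html = self._render(domain, audience_id)
        self._cache[key] = html
        return html


def slow_render(domain, audience_id):
    time.sleep(0.2)                        # simulate a page that is slow to build
    return f"<html>{domain}</html>"


renderer = ContentRenderer(slow_render, time_limit_s=0.05)
first = renderer.serve("status.example.com")   # time limit exceeded: skeleton
time.sleep(0.5)                                # rendering finishes in the background
second = renderer.serve("status.example.com")  # now served from the cache
```

The design choice worth noting is that a timeout does not cancel the render: the work already started is allowed to finish, so the next visitor gets the full page from the cache.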
Content Updater. This microservice processes content updates by first receiving requests from the Gateway. It then collaborates with the Content Renderer to convert the data into an HTML page, which it sends back to the Gateway for display to the user. It also caches the HTML to speed up future requests, keeping access to updated content quick and responsive.
Background Cache Runner. When a cached page is missing, this microservice is signaled to generate the page in the background, however long generation takes, so that the page is ready to be served in milliseconds on future requests. It also regenerates pages in response to any change made in the management UI or in related components such as services, incidents, or maintenance windows, keeping the status pages up to date.
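A background worker of this kind could look roughly like the Python sketch below. It is a toy model under stated assumptions: a single worker thread, an in-process queue, and the method names `signal_missing` and `signal_changed` are hypothetical and only mirror the two triggers described above.

```python
import queue
import threading


class BackgroundCacheRunner:
    def __init__(self, render, cache: dict):
        self._render = render   # function domain -> HTML, may be slow
        self._cache = cache
        self._jobs = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def signal_missing(self, domain: str) -> None:
        """A request found no cached page for this domain."""
        self._jobs.put(domain)

    def signal_changed(self, domain: str) -> None:
        """The page or a related service, incident, or maintenance window changed."""
        self._jobs.put(domain)

    def wait_until_idle(self) -> None:
        """Block until every queued job has been handled (useful in tests)."""
        self._jobs.join()

    def _run(self) -> None:
        while True:
            domain = self._jobs.get()
            try:
                # No time limit here: the page is generated however long it takes.
                self._cache[domain] = self._render(domain)
            except Exception:
                pass  # keep the previously cached page if rendering fails
            finally:
                self._jobs.task_done()


# Usage: fill a missing cache entry, then refresh it after a change.
cache = {}
versions = {"status.example.com": 1}
runner = BackgroundCacheRunner(lambda d: f"{d} v{versions[d]}", cache)

runner.signal_missing("status.example.com")
runner.wait_until_idle()
after_miss = cache["status.example.com"]

versions["status.example.com"] = 2
runner.signal_changed("status.example.com")
runner.wait_until_idle()
after_change = cache["status.example.com"]
```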
Certificate Updater. This microservice manages the security of status pages on custom domains by consuming events from the Certificate Manager. Upon receiving these events, it automatically updates the stored SSL/TLS certificate information, ensuring that these status pages keep operating securely.
Infrastructure
Our infrastructure is designed around the versatility and reliability of Kubernetes, which orchestrates the deployment of our stateless and stateful microservices. Whether scaling up during peak demands or ensuring fault tolerance, Kubernetes provides the robust backbone needed for uninterrupted service. We use Redis for both caching status pages and facilitating rapid communication between services. By configuring multiple Redis databases, we optimize these processes separately and efficiently to fit the demand of our services.
Page Caching. Redis caches the rendered status pages, allowing them to be retrieved quickly for subsequent requests without re-rendering.
Service Communication. The microservices in our system, such as the Content Renderer, Background Cache Runner, and Certificate Updater, communicate through events using message queues, which improves fault tolerance and scalability. This setup allows services to operate independently, preserving system integrity and responsiveness even if one service fails, and enables flexible scaling based on traffic or data demands. It also increases resource efficiency by deduplicating high-frequency, redundant updates to status pages.
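One simple way to picture the deduplication of redundant updates is a queue that keeps only the latest pending update per status page. The Python sketch below is an assumption about how such coalescing could work, not a description of our message queue; the `CoalescingQueue` class and its methods are hypothetical.

```python
class CoalescingQueue:
    """Keeps at most one pending update per status page."""

    def __init__(self):
        self._pending = {}  # page_id -> latest payload, in first-seen order

    def publish(self, page_id: str, payload: dict) -> None:
        # A newer update for the same page replaces the one still waiting,
        # so a burst of changes results in a single re-render.
        self._pending[page_id] = payload

    def drain(self) -> list:
        """Return all pending updates and clear the queue."""
        items = list(self._pending.items())
        self._pending.clear()
        return items


# Usage: three rapid updates to page-1 collapse into one.
updates = CoalescingQueue()
updates.publish("page-1", {"rev": 1})
updates.publish("page-2", {"rev": 1})
updates.publish("page-1", {"rev": 2})
updates.publish("page-1", {"rev": 3})
batch = updates.drain()
```

Here `batch` holds two entries rather than four, and the entry for `page-1` carries only the most recent revision.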
We utilize Redis exclusively for caching purposes across our platform, configuring separate cache databases for almost every microservice, such as the Gateway, Content Renderer, and Background Cache Runner. This division ensures that each microservice operates independently and maintains its own cache, giving it fast, reliable access to data without interference from other services. It also lets us reduce cloud costs by scaling instances according to their workload. NGINX serves as the reverse proxy and load balancer, efficiently directing user requests to available resources and handling security tasks such as SSL/TLS termination for HTTPS connections.
In terms of security, especially for custom domains on status pages, we employ cert-manager within our Kubernetes clusters to automate the issuance and management of TLS certificates, removing manual intervention from these security operations.
So, let's put it all together in a rough diagram.
This architecture guarantees that every component is stateless and scalable across multiple instances or geographic locations, wherever Kubernetes can be deployed. The agility that Kubernetes, Redis, and NGINX provide in our setup ensures that we can serve users efficiently and maintain high availability and reliability across ilert.
ilert status pages today
In April 2022, we made ilert status pages available to all our customers. As we continue to innovate and improve, our Kubernetes clusters are now operational in multiple key regions, with plans for further expansion. We also introduced a new page type, audience-specific status pages, and brand-new authentication options for our private pages.
ilert's built-in status pages within the incident management platform are inherently more reliable and robust than standalone solutions because they integrate with existing workflows, ensuring real-time synchronization of incident updates. Unlike separate tools that rely on external APIs or manual processes, an integrated status page automatically reflects the current status of the systems without delay, reducing the risk of displaying outdated or incorrect information. This tight integration also simplifies maintenance, eliminates compatibility issues, and enhances data security because sensitive information is not shared with third-party platforms.