How to Keep Observability Alive in Microservice Landscapes through OpenTelemetry
Beyond its traditional scope of logging, monitoring, and tracing, observability can be defined through the lens of incident response efficiency: specifically, the time it takes for teams to grasp the full context and background of a technical incident.
In an ideal world, every alert would signal a unique and critical issue. However, in reality, alerts often come in waves. Alert noise refers to the overwhelming volume of notifications that incident response teams receive, many of which may be redundant or irrelevant. This can lead to alert fatigue, where critical issues might be overlooked due to the sheer number of notifications.
Reducing alert noise can help your team:
1. Stay focused: With similar alerts grouped together, teams can concentrate on resolving incidents instead of sifting through noise.
2. Work efficiently: Fewer, more relevant alerts lead to quicker decision-making and faster incident resolution.
3. Reduce stress: A more manageable flow of alerts lowers the risk of alert fatigue.
Alert Deduplication and Processing
Excessive alert noise, caused by multiple similar notifications, can overwhelm incident response teams. Rather than bombarding teams with notifications for every problem, deduplication merges these alerts into a single, actionable item. This process relies on the semantic similarity of events, meaning that it groups alerts that convey the same meaning, even if they differ in wording. ilert employs AI-driven techniques to compare alerts, merging those that are similar.
Understanding Embedding Models
Embedding models are the backbone of AI-driven alert deduplication. These models translate human language into numerical representations, or vectors, that capture the meaning of the text. By leveraging these vectors, systems can effectively compare and group related alerts, enabling more precise and meaningful deduplication that cuts through the noise.
Vector embeddings are mathematical representations of data in a high-dimensional space, where each piece of data (a word, a sentence, or a whole document) is represented as a point in this space. The strength of embeddings lies in their ability to position similar items close to each other, making it easier to identify and group related data. For example, an embedding model can transform complex text, like an alert message, into a vector, enabling the system to group and deduplicate alerts that convey the same information.
Implementing Alert Deduplication
1) Preprocessing Alerts
The first step in deduplication is preprocessing. This involves normalizing the format of incoming alerts and cleaning the data by removing irrelevant elements like timestamps and IDs. By doing this, you ensure that all alerts are comparable and ready for accurate deduplication.
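As a rough illustration of this step, the sketch below strips timestamps and IDs from an alert message and normalizes case and whitespace. The regular expressions and the function name `preprocess_alert` are assumptions made for this example; real alerts will need patterns matched to their own formats.

```python
import re

# Hypothetical patterns: adapt them to the timestamps and IDs your alerts contain.
TIMESTAMP = re.compile(
    r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"
)
UUID = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)
KEYED_ID = re.compile(r"\b(?:id|request|trace)[=:#]\s*\w+", re.IGNORECASE)


def preprocess_alert(text: str) -> str:
    """Remove volatile elements and normalize case and whitespace."""
    text = TIMESTAMP.sub(" ", text)
    text = UUID.sub(" ", text)
    text = KEYED_ID.sub(" ", text)
    text = text.lower()
    # Collapse the gaps left behind by the removed elements.
    return " ".join(text.split())


first = preprocess_alert("2024-05-01T12:30:45Z CPU usage HIGH on web-01 id=98765")
second = preprocess_alert("2024-05-02 03:10:00 cpu usage high on web-01 id=11111")
print(first == second)
```

After this cleanup, two alerts that differ only in their timestamp and ID become identical strings, which makes them directly comparable.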
2) Generating Text Embeddings
After preprocessing, each alert is transformed into a vector embedding using a pre-trained model such as BERT or one of OpenAI's embedding models. The vectors represent the meaning of the alerts, allowing for effective comparison and grouping during deduplication.
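To show the shape of this step without depending on an external model, here is a deliberately simplified stand-in: a hashed bag-of-words vector. It is not BERT or an OpenAI model and captures only word overlap, not meaning. The name `toy_embedding` and the 64-dimension size are assumptions for illustration; in practice you would call a real embedding model at this point.

```python
import hashlib
import math


def toy_embedding(text: str, dims: int = 64) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vector = [0.0] * dims
    for token in text.lower().split():
        # A stable hash keeps the mapping from token to dimension reproducible.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dims
        vector[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    # An empty text has no tokens, so it stays an all-zero vector.
    return [v / norm for v in vector] if norm else vector


vec = toy_embedding("cpu usage high on web-01")
print(len(vec))
```

Whatever model produces them, the result is the same kind of object: a fixed-length list of numbers per alert that later steps can compare.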
3) Implementing Deduplication Logic
Once alerts are vectorized, the system uses a similarity measure such as cosine similarity to compare them. If two alerts are deemed similar enough, based on a predefined threshold, they are merged into a single alert. This threshold can be tuned to balance catching true duplicates against wrongly merging distinct alerts.
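The comparison and merge logic described above can be sketched as follows. This is a minimal greedy grouping under assumed names (`cosine_similarity`, `deduplicate`) and an assumed default threshold of 0.9; it is not a description of how ilert implements deduplication.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has no length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(vectors, threshold=0.9):
    """Group alert indices; the first index in each group is its representative."""
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            # Compare against the representative of each existing group.
            if cosine_similarity(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No group was similar enough, so this alert starts a new one.
            groups.append([i])
    return groups


vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(deduplicate(vectors))  # [[0, 1], [2]]
```

Comparing only against a group's first member keeps the sketch short; a production system might compare against a group centroid or use a vector index instead.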
4) Continuous Feedback and Optimization
A feedback loop is necessary because it lets operators flag missed duplicates and false positives, so the system can keep improving as thresholds are adjusted and the embedding models are fine-tuned.
Key Considerations for Effective Deduplication
While embedding models are a powerful tool for deduplication, several key issues need to be addressed:
Which Model to Choose? The right choice of embedding model will determine how well your deduplication process works. Fine-tuned or domain-specific models are better able to capture the nuanced information of your alerts, improving the deduplication outcomes.
What Threshold is Optimal? Establishing the appropriate threshold is essential. When a threshold is set too low, different warnings may be mistakenly combined, while a threshold set too high may result in duplicates being missed. Finding the ideal balance requires ongoing testing and tweaking.
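One simple way to test candidate thresholds is to score them against alert pairs that operators have already labeled as duplicate or not. The sketch below assumes such labeled pairs exist; the sample values and the function names are made up for illustration.

```python
def evaluate_threshold(labeled_pairs, threshold):
    """Return (false_merges, missed_duplicates) for one similarity threshold."""
    false_merges = sum(
        1 for sim, is_dup in labeled_pairs if sim >= threshold and not is_dup
    )
    missed = sum(1 for sim, is_dup in labeled_pairs if sim < threshold and is_dup)
    return false_merges, missed


def best_threshold(labeled_pairs, candidates):
    """Pick the candidate with the fewest total errors (lowest value on ties)."""
    return min(
        candidates,
        key=lambda t: (sum(evaluate_threshold(labeled_pairs, t)), t),
    )


# Hypothetical feedback data: (similarity score, operator says "duplicate").
pairs = [
    (0.95, True),
    (0.91, True),
    (0.88, False),
    (0.80, True),
    (0.75, False),
    (0.60, False),
]
candidates = [0.7, 0.85, 0.9, 0.97]
for t in candidates:
    print(t, evaluate_threshold(pairs, t))
print(best_threshold(pairs, candidates))
```

Counting both error types with equal weight is itself a design choice. If a wrongly merged alert is more costly for your team than a missed duplicate, weight the two counts differently.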
Reducing Noise with ilert
ilert AI offers a powerful solution for reducing alert noise through its advanced deduplication and alert management features. By integrating with your monitoring tools, ilert normalizes incoming alerts and uses AI-driven techniques to identify and merge duplicate notifications. This process significantly cuts down on the volume of alerts, allowing your team to focus on incident resolution.
With ilert, you can ensure that only the most relevant alerts reach your team, reducing the risk of missed critical issues and enhancing overall incident response efficiency.
Welcome to our detailed guide to bringing your existing ilert incident management configuration into Terraform. Here, you will find a step-by-step tutorial for importing your existing ilert resources into an Infrastructure as Code (IaC) project, along with recommendations from our engineering team on best practices for maintaining consistency across your infrastructure and incident management processes.
If you have yet to start incorporating IaC practices in your organization, we recommend beginning with this ilert Terraform provider overview.
What problem do we solve?
The most common case is when users start their journey with the ilert incident management platform through the user interface and incorporate their established setup into Infrastructure as Code practices later. The ilert UI is typically more user-friendly and intuitive, making it quicker for engineers to create and configure resources like alert policies or on-call schedules. For instance, when experimenting with different settings or making quick changes, using the UI is faster and more straightforward than writing and applying Terraform code. On the other hand, once a resource configuration is stable and well-understood, engineers might prefer to codify it in Terraform for better consistency, version control, and automation across different environments.
Even companies utilizing ilert with IaC practices for years might use a combination of ilert UI and Terraform based on factors like ease of use, the immediacy of needs, the complexity of resource management, and the team's experience level with Terraform. The hybrid approach allows flexibility during initial setup phases or when manual intervention is necessary, while Terraform is favored for long-term consistency and automation.
So, it's perfectly fine that not all your ilert resources are already part of your Terraform project. However, importing existing resources into an IaC project can be tricky. Common problems are Terraform creating duplicates (new resources) instead of importing the existing ones, or errors like:
Error: Bad request: api respond with status code: 400, error code: ERROR, message: The email '[example@example.com]' is already used by user 1234567
Let's see how to import existing ilert resources smoothly and avoid these issues.
Step 1: Identify an ilert resource ID you want to import
Let's see how to import an alert source created in the ilert interface into Terraform. Start with identifying a unique ID. You can do it directly in the UI or by using the API.
Method 1: Via ilert UI
Log in to your ilert account and navigate to Alert sources in the top menu.
Find the alert source you want to import in the list or use the search field.
Click on the alert source's name to view its details. Then look at the URL in your browser's address bar: https://example.ilert.com/source/view?id=1234567. Copy the number at the end; this is the ID you need.
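If you collect IDs for several alert sources, you can also read the ID out of the copied URL programmatically. This small sketch uses only the Python standard library; the function name is an assumption for illustration.

```python
from urllib.parse import parse_qs, urlparse


def alert_source_id_from_url(url: str) -> str:
    """Return the value of the "id" query parameter from an alert source URL."""
    query = parse_qs(urlparse(url).query)
    return query["id"][0]


print(alert_source_id_from_url("https://example.ilert.com/source/view?id=1234567"))
```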
Method 2: Via API
Ensure you have an API key; you can generate one in your ilert account settings under the API section.
Use the key to request the list of your alert sources from the ilert API. The response will be in JSON format and include the details of all alert sources, including their IDs. Look for the "id" field.
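Once you have the JSON response, picking out the IDs can be scripted. The sketch below assumes the response body is a JSON array of objects that each carry an "id" and a "name" field; only the "id" field is stated above, so check the ilert API reference for the actual shape. The sample payload is made up for illustration.

```python
import json


def alert_source_ids(response_body: str) -> dict[str, int]:
    """Map each alert source name to its ID (assumed response shape)."""
    sources = json.loads(response_body)
    return {source.get("name", "<unnamed>"): source["id"] for source in sources}


# Hypothetical response body, not real API output.
sample = (
    '[{"id": 1234567, "name": "Critical Server Alerts"},'
    ' {"id": 7654321, "name": "Database Alerts"}]'
)
print(alert_source_ids(sample))
```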
Step 2: Set up a Terraform block
In Terraform, a "block" refers to a section of code that defines a specific piece of configuration. Blocks are the building units in Terraform configuration files. Each block usually starts with a keyword that specifies what type of resource or setting you are configuring, followed by the details of that configuration enclosed in curly braces {}.
In your Terraform configuration file (e.g., main.tf), you would define a resource block for the alert source.
resource "ilert_alert_source" "example_alert_source" {
  name                 = "Critical Server Alerts"
  integration_type     = "API"
  escalation_policy_id = 1234 # Replace with the actual escalation policy ID
  auto_resolve_timeout = 900  # Time in seconds before automatically resolving the alert

  email_notification {
    email = "alerts@example.com"
  }

  sms_notification {
    phone_number = "+1234567890"
  }
}
Step 3: Execute the Terraform Import
In your terminal or command line, navigate to the directory containing your Terraform configuration files. Then, execute the following command:
terraform import ilert_alert_source.example_alert_source <ALERT_SOURCE_ID>
"ilert_alert_source.example_alert_source" refers to the Terraform resource you defined in your ".tf" file. Replace <ALERT_SOURCE_ID> with the actual ID of the alert source from ilert that you noted earlier.
Note that while the import key (identifier) is the same as the entity’s ID in the vast majority of cases, the two can occasionally differ. You can find the import description at the bottom of each resource page in the ilert Terraform provider's documentation.
Step 4: Complete the Configuration
After importing, Terraform knows about the existing alert source. However, the configuration file itself might not have all the details yet.
Execute a "terraform plan" to see what Terraform recognizes about the imported resource. This command will show you the current state of the resource compared to your configuration.
Based on the output of "terraform plan", update the resource block in your ".tf" file with the appropriate configurations.
By following these steps, you can import an existing ilert alert source into Terraform and manage it as part of your Infrastructure as Code (IaC) setup. This process helps maintain consistency, allows easier updates, and integrates the alert source into your version-controlled infrastructure management.
Best practices and recommendations
Should I generate all the ilert entities via Terraform? How do other teams automate setting up and configuring their incident management workflows within IaC practices? We put these questions to ilert's CTO, Christian.
"It largely depends on the company structure. If you have a centralized Ops team that handles tasks such as user and team synchronization (via Terraform, API, SSO provisioning, etc.), then you could theoretically also consider having that team manage all other resources, especially policies, alert sources and alert actions.
However, we advise against this if there are independent responder teams. In our opinion, it’s best practice for team-relevant resources to be managed by the team members themselves. In larger organizations, we also see teams with DevOps skills who manage this individually through their own Terraform configurations, but some teams exclusively use the ilert UI.
In cases with responders who do not directly interact with the resources and only deal with alerts and incidents or in organizations where the number of alert sources remains manageable, we also see customers using complete Terraform configurations for all account resources.
There are even setups where Ops teams have fully automated the ilert onboarding, starting with assigning the correct escalation policy to alert sources based on, for example, Prometheus labels. Another example is DevOps teams submitting pull requests to the Terraform repository to provision themselves independently via GitHub actions.
Personally, I believe that centralized Ops teams also benefit when responders take responsibility for their own alert sources and services because this leads to greater engagement with the platform, which in turn results in better workflows and faster response times."
We are excited to add one more integration from the Industrial Internet of Things realm to our catalog! The seamless integration between ilert and Ubidots aims to streamline your operations, reduce machine downtime, and improve overall efficiency.
What is Ubidots?
Ubidots is an innovative Internet of Things (IoT) platform that allows users to collect, analyze, and visualize data from their devices and sensors. It offers tools for building IoT applications, including real-time data visualization, cloud-based data storage, and advanced data analytics.
Users connect their hardware with the Ubidots platform using HTTP, MQTT, TCP, UDP, or by parsing custom/industrial protocols. The service works equally well for managing one or a thousand devices. Ubidots has various use cases. For example, with its help, companies track air and water quality, monitor the location and status of valuable assets, optimize the consumption and production of energy resources, and much more.
How Ubidots' Users Can Benefit from the Integration with ilert
Imagine a manufacturing company using Ubidots to monitor the performance of their machinery. Sensors placed on critical equipment collect data in real-time, providing insights into operating conditions and performance metrics. By connecting Ubidots with ilert, an alert is sent out immediately through multiple channels when an abnormal pattern is detected—such as a spike in temperature or vibration indicating a potential failure. An on-call technician receives the alert via SMS and phone call, ensuring they are aware of the issue even if their phone is on mute. The technician can then respond swiftly, checking the equipment and performing necessary maintenance before a failure occurs. After resolving the issue, the team can review the detailed post-incident report generated by ilert to understand the root cause and take steps to prevent future occurrences.
The integration between Ubidots and ilert enhances the manufacturers' ability to respond to incidents quickly and efficiently. Here are a few key features of the integration.
Actionable alerts. When there is an issue with a machine or sensor, users of the ilert integration for Ubidots receive real-time actionable alerts. They can accept an alert or reroute it to another engineer without logging into ilert.
Live dashboards. Active monitoring becomes a simple, intuitive task thanks to Ubidots' drag-n-drop dashboards and its broad selection of widgets. Bring your SCADAs to the cloud and stay in touch with your operation from anywhere.
Automated on-call management. ilert eliminates the manual effort and errors associated with managing on-call duties. The schedules are always at hand, and automatic reminders ensure that users never miss an on-call shift.
Status pages. Ubidots' customers can easily update stakeholders on the status of machines via ilert private and public status pages. ilert status pages communicate incidents on auto-pilot, and there are various authentication options for access fine-tuning, like passwordless email login or whitelisted IP addresses.
AIOps. For teams with hundreds of devices that regularly deal with large volumes of alerts, ilert's intelligent alert grouping and filtering can help reduce alert noise and better allocate engineering resources.
Integrating ilert with Ubidots brings a new level of efficiency and responsiveness to IoT monitoring and incident management. If you are new to Ubidots, start a free 30-day trial here.