Engineering an AI Proxy for ilert
Building an AI proxy for our AI features was one of the best decisions we made a year ago. In this article, we share why we built it and which challenges we faced along the way.
Reasons why we created an AI proxy
Narrow, too narrow context
This journey began in 2023, when we first started implementing AI features in ilert. Back then, the capabilities of ChatGPT were impressive but still far from what is available now, at the end of 2024. At that time, generative AI was just beginning to prove its value in business applications, with early adopters exploring potential use cases in customer service, content creation, and data analysis. The initial version of ChatGPT could deliver meaningful insights and streamline some workflows, but it struggled with complex, domain-specific, and context-dependent queries.
The context window was narrow: only 4,000 tokens. Our first AI-backed feature was automatic incident communication creation. In short, this feature helps customers automatically create messaging for their status pages to inform users about IT incidents. We didn't want to limit ourselves to an intelligently crafted announcement; we also wanted to include information on affected services and automatically identify those services and switch their status from operational to outage. Our clients can have thousands of services in their accounts, with only a few affected by any given incident, and we had to identify those few automatically. We also have clients in China and India, where service names can be long and consume many tokens. So, 4,000 tokens weren't enough.
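To give a rough sense of the arithmetic, here is a small TypeScript sketch with entirely hypothetical service data, using the common four-characters-per-token rule of thumb (non-Latin scripts usually need noticeably more tokens per character). It shows how quickly a full service catalog alone eats a 4,000-token window:

```typescript
// Rough illustration of why 4,000 tokens are too few when the prompt has to
// carry an account's full service catalog. Data and names are hypothetical.
interface Service {
  id: number;
  name: string;
}

// ~4 characters per token is a rough heuristic for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function estimatePromptTokens(instruction: string, services: Service[]): number {
  const catalog = services.map((s) => `${s.id}: ${s.name}`).join("\n");
  return estimateTokens(instruction) + estimateTokens(catalog);
}

// A hypothetical account with 3,000 services and ~35-character names
const services: Service[] = Array.from({ length: 3000 }, (_, i) => ({
  id: i,
  name: `service-${i}-payment-gateway-eu-central`,
}));

console.log(estimatePromptTokens("Identify the affected services.", services));
// => tens of thousands of tokens before the incident description is even added
```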
Can I get my JSON back, please?
JSON-formatted data is commonly used for machine-to-machine communication. It ensures that data can be easily parsed, validated, and transmitted across our applications, maintaining consistency and reducing the likelihood of data handling errors. However, like many others, we encountered some challenges related to JSON handling with the early releases of GPT-3.
Those versions were designed primarily for conversational text generation rather than structured data output. While ChatGPT could understand and generate JSON-like responses, it struggled to adhere strictly to the JSON format. So even when we managed to fit a query into 4,000 tokens, the early models' responses would occasionally omit closing braces or add unexpected text, which broke downstream processes that required strictly valid JSON. Simply put, the call was failing.
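To cope with that, every call had to be wrapped in defensive parsing. The sketch below is illustrative rather than our actual code: it pulls the outermost JSON object out of the raw response, validates the fields it expects, and retries if the result is still unusable.

```typescript
// Minimal guardrail around early model output: strip prose around the JSON,
// parse, validate, and retry if needed. Types and names are illustrative.
interface IncidentMessage {
  summary: string;
  affectedServiceIds: number[];
}

function extractJson(raw: string): IncidentMessage | null {
  // Models often wrapped the JSON in explanations or code fences,
  // or dropped a closing brace entirely.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end === -1 || end < start) return null;

  try {
    const parsed = JSON.parse(raw.slice(start, end + 1));
    if (typeof parsed.summary !== "string" || !Array.isArray(parsed.affectedServiceIds)) {
      return null;
    }
    return parsed as IncidentMessage;
  } catch {
    return null;
  }
}

async function completeWithRetry(
  callModel: () => Promise<string>,
  maxAttempts = 3
): Promise<IncidentMessage> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = extractJson(await callModel());
    if (result) return result;
  }
  throw new Error("Model did not return valid JSON");
}
```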
No agents and no functions
GPT agents, as we know them now, can break down complex problems into actionable steps, prioritize tasks, and even chain responses together to achieve a goal. Without these capabilities, we had to rely on static prompt engineering, where each interaction with the AI was isolated and required precise prompting to achieve even moderately complex outcomes. This absence made it challenging to build features that required decision-making based on prior context or that needed to adapt dynamically to user input. Take AI-assisted on-call schedule creation as an example: we feed the model context-specific data and receive a feasible, flexible schedule in return.
Functions let the model go beyond simple text generation by executing specific, pre-defined actions within a system: interacting with external services or databases and retrieving or updating data based on user input. In our case, functions allow the AI to call ilert's API directly, for example to retrieve ilert-related context data. This transforms the AI from a passive responder into an active, task-oriented assistant that can autonomously handle complex workflows. It's hard to believe now, but two years ago there were no functions at all.
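For readers who haven't used function calling, here is a minimal sketch in the OpenAI-style tool format. The tool name, its parameters, and the endpoint it calls are illustrative assumptions, not ilert's actual definitions:

```typescript
// An OpenAI-style tool definition the proxy can pass along with a request.
// The model can then ask the proxy to fetch ilert context data instead of
// guessing from the prompt alone. Names and parameters are hypothetical.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_services",
      description: "Fetch the services of the current ilert account, optionally filtered by name",
      parameters: {
        type: "object",
        properties: {
          nameFilter: {
            type: "string",
            description: "Optional substring to filter service names by",
          },
        },
        required: [],
      },
    },
  },
];

// When the model requests a tool call, the proxy executes it and feeds the
// result back into the conversation. (Authentication omitted for brevity;
// the endpoint below is illustrative.)
async function handleToolCall(name: string, args: Record<string, unknown>): Promise<unknown> {
  if (name === "get_services") {
    const response = await fetch("https://api.ilert.com/api/services");
    return response.json();
  }
  throw new Error(`Unknown tool: ${name}`);
}
```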
Last but not least, we wanted to use different LLM providers, such as AWS Bedrock and Azure OpenAI's GPT-4. With many customers in the EU, we couldn't limit ourselves to the US-based OpenAI API alone. The absence of native support for these operational requirements led us to the concept of an AI proxy: a middle layer that manages requests and responses across AI models and ensures each interaction meets ilert's performance standards.
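Conceptually, that middle layer boils down to one internal interface with interchangeable provider implementations behind it. The sketch below uses illustrative names and is not our actual code:

```typescript
// One internal contract for every LLM call the proxy makes. Concrete
// implementations wrap Azure OpenAI, AWS Bedrock, and so on, so switching
// providers is a routing decision, not a code change in the features.
interface ChatRequest {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  maxTokens?: number;
}

interface ChatResponse {
  content: string;
  inputTokens: number;
  outputTokens: number;
}

interface LlmProvider {
  readonly name: string;
  complete(request: ChatRequest): Promise<ChatResponse>;
}

// Example stub of one provider behind the interface (body omitted).
class AzureOpenAiProvider implements LlmProvider {
  readonly name = "azure-gpt-4";
  async complete(request: ChatRequest): Promise<ChatResponse> {
    // ...call Azure OpenAI here and map the response to ChatResponse
    throw new Error("not implemented in this sketch");
  }
}
```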
Problems we solve with our custom proxy
Logging, monitoring, and saving
Instead of sending AI requests straight to OpenAI or other model providers (and paying tolls every time), we funnel everything through our custom AI proxy. This way, whether you're preparing an incident message for your clients, setting up schedules, or assembling a post-mortem document, the AI request goes through a one-stop shop where we handle all the behind-the-scenes work: logging, monitoring, and, yes, keeping an eye on those precious tokens and costs.
By tracking token usage and other cost metrics, the AI proxy helps us avoid unpleasant surprises on the billing side. Even better, we capture everything that goes in and out, which means we can use the data to fine-tune our models and help the AI improve with every interaction. We also log performance data for different model versions, enabling us to assess each model's effectiveness, response times, and accuracy under real-world conditions. Additionally, we track customer feedback per use case and the model serving it, so we know whether a given use case performs better or worse on different LLMs.
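In practice, this bookkeeping is a thin wrapper around every model call. The following sketch, with illustrative field and function names, shows the kind of data recorded per request:

```typescript
// Per-request record the proxy could persist for cost tracking, fine-tuning
// datasets, and model comparisons. Field names are illustrative.
interface AiCallLog {
  useCase: string;      // e.g. "incident-communication" or "post-mortem"
  model: string;        // which LLM actually served the request
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  timestamp: Date;
}

async function withUsageLogging<T extends { inputTokens: number; outputTokens: number }>(
  useCase: string,
  model: string,
  call: () => Promise<T>,
  persist: (log: AiCallLog) => Promise<void>
): Promise<T> {
  const startedAt = Date.now();
  const response = await call();
  await persist({
    useCase,
    model,
    inputTokens: response.inputTokens,
    outputTokens: response.outputTokens,
    latencyMs: Date.now() - startedAt,
    timestamp: new Date(),
  });
  return response;
}
```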
A significant advantage of our AI proxy is that it enables us to switch between different large language models on the fly, which is critical for ilert's European customers who prioritize data localization. Many of our clients require on-premise models or cloud solutions that meet stringent data residency requirements, such as AWS Bedrock operating within specific regions like Frankfurt or Stockholm. By storing conversation threads and session histories locally, and only for the lifetime of the conversation, in our thread store, we can dynamically reroute requests between providers like Azure's GPT-4 and AWS Bedrock without losing context. Circuit breakers within the AI proxy monitor response times and model consistency and automatically reroute traffic when a specific provider encounters high demand or slowdowns, maintaining a seamless and reliable user experience.
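To illustrate the routing idea, here is a simplified failover sketch that reuses the LlmProvider interface from the earlier snippet. The thresholds, timeout, and breaker logic are assumptions for illustration; a production breaker also needs half-open probing and per-provider state:

```typescript
// Simplified circuit breaker: after a few failures, stop sending traffic to
// the primary provider for a cooldown period and route to the fallback.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 3,
    private readonly cooldownMs = 60_000
  ) {}

  isOpen(): boolean {
    if (this.failures < this.failureThreshold) return false;
    if (Date.now() - this.openedAt > this.cooldownMs) {
      this.failures = 0; // cooldown elapsed, allow traffic again
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures++;
    if (this.failures >= this.failureThreshold) this.openedAt = Date.now();
  }
}

async function completeWithFailover(
  primary: LlmProvider,
  fallback: LlmProvider,
  breaker: CircuitBreaker,
  request: ChatRequest,
  timeoutMs = 10_000
): Promise<ChatResponse> {
  if (!breaker.isOpen()) {
    try {
      const timeout = new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error("timeout")), timeoutMs)
      );
      const response = await Promise.race([primary.complete(request), timeout]);
      breaker.recordSuccess();
      return response;
    } catch {
      breaker.recordFailure();
    }
  }
  // Conversation context lives in the proxy's own thread store, so the
  // fallback provider can pick up the thread without losing history.
  return fallback.complete(request);
}
```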