BLOG

LLMs in production: optimising from multi-second to sub-second latency and getting 50x cost reductions for free

Table of contents

When youโ€™re dealing with a critical cloud infrastructure issue, every second counts. You need help fast, and you need it to be accurate. But even when youโ€™re not in a rush, you donโ€™t want to spend your precious time going through long forms asking to be filled out; you want to describe your problem and you want the rest to be taken care of.

The Challenge: making AI-assisted support snappier and more interactive

At DoiT, we haveย Avaย to assist you with your FinOps and cloud queries. Sometimes, though, you want to talk things through with aย human expert. This is when ourย Case IQ systemย helps customers provide the correct technical details when opening a case, ensuring our Customer Reliability Engineers (CREs) have everything they need to solve problems quickly.

The idea originated from our hackathon in summer 2024 and was built on OpenAI APIs. However, we decided to push the status quo and do better, focusing on the latency of our recommendations to the customer, making the system feel more snappy and interactive.

The Experiment: testing five models for latency, cost, and performance

To solve this, we designed a comprehensive 2-week experiment comparing our current model (OpenAIโ€™sย GPT-4o) against four alternatives:

  • GPT-4.1 miniย (OpenAIโ€™s newer, faster model)
  • Llama 3.1 8Bย (smaller, ultra-fast model on Groqโ€™s specialized hardware)
  • Llama 3.3 70Bย (larger, more capable model on Groq)
  • Llama 4 Scout 17Bย (preview model from Metaโ€™s latest family of models with promising capabilities)

The main aim was to find a model with lower latencies than the GPT-4o baseline. We expect to take a (small) hit on response quality to achieve that, and take any cost-cutting as a nice side-effect.

We tested these models across five tasks that Case IQ performs when you create an engagement:

  • Platform detection: What specific platform is the request related to
  • Product identification: Which specific cloud service needs help?
  • Severity assessment: How urgent is this problem?
  • Asset identification: Which project or account is affected?
  • Technical details extraction: What specific information do our engineers need?

Over two weeks, we processedย 21,517 traces across 755 real customer engagements, measuring latency, cost, and accuracy.

The technical foundation that made this comparison easy was our existingย LangChainย integration. Since we were already using LangChain for our GPT-4o implementation, adding the comparison models was straightforward: we addedย ChatGroqย calls alongside our existingย ChatOpenAIย integration, running them asynchronously to avoid impacting our production system.

We leveragedย LangSmithย for comprehensive instrumentation, automatically capturing latency measurements, token usage, error rates, and input/output logging across all traces.

The Results: increased speed for a small sacrifice in quality

The results exceeded our expectations:

โšก 4โ€“5x speed improvements

  • Platform detection:ย 571ms โ†’ 249msย (4.1x faster, using Llama 3.3 70B)
  • Product detection:ย 851ms โ†’ 406msย (2.1x faster, using Llama 3.1 8B)
  • Severity detection:ย 605ms โ†’ 126msย (2.6x faster, using Llama 3.3 70B)
  • Asset detection:ย 593ms โ†’ 220msย (2.7x faster, using Llama 3.3 70B)
  • Technical details extraction:ย 1,914ms โ†’ 334msย (5.7x faster, using Llama 3.1 8B)

๐Ÿ’ฐ Up to 50x cost reductions

While speed was our primary goal, the cost savings were remarkable โ€” some tasks became 50x cheaper to run while maintaining quality.

๐ŸŽฏ Keeping up performance

Through manual review of real customer engagements, we found that while GPT-4o scored 92โ€“96% accuracy, our faster alternatives maintained strong performance:

  • Llama 3.3 70B: 88โ€“96% accuracy with 2โ€“3x speed improvements
  • Llama 3.1 8B: 55โ€“88% accuracy with 4โ€“5x speed improvements

The Winning Strategy: A Hybrid Approach

Rather than picking a single โ€œbestโ€ model, we concluded that different models were needed for an overall optimal solution:

  • Llama 3.1 8Bย for product and technical detail detection (since these tasks have dependencies on each other, this is where speed matters most)
  • Llama 3.3 70Bย for platform, and severity detection, and asset identification (Llama 3.1 8B seemed to struggle with this task, although we believe there is room for optimisation through prompting)

The result?ย Total response time drops from over 3 seconds to under 1 second: aย 3โ€“4x overall speedup. Moreover, with this hybrid approach, we are expecting aย cost savings of ~93%ย on our total bill.

What this means for you

โšกย Near-Instant Responses: When you describe your cloud infrastructure problem, CaseIQ can now analyze it and request the correct technical details almost instantly.

๐Ÿ”„ย Real-Time Support Channels: These speed improvements unlock new possibilities. We are exploring putting support directly in Slack or other messaging platforms where our customers are already present.

๐Ÿš€ย Better First-Time Resolution: More accurate and complete problem descriptions for our specialists lead to faster response times and reduced back-and-forth.

Takeaways and whatโ€™s next

While the full technical details are fascinating (and availableย here), the key learning was twofold:

  • Strategic model selection works: Careful provider and model selection combined with smart architectural decisions can achieve dramatic latency improvements (3+ seconds to sub-second) while delivering massive cost reductions as a bonus.
  • Human evaluation is irreplaceable: While automated metrics provide useful baselines, manual review remains essential for understanding actual performance when dealing with text and humans; thereโ€™s always nuance that only people can properly evaluate.

At DoiT, we believe in being โ€œpowered by technology, perfected by people.โ€ These improvements ensure that when you need human expertise from our CREs, our AI has already done the groundwork to get you answers as quickly as possible.

โ€”

Want to experience the improved Case IQ for yourself?ย Talk to us todayย and see how we can help you.

Schedule a call with our team

You will receive a calendar invite to the email address provided below for a 15-minute call with one of our team members to discuss your needs.

You will be presented with date and time options on the next step