Which LLM Is Right for Your Service Desk? A Practical Guide to AI for ESM
There's a question landing in more IT strategy meetings than ever before: which AI model should we actually be running?
It sounds like a technical decision; it has enough acronyms, jargon and rapid evolution to imply technical complexity. In practice, however, it's a business one - because the model you choose, and where you use it, shapes how much value your service management platform delivers and what it costs to run. Get the decision right and AI accelerates your outputs and becomes a force multiplier for your team. Get it wrong and you're either overpaying for capability you don't need, or underwhelming your users with a model that isn't up to the job.
This guide is for IT leaders navigating that decision. We'll cover how LLMs work in a service management context, which models are worth understanding today, how to match the right model to the right task, and why the most forward-thinking organisations aren't choosing a model - they're building a strategy.
What an LLM actually does in your ESM platform

A Large Language Model (LLM) is the intelligence layer behind your AI virtual agent, knowledge management tools, and automated workflows. When a user asks why their access request hasn't been fulfilled, when your platform drafts a knowledge article from a resolved incident, or when your AI flags an emerging problem pattern before it becomes a P1 - an LLM is doing the reasoning.
What makes this complicated is that LLMs aren't interchangeable. Each one makes different trade-offs between speed, depth of reasoning, cost, and safety. A model that's brilliant at handling a complex, multi-source synthesis task might be complete overkill for answering a routine password reset query - and at ten times the price per interaction.
That mismatch, running unnoticed at scale across thousands of daily interactions, is where AI ESM deployments either prove their business case or quietly haemorrhage budget.
The models worth knowing

You don't need to track every model release. But a working understanding of the main players - and what they're actually good at - helps you ask better questions of your vendors.
Claude (Anthropic)

Claude is the model family we most commonly recommend for service management contexts, and Sonnet 4.5 remains our current default for the majority of ESM workloads. It delivers reasoning quality that rivals flagship models from a year ago, handles large documents and long conversation histories well (its context window now runs to one million tokens), and is built on Anthropic's Constitutional AI framework - meaning its safety behaviour is consistent and predictable in ways that matter when the AI is talking directly to your employees.
Opus 4.5 is the step up for genuinely complex tasks. Where Sonnet handles the day-to-day comfortably, Opus earns its place when the job requires synthesising large volumes of unstructured data, producing polished publication-ready content, or orchestrating multi-step analysis that needs to hold a thread across a very long context. The trade-off is cost - Opus runs at a meaningfully higher price per interaction - so it makes sense to be selective about where you deploy it.
Notably, we're not actively recommending Anthropic's 4.6 models at this time, as early testing suggests some serious down-tuning in this release. That speaks to the need to plug and play with different models and versions: if Anthropic (or any other provider!) makes significant changes to its models, you should be able to pin a known-good version and run it without fear.
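To make that concrete, here's a minimal sketch of version pinning using the Anthropic Python SDK. The dated model ID is illustrative - confirm it against your provider's current model list rather than taking it from here.

```python
# Minimal sketch: pin a dated model snapshot rather than a floating alias,
# so provider-side changes can't silently alter behaviour underneath you.
# The model ID below is illustrative - check your provider's current list.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",  # pinned snapshot, not "latest"
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarise incident INC-1234 for the requester."}],
)
print(response.content[0].text)
```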
GPT (OpenAI)

GPT is a well-established model family with the broadest ecosystem integrations. GPT-5.4 performs strongly across a wide range of tasks and is a natural default for organisations already running deep in the Microsoft stack. Its tool-calling behaviour is reliable and it handles agentic workflows well.
We find that OpenAI's models perform well, but are notably slower than other offerings in the market. Results are good across the board, with high-quality agent handoff and reasoning - but if speed is the name of the game, other models are superior.
DeepSeek

DeepSeek has attracted significant attention for its price efficiency. Recent versions are dramatically cheaper than Western frontier models and can be self-hosted, which appeals to organisations with strict data sovereignty requirements.
That said, for customer-facing or employee-facing AI applications, the safety architecture is less comprehensive than what you get from Anthropic or OpenAI - a consideration that matters in regulated industries or anywhere model outputs carry real operational weight.
We find that DeepSeek's models are a great fit for organisations running their own hardware and compute for data residency (or cool points!), and for code generation.
Mistral

Mistral is worth knowing about if EU data residency is a hard requirement. Its models are available through Azure EU regions and handle structured, well-defined tasks well. Less versatile for open-ended reasoning, but a credible option within those constraints.
Mistral has come on in leaps and bounds over the last year or two; if that trend continues, its models will become a more compelling option for Microsoft Azure customers.
Matching the model to the task

No model dominates every category.
The more useful question isn't "which model is best?" - it's "which model is best for this specific job?"
This is where the practical guidance gets specific. Different AI ESM capabilities genuinely benefit from different model profiles.
AI virtual agent

What it involves: Answering employee questions, handling routine requests, providing status updates, resolving common issues through guided conversation.
Model fit: This is your highest-volume, lowest-complexity AI workload. Speed and availability matter more than raw reasoning depth. A mid-tier model like Claude Sonnet 4.5 handles this well, and at a cost that makes sense when you're processing thousands of interactions a week.
Knowledge generation

What it involves: Taking a resolved incident or problem record and producing a structured, accurate, publish-ready knowledge article for your service catalogue.
Model fit: The bar here is higher than virtual agent work. The AI needs to understand technical context, apply consistent structure, and produce output that a human knowledge manager would be comfortable putting their name to. A capable mid-tier model handles most of this well, but for technically complex or high-visibility articles, stepping up to Opus-tier reasoning produces noticeably better output.
Cross-module intelligence

What it involves: This is the highest-value use case in AI ESM - and the one most teams are working toward rather than running today. Imagine your platform looking across search queries, open tickets, problem records, CMDB data, and your existing knowledge base simultaneously, identifying a pattern that no individual analyst would have caught, and proactively surfacing a recommendation: a new knowledge article, a candidate problem record, or a projected threshold trigger.
Model fit: Frontier-tier reasoning. This kind of sustained, multi-source synthesis - holding a thread across disparate data streams and producing something genuinely useful from it - is where the gap between a mid-tier and a flagship model becomes visible. It's also exactly where the investment pays off, because the output is high-value and the interaction frequency is low.
Agentic automation

What it involves: AI-driven routing, classification, escalation decisions, and the execution of multi-step workflows without human intervention at each stage.
Model fit: Depends on complexity. Simple classification and routing tasks run well on mid-tier models. Multi-step agentic workflows - where the AI is making a sequence of decisions and executing across systems - benefit from a model with stronger instruction-following and less hallucination risk, which typically means Claude Sonnet 4.5 or above.
Platform configuration and coding

What it involves: Code generation, configuring the Servicely platform, reviewing and updating existing configuration, and making recommendations on best practices and alignment to OOTB principles.
Model fit: Frontier-tier reasoning and coding capability are critical for good outcomes here. Servicely's architecture allows these models to draw reasoning and insights from the existing platform configuration within your instance, so the best practices and implementations already delivered by your team inform the outcome. The larger context windows available in these models, along with advanced reasoning, also allow them to solve challenging problems like Component Generation and Script Library creation.
The multi-model strategy

Here's the insight that's reshaping how serious AI ESM deployments are architected: the organisations getting the best results aren't picking a single model. They're routing different tasks to different models based on what each job actually requires.
A well-designed routing layer classifies each request and directs it to the most appropriate model. High-frequency, lower-complexity work goes to a fast, cost-efficient model. Synthesis and content generation routes to a mid-tier model. Complex cross-module analysis or high-stakes content reaches a frontier model. Industry data suggests this tiered approach can reduce AI infrastructure costs by 60–80% compared to routing everything through a flagship model - with no perceptible drop in quality for end users, because the right model is handling each job.
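To make the shape of that routing layer concrete, here's a minimal sketch. The tier names, keyword rules, and model IDs are illustrative assumptions - a production router would typically use a small classifier model rather than keyword matching.

```python
# Illustrative tiered router: classify each request, then dispatch it to the
# cheapest model tier that can handle it. Tier names, rules, and model IDs
# are assumptions for the sake of the example.
ROUTES = {
    "fast":     "claude-haiku-4-5",   # high-frequency, low-complexity work
    "standard": "claude-sonnet-4-5",  # synthesis and content generation
    "frontier": "claude-opus-4-5",    # cross-module analysis, high-stakes output
}

def classify(request: str) -> str:
    """Toy keyword classifier; real routers use a trained intent classifier."""
    text = request.lower()
    if any(kw in text for kw in ("password", "status", "reset", "how do i")):
        return "fast"
    if any(kw in text for kw in ("draft", "article", "summarise")):
        return "standard"
    return "frontier"  # default ambiguous, complex cases upward

def route(request: str) -> str:
    """Return the model that should handle this request."""
    return ROUTES[classify(request)]

print(route("Why hasn't my password reset been fulfilled?"))    # -> claude-haiku-4-5
print(route("Correlate these problem records with CMDB data"))  # -> claude-opus-4-5
```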
This is increasingly the standard architecture in production AI deployments. If your platform or vendor locks you into a single model, that's a meaningful constraint on your ability to optimise over time.
Servicely's AI layer, Sofi, is built with model flexibility at its core. Customers connect their preferred LLM provider(s) through simple configuration and identify which model they want to use for each category of use case within the platform: general use cases, complex reasoning, coding, and fast/immediate use cases. Each AI prompt and assistant can be given its own category, allowing customers to choose the right model, the right provider, and even the right data, at every step.
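As a purely hypothetical illustration of that category mapping - this is not Sofi's actual configuration schema, and the keys, providers, and model IDs are assumptions:

```python
# Hypothetical category-to-model mapping in the spirit of Sofi's use-case
# categories. NOT Sofi's actual schema; providers and model IDs are assumed.
MODEL_CONFIG = {
    "general":        {"provider": "anthropic", "model": "claude-sonnet-4-5"},
    "complex":        {"provider": "anthropic", "model": "claude-opus-4-5"},
    "coding":         {"provider": "deepseek",  "model": "deepseek-chat"},
    "fast_immediate": {"provider": "anthropic", "model": "claude-haiku-4-5"},
}

def model_for(category: str) -> str:
    """Resolve the provider/model pair configured for a prompt or assistant."""
    cfg = MODEL_CONFIG[category]
    return f'{cfg["provider"]}:{cfg["model"]}'

print(model_for("coding"))  # -> deepseek:deepseek-chat
```

The point of treating this as configuration rather than code is that swapping the model behind a category is a one-line change, with no workflow rebuild.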
That matters for a few reasons. The LLM market is moving fast - meaningful new model releases are appearing almost weekly in 2026. An organisation that's locked to a single vendor's model at the point of deployment is accumulating strategic risk. A platform that treats the model as a configurable component lets you take advantage of improvements in the market without re-platforming.
It also means your model choices can evolve with your use cases. Starting with Sonnet 4.5 for virtual agent and knowledge workflows is a sensible default. As your team activates more sophisticated capabilities - cross-module intelligence, proactive analytics, agentic automation - the platform is ready to route those to a model profile that matches the ambition.
Frequently asked questions

1. Is Claude Sonnet 4.5 still the recommended model for most ESM workloads?
Yes, and the recommendation has held up well as the market has evolved. Sonnet 4.5 delivers a strong balance of reasoning quality, cost efficiency, and safety for the types of interactions that make up the majority of service management AI workloads. It's not the right choice for every task - complex cross-module synthesis genuinely benefits from Opus-tier reasoning - but as a default configuration it remains well-justified.
2. What about newer models like Mistral, DeepSeek, or Opus? Have they changed the picture?
They've expanded the options rather than changed the fundamental recommendation. DeepSeek is worth evaluating for high-volume workloads where cost is the primary constraint and data sovereignty requirements can be met, but its safety architecture is less mature than Anthropic's for customer-facing applications. Mistral is a credible choice specifically for EU-resident deployments. Opus 4.6 (and more recently 4.7) earns its premium cost for the right tasks, primarily coding - though see the note on frontier models below before adopting it for advanced reasoning. None of them make a blanket case to replace Sonnet 4.5 as the everyday workhorse.
3. Should different Servicely capabilities use different models?
Ideally, yes. Virtual agent interactions and standard automation are well-served by mid-tier models. Knowledge generation and cross-module intelligence benefit from stepping up in reasoning capability. The practical starting point is a single well-chosen default model, with a clear plan to differentiate as you activate more sophisticated capabilities.
4. Does Servicely support model routing? How easy is it to swap models?
Servicely is built to support BYO LLM - customers connect their model provider via API key and can update that configuration without rebuilding workflows. The platform does not require a specific vendor. As native routing capabilities mature, the platform will increasingly handle the task-to-model decision automatically.
5. What should we be thinking about beyond model capability?
Data governance and residency are the considerations that most commonly constrain model choice in practice. If your organisation operates under GDPR, industry-specific data handling requirements, or has made commitments about where data is processed, those constraints need to shape your model shortlist before performance benchmarks do. Not all models offer the same hosting options or compliance posture - and it's a much easier conversation to have upfront than after deployment.
A note on frontier models
New models are being released every day - as with all software changes, it's critical that you test, evaluate, and understand the differences between these models and their behaviour. For example, we've found that recent changes to Claude's Opus models make them less capable at advanced reasoning and more prone to errors and mistakes. Be mindful when making your selection that things change regularly in this space, and trial and error is critical.
---
Want to talk through what a multi-model configuration would look like for your environment?
Reach out to your Servicely solutions consultant or book a demo to see AI in action.
