Model Routing: Send Each Task to the Right Model
Why Routing Matters
Most production AI applications use a single model for all requests because it is the simplest architecture. The chosen model is usually the provider's mid-tier option (Claude Sonnet, GPT-4o) because it provides good quality at moderate cost. This one-size-fits-all approach wastes money because a significant portion of requests do not need mid-tier capability. Classification, entity extraction, simple Q&A, format conversion, and templated responses are all handled equally well by economy-tier models at a fraction of the cost.
The distribution of request complexity follows a predictable pattern in most applications. Roughly 30 to 50 percent of requests are simple (Tier 1), 30 to 50 percent are moderate (Tier 2), and 10 to 20 percent are complex (Tier 3). Without routing, all requests pay the Tier 2 price. With routing, Tier 1 requests pay 4x to 10x less, Tier 2 requests pay the same, and Tier 3 requests pay more but represent only 10 to 20 percent of volume. The net savings depend on the exact distribution, but 30 to 55 percent reduction is typical.
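The arithmetic is easy to sanity-check for your own traffic. Here is a minimal sketch, assuming a placeholder 50/35/15 split and relative per-request prices of 0.1x, 1.0x, and 1.8x around the mid tier; substitute your measured distribution and actual per-model prices to estimate your own savings.

```python
# Back-of-the-envelope savings estimate for tiered routing.
# The traffic split and relative prices are placeholder assumptions.

distribution = {"tier1": 0.50, "tier2": 0.35, "tier3": 0.15}  # share of requests per tier
relative_cost = {"tier1": 0.1, "tier2": 1.0, "tier3": 1.8}    # per-request cost, mid tier = 1.0

# Baseline: every request goes to the mid-tier model.
baseline = sum(share * relative_cost["tier2"] for share in distribution.values())

# Routed: each tier pays its own model's price.
routed = sum(share * relative_cost[tier] for tier, share in distribution.items())

savings = 1 - routed / baseline
print(f"baseline={baseline:.2f}  routed={routed:.2f}  savings={savings:.0%}")
# With these placeholder numbers: routed ≈ 0.67 vs baseline 1.00, about a 33% reduction.
```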
Four Routing Approaches
Rule-Based Routing
The simplest approach maps predefined task types to models using explicit rules. Your application already knows what type of task it is performing (classification, search, generation, conversation), so the routing decision is a lookup table rather than a classification problem. Rule-based routing requires zero ML infrastructure, adds zero latency, and is fully transparent (every routing decision can be audited by reading the rules). Its limitation is that it only works when task types are known at the application level, which is the case for structured workflows but not for free-form conversations where the complexity varies within a single task type.
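Here is a minimal sketch of rule-based routing, assuming the application tags each request with a task type before calling the model. The task names and model identifiers are illustrative placeholders.

```python
# Rule-based routing: the application already knows the task type,
# so model selection is a dictionary lookup with a safe default.

ROUTING_RULES = {
    # Tier 1: simple, high-volume tasks go to the economy model.
    "classification": "claude-haiku",
    "entity_extraction": "claude-haiku",
    "format_conversion": "claude-haiku",
    # Tier 2: general generation and conversation.
    "summarization": "claude-sonnet",
    "conversation": "claude-sonnet",
    # Tier 3: multi-step reasoning and high-stakes output.
    "complex_analysis": "claude-opus",
}

DEFAULT_MODEL = "claude-sonnet"  # unknown task types fall back to the mid tier


def route(task_type: str) -> str:
    """Return the model for a known task type; default to the mid tier."""
    return ROUTING_RULES.get(task_type, DEFAULT_MODEL)


print(route("classification"))    # claude-haiku
print(route("complex_analysis"))  # claude-opus
print(route("new_task_type"))     # claude-sonnet (fallback)
```

The whole routing layer is a dictionary plus a default, which is why it adds no measurable latency and is trivial to audit.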
Classifier Routing
A lightweight classifier analyzes each request and predicts the minimum model tier needed. The classifier can be a keyword-based heuristic (fast but rigid), a small ML model trained on labeled examples (accurate but requires training data), or a call to a tiny language model (flexible but adds latency and cost). The classifier runs on every request, adding roughly 2 to 20 milliseconds of latency for the heuristic and small-model approaches, and noticeably more for a language-model call. The accuracy of the classifier directly determines the quality of routing: an 85 percent accurate classifier misroutes 15 percent of requests, leading to either quality degradation (simple model on complex request) or cost waste (complex model on simple request).
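Here is a sketch of the cheapest classifier variant, a keyword-and-length heuristic. The signal words, weights, and thresholds are illustrative assumptions to tune against labeled traffic; a trained model or a small language-model call would replace the scoring function.

```python
import re

# Heuristic classifier: score a request and map the score to a model tier.
# Keywords, weights, and cutoffs are illustrative; tune them on labeled traffic.

COMPLEX_SIGNALS = re.compile(
    r"\b(debug|architect|prove|optimi[sz]e|trade-?offs?|step[- ]by[- ]step|why)\b",
    re.IGNORECASE,
)
SIMPLE_SIGNALS = re.compile(
    r"\b(classify|extract|list|translate|convert|yes or no)\b",
    re.IGNORECASE,
)


def classify_tier(request: str) -> str:
    """Predict the minimum model tier for a request (tier1/tier2/tier3)."""
    score = 0
    score += 2 * len(COMPLEX_SIGNALS.findall(request))
    score -= 1 * len(SIMPLE_SIGNALS.findall(request))
    score += len(request) // 500  # long prompts tend to be harder

    if score >= 3:
        return "tier3"
    if score >= 1:
        return "tier2"
    return "tier1"


MODEL_BY_TIER = {"tier1": "claude-haiku", "tier2": "claude-sonnet", "tier3": "claude-opus"}

print(MODEL_BY_TIER[classify_tier("Extract the invoice number from this email.")])      # claude-haiku
print(MODEL_BY_TIER[classify_tier("Debug why this distributed lock fails step by step.")])  # claude-opus
```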
Cascading
Cascading starts with the cheapest model and escalates only when needed. The request goes to Haiku first. If the response passes a quality check (confidence threshold, format validation, or a quick rubric evaluation), it is returned to the user. If it fails the quality check, the request is sent to Sonnet, and potentially to Opus if Sonnet also fails. Cascading optimizes aggressively for cost because it only pays for a more expensive model when the cheaper model demonstrably fails. The trade-off is latency on escalated requests: a request that escalates through all three tiers incurs the cumulative latency and cost of all three model calls.
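The control flow looks roughly like this, assuming a call_model wrapper around your provider client and a passes_quality_check predicate (a trivial placeholder here; richer checks are sketched in the next example). Everything named below is illustrative.

```python
# Cascading: try the cheapest model first, escalate only on quality failure.
# The escalation order and the stubbed call/check functions are illustrative.

ESCALATION_ORDER = ["claude-haiku", "claude-sonnet", "claude-opus"]


def call_model(model: str, request: str) -> str:
    """Placeholder for a real provider call (e.g., your SDK client)."""
    return f"[{model}] response to: {request}"


def passes_quality_check(response: str) -> bool:
    """Trivial placeholder; see the richer combined checks sketched below."""
    return len(response) > 40 and "I'm not sure" not in response


def cascade(request: str) -> tuple[str, str, int]:
    """Return (response, model_used, attempts); escalate on failed checks."""
    attempts = 0
    response = ""
    for model in ESCALATION_ORDER:
        attempts += 1
        response = call_model(model, request)
        if passes_quality_check(response):
            return response, model, attempts
    # Every tier failed the check: return the top-tier answer anyway.
    return response, ESCALATION_ORDER[-1], attempts


response, model, attempts = cascade("Summarize this support ticket in one sentence.")
print(model, attempts)  # a request that escalates pays for every attempt it made
```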
The quality check is the critical design decision in cascading. Too strict, and most requests escalate (defeating the cost savings). Too lenient, and low-quality responses pass through (degrading user experience). Effective quality checks include: format validation (did the response match the expected JSON schema or structure), hedging detection (does the response contain uncertainty language like "I'm not sure" or "it's possible"), completeness checks (does the response address all parts of the question), and length checks (is the response suspiciously short for a complex question). Multiple checks can be combined for higher accuracy.
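Here is a sketch of a combined quality check for a response that is expected to be structured JSON. The hedging phrases, required keys, and length floor are illustrative thresholds; a completeness check would additionally need the original question (or a cheap rubric call), so it is omitted here.

```python
import json

# Combine several cheap checks; any failure triggers escalation to the next tier.
# Phrases, keys, and thresholds are illustrative assumptions.

HEDGING_PHRASES = ("i'm not sure", "i am not sure", "it's possible", "i cannot determine")


def valid_json_schema(response: str, required_keys: set[str]) -> bool:
    """Format validation: response parses as JSON and contains the expected keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def free_of_hedging(response: str) -> bool:
    """Hedging detection: no explicit uncertainty markers in the response."""
    lowered = response.lower()
    return not any(phrase in lowered for phrase in HEDGING_PHRASES)


def long_enough(response: str, min_chars: int = 80) -> bool:
    """Length check: suspiciously short answers fail."""
    return len(response.strip()) >= min_chars


def passes_quality_check(response: str, required_keys: set[str]) -> bool:
    """All checks must pass for the cheap model's answer to be returned."""
    return (
        valid_json_schema(response, required_keys)
        and free_of_hedging(response)
        and long_enough(response)
    )


good = '{"category": "billing", "summary": "Customer was double-charged for the May invoice and requests a refund."}'
print(passes_quality_check(good, {"category", "summary"}))   # True
print(passes_quality_check("I'm not sure.", {"category"}))   # False
```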
Memory-Informed Routing
When the routing system has access to persistent memory about previous routing decisions and their outcomes, it can make better decisions without needing a more complex classifier. Memory-informed routing stores observations like "classification requests about billing topics are handled well by Haiku" and "multi-step debugging requests always need Sonnet or higher." On future requests, the router recalls relevant routing memories and uses them to inform the model selection, combining the speed of rule-based routing with the adaptability of ML-based classification.
Adaptive Recall's cognitive scoring is well-suited for routing memory because it surfaces recent, frequently accessed, and corroborated routing observations first. A routing decision that consistently works (stored multiple times with corroborating evidence) receives high confidence and influences future routing strongly. A routing decision that sometimes fails (contradicted by failure observations) receives lower confidence and triggers more conservative routing (using a larger model to be safe). This self-correcting behavior means the routing system improves over time without explicit retraining.
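Independent of any particular memory store, the recall-and-decide step can be sketched as follows. The in-memory list, the recency and corroboration weighting, and all names are simplified stand-ins for the cognitive scoring described above; failures subtract from the score, which is what pushes uncertain topics up a tier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Memory-informed routing: recall past routing outcomes for similar requests
# and pick the cheapest model with strong, corroborated evidence of success.

TIER_ORDER = ["claude-haiku", "claude-sonnet", "claude-opus"]


@dataclass
class RoutingMemory:
    topic: str          # e.g. "billing classification"
    model: str          # model that handled the request
    succeeded: bool     # did the response pass quality checks?
    observed_at: datetime


def confidence(memories: list[RoutingMemory], topic: str, model: str, now: datetime) -> float:
    """Corroboration- and recency-weighted success score for (topic, model)."""
    relevant = [m for m in memories if m.topic == topic and m.model == model]
    if not relevant:
        return 0.0
    score = 0.0
    for m in relevant:
        age_days = (now - m.observed_at).days
        recency = 0.5 ** (age_days / 30)               # halve the weight every 30 days
        score += recency if m.succeeded else -recency  # failures contradict successes
    return score / len(relevant)


def route_with_memory(memories: list[RoutingMemory], topic: str, threshold: float = 0.6) -> str:
    """Choose the cheapest model whose confidence clears the threshold."""
    now = datetime.now(timezone.utc)
    for model in TIER_ORDER:
        if confidence(memories, topic, model, now) >= threshold:
            return model
    return TIER_ORDER[1]  # no strong evidence yet: route conservatively to the mid tier


# Example: three corroborating successes for Haiku on billing classification.
now = datetime.now(timezone.utc)
memories = [
    RoutingMemory("billing classification", "claude-haiku", True, now - timedelta(days=2)),
    RoutingMemory("billing classification", "claude-haiku", True, now - timedelta(days=10)),
    RoutingMemory("billing classification", "claude-haiku", True, now - timedelta(days=40)),
]
print(route_with_memory(memories, "billing classification"))  # claude-haiku
print(route_with_memory(memories, "multi-step debugging"))    # claude-sonnet (no evidence)
```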
Measuring Routing Effectiveness
Track three metrics to evaluate your routing system: cost savings (the difference between actual spend and what you would have spent using a single model for all requests), quality parity (the percentage of routed requests that meet the same quality threshold as the single-model baseline), and escalation rate (for cascading systems, the percentage of requests that escalate to a more expensive model). Effective routing produces cost savings above 25 percent, quality parity above 95 percent, and escalation rates below 20 percent. If quality parity drops below 90 percent, the routing is too aggressive and the tier boundaries need adjustment.
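Computing the three metrics from a per-request log is straightforward. The record fields, the assumed baseline price, and the sample log below are illustrative.

```python
from dataclasses import dataclass

# Routing effectiveness metrics from a per-request log.
# Field names, the baseline price, and the sample data are illustrative.

BASELINE_COST_PER_REQUEST = 0.010  # what the single mid-tier model would have cost


@dataclass
class RoutedRequest:
    actual_cost: float     # what the routed request actually cost
    met_quality_bar: bool  # passed the same quality threshold as the baseline
    escalated: bool        # cascading only: escalated past the first model


def routing_report(log: list[RoutedRequest]) -> dict[str, float]:
    n = len(log)
    baseline_spend = n * BASELINE_COST_PER_REQUEST
    actual_spend = sum(r.actual_cost for r in log)
    return {
        "cost_savings": 1 - actual_spend / baseline_spend,
        "quality_parity": sum(r.met_quality_bar for r in log) / n,
        "escalation_rate": sum(r.escalated for r in log) / n,
    }


log = (
    [RoutedRequest(0.002, True, False)] * 6    # simple requests on the economy model
    + [RoutedRequest(0.010, True, False)] * 3  # moderate requests on the mid tier
    + [RoutedRequest(0.030, True, True)]       # one escalated request
)
print(routing_report(log))
# ≈ 28% savings, 100% parity, 10% escalation rate with this sample log.
```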
Make routing smarter with memory. Adaptive Recall stores routing outcomes and helps your system learn which requests need which models, improving accuracy over time without retraining classifiers.
Get Started Free