Most prompt engineering advice assumes you've already picked a model. You tune the wording, adjust the temperature, add few-shot examples — all to coax better output from one fixed endpoint. But the single biggest quality-per-dollar improvement I've made this year had nothing to do with prompt text. It was routing different prompts to different models.
The Idea in 30 Seconds
Model routing is dead simple in concept: instead of sending every request to your best (most expensive) model, you classify the incoming prompt and send it to the cheapest model that can handle it well. Simple classification? Haiku. Complex multi-step reasoning? Opus. Code generation with tricky edge cases? DeepSeek-R1. Summarization? Sonnet.
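The whole idea fits in a dozen lines. Here's a deliberately naive sketch — the keyword heuristic is a placeholder for a real classifier, and the tier-to-model mapping is illustrative:

```python
# Minimal illustration of the routing idea. The keyword heuristic is a
# stand-in for a real classifier; model names are illustrative.
def route(prompt: str) -> str:
    """Pick the cheapest model that can plausibly handle the prompt."""
    p = prompt.lower()
    if any(k in p for k in ("prove", "step by step", "debug", "refactor")):
        return "claude-opus-4-6"        # complex reasoning / tricky code
    if any(k in p for k in ("summarize", "rewrite", "draft")):
        return "claude-sonnet-4-6"      # standard writing tasks
    return "claude-haiku-4-5"           # default: cheap tier

print(route("Summarize this changelog"))   # -> claude-sonnet-4-6
print(route("Debug this race condition"))  # -> claude-opus-4-6
```

Real routers replace the keyword check with a cheap model call or a trained classifier, but the shape of the decision never gets more complicated than this.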
The savings are absurd. A coding agent session that burns $6.00 per 100 calls without routing drops to $1.26 with it — 79% gone, same output quality. Production systems routinely report 30–50% cost cuts with naive routing and 60–85% with aggressive optimization stacks.
FineRouter and the Death of Manual Tiers
A March 2026 paper introduced FineRouter, and it's the most rigorous take on routing I've come across. The core insight: manually sorting prompts into "simple" vs. "complex" buckets is too coarse. You leak money and quality at the boundaries. FineRouter instead discovers 332 fine-grained task types automatically, using graph-based clustering on model preference data.
The architecture runs two stages. First, a Leiden community detection algorithm clusters historical prompts by both their semantic similarity and which models actually performed best on them — so "summarize this legal document" and "summarize this changelog" might land in different clusters if different models win on each. Then a mixture-of-experts model handles live routing, with specialized prediction heads for each discovered task type.
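To make the first stage concrete, here's a heavily simplified toy version. It replaces Leiden with connected components over a graph where prompts are linked only when they're both semantically close and won by the same model. The embeddings, prompts, and winners are all fabricated for illustration:

```python
# Toy sketch of preference-aware clustering, heavily simplified: prompts link
# when they are semantically similar AND the same model wins on them; connected
# components stand in for Leiden communities. All data here is fabricated.
from itertools import combinations

prompts = {
    "summarize this legal document": ((1.0, 0.2), "claude-opus-4-6"),
    "summarize this contract":       ((0.9, 0.3), "claude-opus-4-6"),
    "summarize this changelog":      ((0.2, 1.0), "claude-haiku-4-5"),
    "summarize these release notes": ((0.3, 0.9), "claude-haiku-4-5"),
}

def similar(a, b, thresh=0.9):
    """Cosine similarity above a threshold."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) > thresh

# Union-find: merge prompt pairs that agree on both winner and geometry.
parent = {p: p for p in prompts}
def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

for a, b in combinations(prompts, 2):
    (emb_a, win_a), (emb_b, win_b) = prompts[a], prompts[b]
    if win_a == win_b and similar(emb_a, emb_b):
        parent[find(a)] = find(b)

clusters = {}
for p in prompts:
    clusters.setdefault(find(p), []).append(p)
print(list(clusters.values()))  # two "summarize" clusters, split by winner
```

The output is exactly the behavior described above: all four prompts say "summarize," but they split into two clusters because different models win on each — which is the information a coarse simple/complex bucket throws away.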
They tested across 10 benchmarks with 11 frontier models — Claude Sonnet 4.5, DeepSeek-R1, Llama 4 Maverick, Qwen3-235B, and seven others. FineRouter scored 79.9 average versus 76.3 for the strongest baseline (Amazon's Intelligent Prompt Routing). The kicker: it cost less than half what running the single best model on everything would.
Better results at half the price. Not from rewriting a single prompt — from sending the same prompts to the right destination.
What This Looks Like Without a PhD
You don't need an academic pipeline to start routing. The simplest production version is a config-driven lookup table:
```yaml
routing:
  classifier:
    model: gpt-4o-mini        # Cheapest model classifies the task
    max_tokens: 50
  tiers:
    simple:                   # Q&A, formatting, extraction
      model: claude-haiku-4-5
      cost: "~$0.25/M tokens"
    standard:                 # Summarization, general writing
      model: claude-sonnet-4-6
      cost: "~$3.00/M tokens"
    complex:                  # Multi-step reasoning, code review
      model: claude-opus-4-6
      cost: "~$15.00/M tokens"
  fallback:
    strategy: escalate        # Confidence < 0.7 → bump to next tier
```
The classifier itself runs on the cheapest model in your stack. You spend fractions of a cent to decide where to spend real money. If 80% of your traffic is simple-tier material — and in most apps, it is — your bill drops roughly 70% overnight.
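The escalation fallback is the one piece worth spelling out in code. A sketch, with the classifier call stubbed out and tier names mirroring the config above:

```python
# Sketch of the escalation fallback: when the classifier's confidence is
# below the floor, bump the request up one tier rather than trusting a
# cheap model with a possibly-hard task. The classifier itself is stubbed.
TIERS = ["simple", "standard", "complex"]
MODELS = {
    "simple": "claude-haiku-4-5",
    "standard": "claude-sonnet-4-6",
    "complex": "claude-opus-4-6",
}
CONFIDENCE_FLOOR = 0.7  # matches the fallback strategy in the config

def pick_model(tier: str, confidence: float) -> str:
    """Escalate one tier whenever the classifier is unsure."""
    if confidence < CONFIDENCE_FLOOR and tier != "complex":
        tier = TIERS[TIERS.index(tier) + 1]
    return MODELS[tier]

print(pick_model("simple", 0.95))  # -> claude-haiku-4-5
print(pick_model("simple", 0.55))  # -> claude-sonnet-4-6 (escalated)
```

Escalating on low confidence means your misroutes skew expensive rather than wrong — the cheap failure mode, since the alternative is silently degraded answers.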
For off-the-shelf options, RouteLLM (open-source, from the LMSYS team behind Chatbot Arena) ships with pre-trained routers that maintain 95% of GPT-4-level quality at 85% less cost. OpenRouter offers a hosted "best-of" mode. Martian does intent classification per request. Portkey layers routing on top of enterprise observability. The space is crowded enough that you're picking between good options, not pioneering.
Routing Strategies, Ranked by Leverage
Static switching — one hardcoded model per API endpoint. Saves 10–15%. A rounding error.
Load balancing — distribute calls across providers offering the same model at different prices. 15–25% savings. Useful for uptime, marginal for cost.
Intent classification — a lightweight model examines each prompt, categorizes it, routes to the best-fit model. 30–50% savings. This is where serious money starts moving.
Hybrid caching + routing — pair semantic caching (serve a stored response if you've seen a similar prompt recently) with intent routing. 50–65% savings on workloads with repetition, which is most workloads.
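The cache-then-route control flow is simple enough to sketch. A real semantic cache matches on embedding similarity; the stand-in below matches on normalized text, which only catches trivial repeats, but the pipeline shape is the same:

```python
# Sketch of the cache-then-route pipeline. A real semantic cache matches on
# embedding similarity; this stand-in matches on normalized text instead,
# so it only catches trivial repeats -- the control flow is what matters.
import re

cache: dict[str, str] = {}

def normalize(prompt: str) -> str:
    return re.sub(r"\s+", " ", prompt.strip().lower())

def handle(prompt: str, route_and_call) -> str:
    key = normalize(prompt)
    if key in cache:                      # cache hit: zero model spend
        return cache[key]
    response = route_and_call(prompt)     # cache miss: fall through to routing
    cache[key] = response
    return response

calls = []
fake_llm = lambda p: calls.append(p) or f"answer to: {p}"
handle("What is routing?", fake_llm)
handle("  what is ROUTING? ", fake_llm)   # served from cache
print(len(calls))  # -> 1
```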
Intent classification hits the best payoff-to-effort ratio. You need an eval pipeline to keep the classifier honest — you can't blindly trust a small model's judgment forever — but the setup cost pays for itself within days at any meaningful scale.
The Failure Modes Nobody Warns You About
Routing has real costs that vendor landing pages conveniently omit.
Every routing hop adds 5–10ms latency. In a chat UI, invisible. In an agentic loop making 50 sequential calls, that's an extra quarter-second of compounding delay. You're also fragmenting context across providers. Route turn 1 to Claude and turn 3 to GPT, and neither model sees the full conversation unless you're managing that state explicitly.
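The usual fix for fragmented context is to own the transcript yourself: keep one provider-neutral message history and replay it in full on every call, regardless of which model handles the turn. A sketch, with the transport stubbed:

```python
# One way to avoid fragmenting context across providers: keep a single
# provider-neutral transcript and replay it in full on every call.
# send_to is a hypothetical transport function, stubbed here.
history: list[dict] = []

def chat_turn(user_msg: str, model: str, send_to) -> str:
    history.append({"role": "user", "content": user_msg})
    reply = send_to(model, history)       # model sees the whole conversation
    history.append({"role": "assistant", "content": reply})
    return reply

seen = {}
def fake_send(model, msgs):
    seen[model] = len(msgs)               # record how much context each saw
    return f"reply after {len(msgs)} messages"

chat_turn("turn 1", "claude-sonnet-4-6", fake_send)
chat_turn("turn 2", "gpt-4o-mini", fake_send)
print(seen)  # each provider saw the full transcript up to its turn
```

The cost of this fix is real: replaying the full history means you pay input tokens for it on every call, to every provider — another line item the routing savings have to cover.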
The nastiest failure is silent quality degradation. Misroute a complex reasoning task to a budget model, and you get confident-sounding nonsense. No error. No flag. Your users just quietly get worse answers while your cost dashboard looks great. Continuous evaluation isn't optional — it's the whole point.
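The cheapest version of continuous evaluation is a sampling audit: siphon off a small fraction of cheap-tier traffic and score it offline with a stronger model or a human. A sketch — the rate, tier name, and scoring step are all illustrative:

```python
# Sketch of a sampling audit for cheap-tier traffic: a random slice of
# responses gets queued for offline scoring by a stronger model or a human.
# The audit rate and tier name are illustrative choices.
import random

AUDIT_RATE = 0.05  # audit 5% of cheap-tier responses

audit_queue = []

def record(prompt: str, response: str, tier: str, rng=random.random):
    """Queue a sample of cheap-tier responses for offline quality scoring."""
    if tier == "simple" and rng() < AUDIT_RATE:
        audit_queue.append((prompt, response))  # scored later, out of band

# rng is injected here so the sampling is deterministic for the demo.
record("easy question", "easy answer", "simple", rng=lambda: 0.01)
record("easy question", "easy answer", "simple", rng=lambda: 0.99)
print(len(audit_queue))  # -> 1
```

Even 5% coverage turns silent degradation into a trend line you can watch, which is the difference between catching a misrouting regression in days versus months.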
There's a subtler organizational trap too. Once routing exists, every cost conversation becomes "can we push more traffic to the cheap tier?" rather than "should we write better prompts for the expensive tier?" Routing optimizes cost. Prompt work optimizes your quality ceiling. They're complementary, but I've watched teams discover routing and completely stop iterating on their prompts. Their bills went down. So did their output quality, slowly, in ways that took months to notice.
Bottom Line
If you're running LLM calls above hobby scale without routing, you're overpaying by at least 40%. That's the floor in every benchmark and production report I've seen this year.
Start dumb. Stick a classifier in front of your calls, route the obvious easy stuff to a small model, measure quality for a week. You'll have your answer fast enough that the only regret is not doing it sooner.