Two months ago I ran the same benchmark prompt through GPT-5 three times. Same API key, same temperature, same max tokens. The first response took 1.2 seconds and cost $0.003. The third took 11 seconds and cost $0.04. The only change: I'd prepended "analyze this thoroughly."

That three-word edit didn't just change the output. It changed which model answered.

The Dispatcher Behind the Curtain

By now, most practitioners know that GPT-5 is a router — a dispatch layer sitting in front of multiple specialized sub-models. OpenAI hasn't published the full architecture, but the community has reverse-engineered the basics from response latency patterns, token accounting, and the occasional tone wobble mid-response.

The router examines four signals before picking a destination:

  1. Conversation type — casual chat vs. technical analysis vs. creative writing

  2. Task complexity — subtle difficulty markers in your phrasing

  3. Tool needs — mentions of calculations, lookups, or data operations

  4. Explicit intent — phrases like "think deeply" vs. "quick summary"

Behind the router sits a fleet of models: a lightweight core for fast answers, a dedicated thinking model for multi-step reasoning, tool-equipped variants for code and API calls, and a mini fallback for low-stakes queries during traffic spikes. The cost gap between the lightweight core and the thinking model is 10–30x. Not a rounding error — a budget line.
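The community's reverse-engineered picture can be sketched as a scoring heuristic. To be clear: everything below is illustrative, not OpenAI's implementation. The cue phrases, sub-model names, and the cheap default are all assumptions meant to make the routing logic concrete.

```python
# Illustrative routing sketch: NOT OpenAI's actual implementation.
# Cue lists, model names, and the cheap-by-default rule are invented.

REASONING_CUES = {"think hard", "deep analysis", "analyze this thoroughly",
                  "step by step", "carefully evaluate"}
SPEED_CUES = {"quick take", "brief response", "quick summary", "tl;dr"}
TOOL_CUES = {"calculate", "look up", "query", "fetch"}

def route(prompt: str) -> str:
    """Pick a hypothetical sub-model from surface cues in the prompt."""
    p = prompt.lower()
    if any(cue in p for cue in SPEED_CUES):
        return "lightweight-core"   # cheap, fast path
    if any(cue in p for cue in REASONING_CUES):
        return "thinking-model"     # the 10-30x cost tier
    if any(cue in p for cue in TOOL_CUES):
        return "tool-variant"       # code / API execution
    return "lightweight-core"       # default: route cheap
```

Note what the default line implies: a genuinely hard prompt phrased casually ("hey can u fix this race condition?") matches no reasoning cue and falls through to the cheap model. That is the accidental-misrouting failure mode in miniature.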

OpenAI serves 700 million weekly active users. The economic incentive to route cheap is structural.

Prompt Words as Infrastructure Signals

Here's what makes this uncomfortable for prompt engineers. Your word choice now has infrastructure consequences that have nothing to do with output quality.

"Think hard about this" → triggers the reasoning model. "Provide deep analysis" → same. "Quick take" → lightweight core. "Brief response" → lightweight core.

You've been optimizing prompts for years along one axis: output quality. Now there's a second axis you're moving along whether you know it or not — compute allocation.

This creates a failure mode I've started calling accidental misrouting. You write a complex technical prompt but phrase it casually — maybe because your team's style is informal. The router reads the tone, not the underlying complexity, and dispatches to the lightweight model. You get a shallow response and blame the model. But the model was fine. The wrong one answered.

The inverse burns money. You phrase a simple lookup formally — "please carefully evaluate and provide a comprehensive analysis of the current weather in Tokyo" — and the router dutifully spins up the thinking model. Eight seconds of latency and a 10x cost multiplier for something a single weather-API call could handle.

The Tone Shift Tell

Users have started noticing something odd: mid-response personality changes. The answer starts formal and precise, then relaxes into casual diction. Or it opens conversational and suddenly tightens when the reasoning gets hard.

That's the router switching sub-models within a single response. The personality filters that stitch everything into a unified voice aren't perfect. You can catch it in the API's streaming latency too — a suspicious pause mid-stream where the handoff is happening.
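You can hunt for these handoff pauses yourself by timestamping streamed chunks. The detector below is generic over a list of arrival times; the two-second threshold is a guess, and nothing guarantees a long gap is a model handoff rather than ordinary network jitter.

```python
def find_pauses(arrival_times: list[float], threshold: float = 2.0) -> list[int]:
    """Return indices i where the gap between chunk i-1 and chunk i
    exceeds `threshold` seconds: candidate handoff points."""
    return [i for i in range(1, len(arrival_times))
            if arrival_times[i] - arrival_times[i - 1] > threshold]

# With any streaming client, record time.monotonic() as each chunk arrives:
#
#   times = []
#   for chunk in stream:
#       times.append(time.monotonic())
#   suspects = find_pauses(times)
```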

Claude Went the Other Way

Anthropic's approach to this problem looks nothing like OpenAI's. No hidden router guessing your intent from word choice. Instead, you get an explicit knob:

{
  "thinking": {
    "type": "enabled",
    "budget_tokens": 5000
  }
}

Or effort levels via the API: none, low, medium, high. You tell the model how hard to think. No surprises, no phantom routing.

The tradeoff is real — you might under-allocate on a problem you underestimated, or over-allocate on something trivial. But at least the cost and latency are predictable. You're not debugging why the same prompt costs $0.003 on Tuesday and $0.04 on Wednesday because your CI pipeline added "please be thorough" to the system prompt.

Practical Consequences

Use the API, not the chat interface. The API lets you call specific model variants directly — gpt-5, gpt-5-mini, gpt-5-nano. The chat interface auto-routes and you have zero control over which sub-model handles your request.

Set reasoning_effort explicitly. Don't let the router guess. OpenAI's own guidance recommends defaulting to none, low, or medium — reserving xhigh only for long, reasoning-heavy agentic chains.
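Both recommendations reduce to pinning two fields in the request instead of letting a router guess. A sketch, assuming the Chat Completions `reasoning_effort` parameter; verify which effort values your account and model actually accept.

```python
def build_openai_request(prompt: str, model: str = "gpt-5-mini",
                         effort: str = "low") -> dict:
    """Chat Completions payload that pins both the model variant and
    the reasoning effort, instead of leaving either to a router."""
    return {
        "model": model,              # gpt-5, gpt-5-mini, gpt-5-nano
        "reasoning_effort": effort,  # keep low/medium as the default
        "messages": [{"role": "user", "content": prompt}],
    }
```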

Strip sentiment from technical prompts. If you want the thinking model, request it with the parameter, not with emotional language. And if you want the fast model, watch out for complex-sounding jargon that might accidentally signal "hard problem" to the dispatcher.

Budget for the double-token trap. Independent testing has found the orchestration overhead — router decisions, sub-model handoffs, synthesis steps — sometimes doubles the token count compared to equivalent single-model calls. Factor that in before your CFO asks why the AI bill jumped.
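A back-of-envelope version of that budget check. The per-token price and the 2x overhead factor below are assumptions for illustration; substitute your own rates and a measured overhead.

```python
def monthly_cost(calls: int, avg_tokens: int, price_per_mtok: float,
                 overhead: float = 2.0) -> float:
    """Estimated monthly spend, inflating raw tokens by an orchestration
    overhead factor (2.0 models the 'double-token trap')."""
    return calls * avg_tokens * overhead * price_per_mtok / 1_000_000

# 1M calls/month at 1,500 tokens each and an assumed $2 per million tokens:
# $3,000 without overhead, $6,000 once orchestration doubles the tokens.
print(monthly_cost(1_000_000, 1500, 2.0, overhead=1.0))  # 3000.0
print(monthly_cost(1_000_000, 1500, 2.0))                # 6000.0
```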

The assumption behind a decade of prompt engineering was simple: you're talking to one model, and your words shape its output. That assumption broke. Your words now shape which model answers, how much compute it gets, and what you pay. The prompt became a routing table, and most teams haven't updated their mental model to match.