You spent three days on that system prompt. Ran it through eval suites, tuned the wording, squeezed out every last percentage point. Hit 87% accuracy on your test set. Shipped it. And then the support tickets started rolling in — not from random inputs, but from a very specific kind of input your one-size-fits-all prompt quietly botched.
This pattern has a name now, and a fix.
## The One-Prompt Assumption
Most prompt engineering advice treats the prompt as a single artifact. You write it, test it against a benchmark, maybe version-control it, deploy it as a constant. Every input — simple or complex, numeric or verbal, familiar or edge-case — hits the same instruction set.
This works fine when your input distribution is narrow. A chatbot answering questions about a single product? One prompt is probably enough. But the moment your inputs span different reasoning styles — arithmetic in one query, causal reasoning in the next, spatial logic after that — a single prompt starts making tradeoffs. It gets good at the average case and silently degrades on everything else.
The research community has been poking at this for a while. The latest and most practical formulation is called Instance-Adaptive Prompting, or IAP.
## Saliency Scoring: Measuring Prompt-Input Fit
IAP's core insight is deceptively simple: not all prompts help all inputs equally, and you can measure how well a specific prompt serves a specific input before committing to it.
The mechanism tracks information flow through three channels:
- **Question → Prompt.** Does the prompt reflect what the question is actually asking? A prompt that says "think step by step" might be perfect for a math problem but useless for a factual recall question where the answer is either known or not.
- **Question → Rationale.** How much of the question's content makes it into the model's reasoning? If the model is ignoring key details from the input, the prompt isn't doing its job for this particular case.
- **Prompt → Rationale.** Is the prompt actually shaping the reasoning, or is the model just doing its default thing regardless? Some prompts look good on paper but have zero causal influence on certain input types.
IAP computes saliency scores across these three flows for each candidate prompt against each incoming input. The prompt with the highest combined score wins. It's not magic — it's measuring token-level attention patterns to see which instruction set actually moves the needle for this specific question.
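As a toy illustration (not the paper's exact computation), treat attention weights as a token-by-token matrix and average the weight flowing along each of the three channels:

```python
def flow_score(attn, src_idx, dst_idx):
    """Mean attention flowing from src token positions to dst token positions."""
    total = sum(attn[i][j] for i in src_idx for j in dst_idx)
    return total / (len(src_idx) * len(dst_idx))

def saliency(attn, q_idx, p_idx, r_idx):
    """Combined saliency across the three flows IAP tracks.

    attn[i][j] is a toy attention weight from token i to token j;
    q_idx, p_idx, r_idx index the question, prompt, and rationale tokens.
    """
    return (flow_score(attn, q_idx, p_idx)    # question -> prompt
            + flow_score(attn, q_idx, r_idx)  # question -> rationale
            + flow_score(attn, p_idx, r_idx)) # prompt -> rationale
```

In practice the attention weights come from the model's own attention layers; the point of the sketch is only that the score decomposes into three measurable flows, and the candidate prompt with the highest sum wins.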
Two selection strategies fall out of this. IAP-ss (Sequential Substitution) evaluates prompts one at a time and stops when it finds one that scores above threshold — faster, solid for production where latency matters. IAP-mv (Majority Vote) runs all candidate prompts and picks the answer that the highest-scoring prompts agree on — slower but more accurate when you can afford the compute.
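The two strategies can be sketched in a few lines. Here `score` and `answer` are hypothetical callables standing in for the saliency computation and the actual model call, respectively:

```python
from collections import Counter

def iap_ss(prompts, score, threshold):
    """Sequential Substitution: take the first prompt scoring above threshold."""
    for p in prompts:
        if score(p) >= threshold:
            return p
    return max(prompts, key=score)  # fallback: best available candidate

def iap_mv(prompts, score, answer, top_k=3):
    """Majority Vote: run the top-k scoring prompts, return their consensus answer."""
    ranked = sorted(prompts, key=score, reverse=True)[:top_k]
    votes = Counter(answer(p) for p in ranked)
    return votes.most_common(1)[0][0]
```

Note the cost asymmetry built into the structure: `iap_ss` can stop after a single candidate, while `iap_mv` always pays for `top_k` model calls.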
## The Gains Are Real, and Lopsided
IAP doesn't improve everything equally. The results are wildly skewed depending on task type:
| Task | Type | Standard Zero-Shot CoT | IAP-mv | Delta |
|---|---|---|---|---|
| GSM8K | Arithmetic | 64.5% | 66.3% | +1.8 |
| SVAMP | Math word problems | 73.7% | 77.3% | +3.6 |
| Causal Judgment | Logic / causality | 18.2% | 30.0% | +11.8 |
| CommonsenseQA | Common sense | 65.0% | 68.4% | +3.4 |
That causal judgment number jumps off the page. Nearly twelve percentage points from just picking a different prompt per input. The arithmetic gains are modest — "think step by step" already works decently for math, so there's less room to optimize. But for tasks where the optimal reasoning strategy varies wildly between instances, adaptive selection is a different game entirely.
This matches intuition if you stop and think about it. Would you use the same prompting strategy for "What causes inflation?" and "If I drop a ball from a moving train, where does it land?" Both benefit from chain-of-thought. But the kind of chain-of-thought they need — economic reasoning versus spatial physics — is fundamentally different. Forcing both through the same template costs you.
## Building This Without a PhD
The research papers make IAP sound like it requires gradient-based saliency analysis on every request. The production version is more pragmatic. You need two things: a prompt library and a routing mechanism.
```python
PROMPT_LIBRARY = {
    "analytical": "Break this into component parts and analyze each separately.",
    "step_by_step": "Work through this one step at a time, showing your math.",
    "analogical": "Find a simpler analogous problem, solve that, apply the pattern.",
    "elimination": "List possible answers, eliminate wrong ones with evidence.",
    "direct": "Answer directly and concisely.",
}

async def select_prompt(query: str, classifier) -> str:
    """Route to the best prompt based on query characteristics."""
    query_type = await classifier.classify(query)
    # Fall back to the direct prompt if the classifier returns an unknown label.
    return PROMPT_LIBRARY.get(query_type, PROMPT_LIBRARY["direct"])
```
The classifier can be as dumb as a few-shot LLM call that categorizes the input ("is this a math problem, a logic puzzle, a factual question, or a creative task?") or as fancy as a fine-tuned embedding model. Microsoft's PromptWizard and Meta's prompt-ops both support this pattern at a higher level of abstraction. Promptfoo lets you eval different prompt variants against the same test set, which is the essential first step — once you know which prompts work best for which input types, building the router is straightforward.
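On the "dumb" end of that spectrum, even keyword heuristics can serve as a placeholder while you collect routing data. A minimal stand-in (the function name and category labels are assumptions matching the library above, not part of any framework):

```python
import re

def classify(query: str) -> str:
    """Heuristic stand-in for an LLM-based classifier: keyword routing.

    Returns a key from PROMPT_LIBRARY; a production router would replace
    this with a few-shot LLM call or a fine-tuned embedding model.
    """
    q = query.lower()
    if re.search(r"\d", q) or any(w in q for w in ("sum", "how many", "calculate")):
        return "step_by_step"   # arithmetic-looking queries
    if any(w in q for w in ("why", "cause", "because")):
        return "analytical"     # causal / explanatory queries
    if "which of" in q or "options" in q:
        return "elimination"    # multiple-choice-style queries
    return "direct"             # default: no special reasoning scaffold
```

The heuristic version is wrong often enough that you wouldn't ship it, but it makes the interface concrete: anything that maps a query string to a library key slots into `select_prompt` above.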
The harder question is granularity. Do you route by topic? By reasoning complexity? By expected output format? The research on automatic technique selection suggests that semantically clustering your inputs into 5–10 buckets, then mapping each bucket to a prompt variant, captures most of the gain. Going finer hits diminishing returns fast. A separate line of work on constraint-based prompt assembly takes a different approach: instead of picking from a fixed library, it composes prompts from a pool of ~15 techniques — role assignment, reasoning strategy, emotional stimulus — constrained so each generated prompt combines complementary elements. On BIG-Bench Hard tasks, this approach beat both vanilla prompts and Anthropic's built-in prompt generator.
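The bucket-mapping idea can be sketched with nearest-seed routing: label a handful of representative queries per bucket, then send each incoming query to the bucket its text most resembles. The bag-of-words embedding here is a deliberately crude stand-in for a real sentence-embedding model:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; swap in a sentence-embedding model in practice."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(query, buckets):
    """Map a query to the bucket whose seed examples it most resembles.

    `buckets` maps a bucket name to a list of representative queries;
    each bucket would then map to one prompt variant.
    """
    q_vec = embed(query)
    def best_seed_score(name):
        return max(cosine(q_vec, embed(seed)) for seed in buckets[name])
    return max(buckets, key=best_seed_score)
```

With 5–10 buckets and a few dozen seed queries each, this is cheap enough to run on every request, and the seed lists double as a labeled set for evaluating the router itself.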
## The Prompt Isn't the Product Anymore
What all this really signals is a shift in how prompts work in production. The era of "one system prompt to rule them all" is ending for any application with meaningfully diverse inputs.
Your carefully crafted prompt is still valuable — it just becomes one entry in a library instead of a singleton. The engineering challenge moves from "write the best prompt" to "build the best prompt selector." Which, if you squint, is exactly what happened to ML models a decade ago. Ensembles and routing beat any single model. The same dynamic is playing out one layer up.
The tooling exists. Promptfoo for eval, PromptWizard for optimization, the growing IAP research for the theoretical backbone. The teams that figure out prompt routing at the application layer — not just model routing at the infrastructure layer — are going to have a quiet, compounding edge that's hard to reverse-engineer from the outside.