A prompt template that gave gpt-4o a four-point accuracy boost on GSM8K turned around and cost gpt-5 over two points on the same benchmark. Same rules, same problems, opposite sign on the delta. That's the headline finding from the "Sculpting" paper, and if you have any production prompts older than six months, it should bother you.

The technique itself is the kind of thing every prompt engineer has shipped at some point. Tell the model it is a "pure mathematical reasoning engine." Forbid outside knowledge. Force step-by-step decomposition. Pin the answer format. Four bullets, very disciplined, very legible. It works beautifully — until the model gets smart enough to take you seriously.

The prompt that flipped

Here is the actual template the authors used. Nothing exotic, nothing you haven't written a variation of:

You are a pure mathematical reasoning engine. You must solve the
following problem.

Rules:
1. You must use ONLY the numbers and relationships given in the problem.
2. You must NOT use any outside common sense or real-world knowledge.
3. You must break down your calculation step-by-step.
4. State your final answer clearly prefixed with "Final Answer:"

Problem: [Question Text]

Compare against vanilla CoT — basically "Let's think step by step" tacked onto the same problem — and the constrained variant looks strictly better on paper. More structure. Fewer escape hatches. A grader can parse the output. This is exactly the prompt your team lead would approve in a review.
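If you want the two variants side by side as code, here is a minimal sketch in Python. The template text is the paper's; the constant names and the `build_prompt` helper are mine, not anything the authors ship:

```python
# The rule-laden "Sculpting" template, verbatim from the paper.
SCULPTING_TEMPLATE = """You are a pure mathematical reasoning engine. You must solve the
following problem.

Rules:
1. You must use ONLY the numbers and relationships given in the problem.
2. You must NOT use any outside common sense or real-world knowledge.
3. You must break down your calculation step-by-step.
4. State your final answer clearly prefixed with "Final Answer:"

Problem: {question}"""

# Vanilla chain-of-thought: the question plus the classic nudge.
COT_TEMPLATE = """{question}

Let's think step by step."""

def build_prompt(question: str, constrained: bool) -> str:
    """Return either the rule-laden or the vanilla CoT prompt."""
    template = SCULPTING_TEMPLATE if constrained else COT_TEMPLATE
    return template.format(question=question)
```

Same question in, two prompts out — which is exactly what makes the benchmark comparison clean.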

The inversion, in one table

Three model generations, GSM8K, three prompting strategies. Pulled straight from the paper:

Model         Zero-shot   Standard CoT   Sculpting
gpt-4o-mini   86%         91%            93%
gpt-4o        ~89%        93%            97%
gpt-5         n/a         96.36%         94.00%

Read the last column. The constrained, rule-laden version climbs as the model gets stronger — until gpt-5, where it falls off a cliff relative to plain CoT. The bigger model does worse with the careful prompt.

Why the smarter model trips

The error analysis is the part worth pinning to your team channel. On gpt-4o-mini and gpt-4o, the rule "do not use outside common sense" is a guardrail. It blocks failure modes like inventing a tax rate, assuming a calendar month is 30 days, or sneaking in a real-world price for a hypothetical item. The mid-tier model genuinely benefits from being told to stay inside the box.

Frontier-class models do not need that nudge. They are already disciplined about staying in the problem. So the rule stops doing useful work — and starts doing harm. The paper tracks the new failures and they are almost comedic. "Two times older" gets parsed as a literal multiplication on age rather than the idiom every middle-schooler reads correctly. "The same price as the other shirt" gets flagged as ambiguous because the rule said don't infer. Multi-step solutions get rejected because the model worries it is invoking outside knowledge by, you know, knowing how arithmetic works.

The authors call this the Guardrail-to-Handcuff transition. Constraints that prevented one class of error in a less capable model induce a fresh class of error in a more capable one. The rules did not get worse. The reader did.

A short detour

Every framework deck for the last three years has told you that more structure is safer. It often is. It is not always.

What to actually do

A few things worth changing on Monday morning.

Re-evaluate your prompts each time you change the model. This is the boring answer and the only correct one. The paper's whole point is that prompt quality is not a property of the prompt — it is a property of the prompt-times-model pair. Treat a model upgrade like a library upgrade and rerun your evals. If you swapped from gpt-4o to gpt-5 and your "constrained reasoning" template did not get re-benchmarked, you are probably leaving accuracy on the floor right now.
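A model upgrade then becomes an eval rerun, not a diff review. Here is a minimal harness sketch — `call_model` is a stand-in for whatever API wrapper you actually use, and the answer extraction is a deliberately simple heuristic, not the paper's grader:

```python
from typing import Callable

def gsm8k_accuracy(call_model: Callable[[str], str],
                   dataset: list[tuple[str, str]]) -> float:
    """Fraction of problems where the model's final answer matches gold.

    `call_model` takes a prompt and returns the model's reply;
    `extract_final` pulls whatever follows the "Final Answer:" marker,
    falling back to the last token of the reply.
    """
    def extract_final(reply: str) -> str:
        marker = "Final Answer:"
        tail = reply.split(marker, 1)[1] if marker in reply else reply
        return tail.strip().split()[-1].rstrip(".") if tail.strip() else ""

    correct = sum(extract_final(call_model(q)) == gold for q, gold in dataset)
    return correct / len(dataset)
```

Run it once per (model, prompt-template) pair every time the model name in your config changes, the same way you would rerun the test suite after bumping a dependency.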

Strip rules that exist to prevent failures the new model does not have. A useful audit question for every constraint in your system prompt: what specific failure mode does this rule prevent, and does the current model still produce that failure? If the answer is "I added it back when the model would hallucinate units" and the current model never hallucinates units, the rule is dead weight at best and a hyper-literalism trigger at worst.
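One way to run that audit mechanically is a one-rule-at-a-time ablation. This sketch assumes you have a `score` function that runs your eval with a given rule set — the name and shape are hypothetical, not from the paper:

```python
from typing import Callable

def ablate_rules(rules: list[str],
                 score: Callable[[list[str]], float]) -> dict[str, float]:
    """Score the prompt with each rule removed in turn.

    `score` stands in for "run the eval with this rule set". A rule
    whose removal leaves the score flat is dead weight; a positive
    delta means the rule was actively hurting the current model.
    """
    baseline = score(rules)
    deltas = {}
    for i, rule in enumerate(rules):
        without = rules[:i] + rules[i + 1:]
        deltas[rule] = score(without) - baseline
    return deltas
```

It is O(rules) eval runs, which is cheap compared to shipping a handcuff to production.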

Be careful with identity framing on capable models. "You are a pure mathematical reasoning engine" sounds harmless. On a model that takes role assignments seriously, it can suppress the very common-sense bridging that makes natural-language word problems solvable. The bigger the model, the more literally it inhabits whatever character you hand it. Pick the costume carefully.

Keep the format rule. Out of the four Sculpting constraints, the one that stays useful across all three generations is the answer-formatting line. That's a structural ask, not a reasoning constraint, and structural asks compose well with smarter models. Output schema, yes. Cognitive leash, increasingly no.

The uncomfortable part

Most production prompt libraries are layered fossil records. Each rule was added in response to a specific incident: a wrong unit, a leaked PII string, a refused refund. The rules accumulate. They almost never get removed, because removing one feels like inviting the original bug back, and nobody wants to be the person who deleted the safety line right before a regression.

The Sculpting result is the empirical case for spring cleaning. Some of those rules are still load-bearing. Some of them are quietly costing you points on the model you are paying the most for. You can't tell which is which without re-running the eval — but the inversion is real enough that "we already tested this prompt" is no longer a defense.

The version of you that wrote those guardrails was solving a problem that no longer exists. The model grew up. The prompt didn't.
