I audited a client's production system prompt last month. 340 words long. Fourteen of those instructions started with "don't," "never," or "avoid." The model was violating nine of the fourteen. Not sometimes — consistently, across thousands of requests.
## The Compliance Numbers Are Brutal
A research team tested six constraint types across six frontier models — Claude 3.5 Sonnet, GPT-4o, GPT-4o mini, Llama 3.1 (8B and 70B), and Qwen 2.5-7B. Each model received two conflicting instructions: one in the system message, one in the user message, with explicit priority designation telling the model which one mattered more. They ran 1,200 combinations per constraint type.
The obedience rates:
| Model | Follows the prioritized instruction |
|---|---|
| Qwen 2.5-7B | 9.6% |
| Llama 3.1-8B | 10.1% |
| Llama 3.1-70B | 16.4% |
| Claude 3.5 Sonnet | 29.9% |
| GPT-4o | 40.8% |
| GPT-4o mini | 45.8% |
The best model followed the prioritized instruction less than half the time. The worst was under ten percent. And bigger didn't mean better — GPT-4o scored lower than its mini variant.
Here's the part that should worry you: models almost never acknowledged the conflict existed. Acknowledgment rates ranged from 0% to 20.3% across all models. They didn't say "these instructions conflict." They just quietly picked whichever constraint aligned with their inherent biases and ignored the other.
The study also found every model had strong built-in preferences — favoring lowercase, preferring longer responses, tending to avoid keywords rather than include them. When a negative constraint fought against one of these biases, the bias won almost every time.
## The Pink Elephant in the Token Stream
Why does negation specifically fail? Because of how token prediction works.
When you write "Don't mention pricing," the model processes every token in that sentence — including "mention" and "pricing." Those tokens activate the exact associations you want suppressed. Cognitive scientists have a name for the human version of this: Ironic Process Theory. Tell someone "don't think about a pink elephant" and they have to process the concept of a pink elephant to know what to suppress. The suppression attempt is self-defeating.
LLMs aren't brains. But the mechanism maps well enough to be useful as a mental model. The model must attend to the forbidden concept to parse the constraint, and that attention leaks into generation. The tokens for "mention" and "pricing" become primed — weighted slightly higher in the probability distribution for subsequent tokens. You've placed the very thing you want avoided directly into the model's working context.
Research on InstructGPT showed this gets worse at scale. Larger models build stronger associations between concepts, making negative constraints harder to enforce, not easier. The smarter the model, the more thoroughly it processes what you told it not to do.
## Rewriting the Bugs
The fix is mechanical. Every negative constraint has a positive reframe that performs better:
```
# Before (these are bugs)
Don't use markdown formatting in your response.
Don't include fields with null values.
Never create new files — modify existing ones.
Don't write verbose comments.
Avoid technical jargon the user won't understand.
```

```
# After (these are instructions)
Respond in plain text only.
Only include fields that have a non-null value.
Apply all changes to existing files only.
Keep comments to one line. Only add them when intent is non-obvious.
Use language a non-technical stakeholder would understand.
```
Two things happen in each rewrite. First, the forbidden concept disappears from the instruction text — you stop priming the model with tokens it should avoid. Second, you replace a vague prohibition with a concrete target. "Don't use jargon" tells the model what to evade. "Use language a non-technical stakeholder would understand" tells it where to aim. One is a constraint. The other is a specification.
The "Control Illusion" researchers found that adding explicit priority markers — literally tagging instructions as [REQUIRED] or [HIGHEST PRIORITY] — improved GPT-4o's obedience from 40.8% to 80.7%. Combine positive framing with explicit markers and you get close to reliable compliance. Not guaranteed. Closer.
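In practice this combination is easy to automate. Here's a minimal sketch of a prompt builder that applies both fixes at once; the `[REQUIRED]`/`[PREFERRED]` tag names and the `build_system_prompt` helper are my own illustration, not the study's exact format:

```python
def build_system_prompt(required: list[str], preferred: list[str]) -> str:
    """Assemble a system prompt from positively framed constraints,
    each tagged with an explicit priority marker."""
    lines = [f"[REQUIRED] {rule}" for rule in required]
    lines += [f"[PREFERRED] {rule}" for rule in preferred]
    return "\n".join(lines)

prompt = build_system_prompt(
    required=["Respond in plain text only.",
              "Only include fields that have a non-null value."],
    preferred=["Keep comments to one line."],
)
print(prompt)
```

Every rule in the lists should already be a positive specification; the builder just makes the hierarchy explicit so the model can't claim (silently) that it didn't know which instruction won.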
The JSON example is worth dwelling on. "Don't include fields with null values" gives the model permission to think about including fields. It activates the entire schema. "Only include fields that have a non-null value" narrows the model's attention to the filter operation. In practice, the positive version produces cleaner output on first generation — fewer retries, fewer validation failures.
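The same filter-not-forbid framing carries over to post-processing. If you normalize the output yourself, the defensive code is one line, and it expresses the positive version of the rule, not the negative one:

```python
def drop_null_fields(record: dict) -> dict:
    # Keep only fields that have a non-null value — the positive
    # framing of "don't include fields with null values."
    return {k: v for k, v in record.items() if v is not None}

drop_null_fields({"name": "Ada", "email": None, "age": 36})
# → {"name": "Ada", "age": 36}
```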
## Compound Failure Math
Most production prompts don't have one negative constraint. They stack ten, fifteen, twenty. Each one has some independent probability of being violated.
If you're generous and assume 60% compliance per negative constraint, ten of them give you 0.6^10 — roughly a 0.6% chance of all ten holding simultaneously. That means 99.4% of responses break at least one rule.
Rewriting to positive framing won't get you to 100%. But if it bumps each constraint to 85% compliance, your probability of full compliance jumps to about 20%. Still imperfect, but 33 times better than where you started.
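The arithmetic above is worth checking directly. Assuming each constraint is violated independently:

```python
def full_compliance(p_per_constraint: float, n_constraints: int) -> float:
    """Probability that all n independent constraints hold at once."""
    return p_per_constraint ** n_constraints

before = full_compliance(0.60, 10)  # ≈ 0.006, so ~99.4% of responses break a rule
after = full_compliance(0.85, 10)   # ≈ 0.197, roughly 20%
print(f"before={before:.4f}  after={after:.4f}  gain={after / before:.0f}x")
```

Independence is a simplifying assumption — in a real prompt, violations likely correlate — but the shape of the curve is the point: per-constraint gains compound exponentially.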
## Three Patterns That Ship
Positive reframing handles the low-hanging fruit. Do this first. Read through your system prompt and rewrite every "don't" as a positive specification.
Explicit priority tags address the hierarchy problem. The research showed 20-40 percentage point improvements on some models when constraints were explicitly marked with priority levels. It makes your prompt uglier. It makes your prompt work.
Programmatic validation covers the gap. For constraints that absolutely must hold — valid JSON output, no leaked system prompts, required fields present — don't trust the prompt. Validate the output. Parse it. Check it. The prompt is a request. Your validation layer is the guarantee.
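A minimal sketch of that validation layer, assuming a hypothetical schema with two required fields (`name` and `status` are placeholders, not anyone's real API):

```python
import json

REQUIRED_FIELDS = {"name", "status"}  # hypothetical schema

def validate_response(raw: str) -> dict:
    """The prompt is a request; this check is the guarantee."""
    data = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if any(v is None for v in data.values()):
        raise ValueError("null-valued field slipped through")
    return data
```

On failure you retry, repair, or route to a fallback — whatever your pipeline does with any other bad input. The key property is that a violated constraint now throws instead of shipping.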
The uncomfortable truth about prompt engineering in 2026: your instructions fail silently. There's no exception thrown when a constraint is violated. No log entry. No warning. The model generates something that looks right, reads right, and quietly breaks two of your twelve rules in ways you'll only catch if you're specifically looking for them. Write your prompts assuming that, and you'll write better prompts.