I spent two days last month migrating a production extraction pipeline from GPT-4o to Claude. The prompts were clean. They'd been through three rounds of eval tuning. Every edge case was handled. Accuracy on GPT-4o: 99.39%.
On Claude, I got 68%.
Not because Claude is worse at extraction; it's often better. The prompts were simply written in GPT-ese: every structural choice, every formatting decision, every example placement had been optimized for how one specific model processes text. Switch the model and it's like handing a French contract to a brilliant German lawyer: the expertise is all there, but the grammar isn't.
This problem has a name now: model drifting. And a recent paper from the PromptBridge team quantified just how bad it gets.
The Portability Gap Is Real and It's Ugly
The PromptBridge researchers tested prompt transfer across seven LLMs — o3, o4-mini, GPT-4o, Llama 3.1 70B, Qwen3-32B, Gemma3-27B, and Llama 3.1 8B — across eight benchmarks spanning coding, software engineering, and planning tasks.
A prompt achieving 99.39% on GPT-5 dropped to 68.70% on Llama 3.1 70B when used directly. Llama's own optimal prompt hit 79.47% on the same task — an 11-point gap that comes purely from formatting mismatch, not model capability.
On SWE-Bench Verified, direct prompt transfer caused a 27% relative performance drop. On Terminal-Bench, 39%.
The pattern holds: the more complex the task, the worse naive transfer performs. Simple classification? You might survive. Multi-step reasoning, code generation, structured extraction? Expect significant breakage.
Each Model Speaks a Different Dialect
Why? Because each model family trained on different data distributions with different formatting conventions. Their instruction-following behaviors diverge in predictable ways once you know where to look.
Claude parses XML natively. Tags like <task>, <context>, <output_requirements> map to how Anthropic structures their own internal system prompts. Studies have found the model processes XML-tagged instructions substantially more accurately than equivalent markdown headers — roughly 23% better in one 2024 benchmark.
GPT-series models absorbed the internet's markdown. Headings, bullet lists, fenced code blocks — that's their native structure. They also respond well to redundant constraints and clearly labeled sections (### Instructions, ### Evaluation Criteria). Spelling things out twice in different ways tends to help rather than hurt.
Gemini sits in between, handling both XML-style tags and markdown reasonably well. The catch: it's particularly sensitive to consistency. Mixing XML and markdown within the same prompt degrades Gemini's performance more sharply than picking either format and committing to it.
These aren't arbitrary quirks. They emerge from training data. Claude saw XML-heavy data in its RLHF pipeline. GPT ate the web's markdown. The models literally internalized different grammars for conveying the same semantic content.
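To make the dialect difference concrete, here is the same semantic instruction rendered both ways. The tag and header names are illustrative conventions, not identifiers either vendor requires:

```python
# The same instruction in two "dialects". Claude-style XML vs. GPT-style
# markdown — the content is identical; only the formatting layer changes.

claude_style = """\
<task>
Summarize the report in three bullet points.
</task>
<output_requirements>
Plain text. No preamble.
</output_requirements>"""

gpt_style = """\
### Instructions
Summarize the report in three bullet points.

### Output Format
Plain text. No preamble."""

print(claude_style)
print(gpt_style)
```

Gemini's consistency sensitivity means you pick one of these layers and use it everywhere in the prompt, never both.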
What PromptBridge Actually Does
The framework attacks model drifting in two stages.
Stage 1 — Calibration. A system called MAP-RPE (Model-Adaptive Reflective Prompt Evolution) generates model-specific optimal prompts through reflective feedback loops. A separate reflection model analyzes past generations, scoring them on syntax validity (35% weight), entry-point verification (35%), risk-free pattern detection (20%), and absence of undesirable constructs (10%). Twenty calibration questions per task. Twenty evolution iterations with ten local refinement steps each. All gradient-free — no fine-tuning.
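As a rough sketch of how those reflection weights combine — the check names and example scores below are invented placeholders, not the paper's actual implementation:

```python
# Hypothetical sketch of MAP-RPE's weighted reflection scoring.
# Each criterion returns a score in [0, 1]; weights are from the text.

WEIGHTS = {
    "syntax_validity": 0.35,
    "entry_point_verified": 0.35,
    "risk_free_patterns": 0.20,
    "no_undesirable_constructs": 0.10,
}

def reflection_score(checks: dict) -> float:
    """Combine per-criterion scores into a single weighted total."""
    return sum(WEIGHTS[name] * score for name, score in checks.items())

# Example: a candidate prompt that passes everything except risk checks.
candidate = {
    "syntax_validity": 1.0,
    "entry_point_verified": 1.0,
    "risk_free_patterns": 0.5,
    "no_undesirable_constructs": 1.0,
}
print(round(reflection_score(candidate), 2))  # → 0.9
```

The score feeds back into the next evolution iteration, steering which prompt candidates survive local refinement.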
Stage 2 — Transfer. A Mapping Extractor analyzes source-target prompt pairs and distills transformation patterns — the systematic adjustments a prompt needs when moving from Model A to Model B. An adapter model applies those patterns to unseen tasks at inference time without retraining.
Results on HumanEval (GPT-4o → o3): 97.15% accuracy versus 92.27% with direct transfer. CodeContests: 56.36% versus 48.61%. And the system only needs twenty calibration examples to learn a model's dialect — calibrate once per target model, then transfer works across tasks.
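The adapter in PromptBridge is itself an LLM, but the flavor of a distilled "transformation pattern" can be sketched with plain rewrite rules. The rules below are invented examples for illustration, not patterns the paper actually extracts:

```python
import re

# Toy rewrite rules in the spirit of distilled transformation patterns:
# e.g., rules "learned" from GPT -> Claude prompt pairs that convert
# markdown sections into XML tags. Purely illustrative.
GPT_TO_CLAUDE_RULES = [
    (r"### Instructions\n(.*?)(?=\n### |\Z)",
     r"<instructions>\n\1\n</instructions>"),
    (r"### Output Format\n(.*?)(?=\n### |\Z)",
     r"<output_requirements>\n\1\n</output_requirements>"),
]

def apply_rules(prompt: str, rules) -> str:
    """Apply each (pattern, replacement) rule in order."""
    for pattern, repl in rules:
        prompt = re.sub(pattern, repl, prompt, flags=re.S)
    return prompt

source = "### Instructions\nReturn JSON only."
print(apply_rules(source, GPT_TO_CLAUDE_RULES))
```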
The Uncomfortable Reality for Everyone Else
PromptBridge is a research framework that uses GPT-5 as its mapping extractor. You're not pip-installing this into your CI pipeline next week.
For most teams, the practical situation remains: switch models, re-eval your prompts, re-tune them. No shortcut exists.
Some heuristics that reduce the pain:
Separate semantics from syntax. Structure your prompts into clear logical components — role, context, task, constraints, output format. The semantic skeleton can stay constant; only the formatting layer (XML tags vs. markdown headers vs. plain delimiters) needs to change per model.
Test on three models before you commit. If a prompt only works on one model, it's too dialect-specific. If it works on two of three with minor tweaks, you've found a portable core worth preserving.
Maintain a dialect cheat sheet. Claude gets XML tags. GPT gets markdown sections. Gemini gets one format used consistently. When migrating, swap the syntax layer and keep the semantics intact.
Budget real time for migration. Treat a model switch like a database migration, not a config change. A team moving critical paths from GPT-4o to Claude should allocate days for prompt re-optimization, not hours.
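A minimal sketch of what the semantics/syntax split can look like in practice — section names, renderers, and the dialect registry here are all invented for illustration:

```python
from typing import Callable

# The semantic skeleton: what the prompt says, with no formatting.
PROMPT_SPEC = [
    ("role", "You are a careful data-extraction assistant."),
    ("task", "Extract line items from the invoice below."),
    ("constraints", "Output valid JSON only. No commentary."),
]

def as_xml(spec) -> str:
    """Claude dialect: each section wrapped in an XML tag."""
    return "\n".join(f"<{k}>\n{v}\n</{k}>" for k, v in spec)

def as_markdown(spec) -> str:
    """GPT dialect: markdown headers with Title Case labels."""
    return "\n\n".join(f"### {k.title()}\n{v}" for k, v in spec)

# The dialect cheat sheet as code. For Gemini, either renderer works;
# the point is to pick one and apply it consistently.
DIALECTS: dict[str, Callable] = {
    "claude": as_xml,
    "gpt": as_markdown,
    "gemini": as_xml,
}

def compile_prompt(model_family: str) -> str:
    """Render the shared skeleton in the target model's dialect."""
    return DIALECTS[model_family](PROMPT_SPEC)

print(compile_prompt("claude"))
print(compile_prompt("gpt"))
```

Migrating then means swapping the renderer and re-running evals, while the semantic skeleton — and all the edge-case handling encoded in it — stays untouched.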
Where This Is Heading
The prompt portability gap reveals something the industry has been quietly ignoring: prompts are code, and code written for one runtime doesn't automatically execute on another. Nobody expects a Python script to run on the JVM without modification. Why do we expect a GPT prompt to run identically on Claude?
Tools are starting to address this. Flompt compiles prompts from a shared visual representation into model-specific formats. Microsoft's Prompty framework treats prompts as typed, portable assets. PromptBridge tackles it at the research level with automated dialect translation.
Until these tools mature, the burden stays on practitioners. Know your model's dialect. Test across runtimes. And stop treating prompts as write-once, run-anywhere artifacts. They never were.