MiniMax M2.5 — an open-weight model you can run on your own hardware — hit 80.2% on SWE-bench Verified this month. Claude Opus 4.6 sits at 80.8%. Six months ago the gap between the best open-source and closed-source coding models was measured in double digits. Now it's noise.
Everyone's within three points
The April 2026 leaderboard looks like rush hour on the highway.
| Model | Type | SWE-bench | GPQA Diamond | Intelligence Index |
|---|---|---|---|---|
| Claude Opus 4.6 | Closed | 80.8% | ~88% | 53 |
| MiniMax M2.5 | Open | 80.2% | — | — |
| Claude Sonnet 4.6 | Closed | 79.6% | — | 52 |
| Gemini 3.1 Pro | Closed | 78.8% | 94.1% | 57 |
| GLM-5 | Open | 77.8% | — | — |
| GPT-5.4 | Closed | — | ~88% | 57 |
GPT-5.4 and Gemini 3.1 Pro Preview tie at 57 on the Artificial Analysis Intelligence Index. Claude Opus lands at 53. On GPQA Diamond, Gemini edges ahead at 94.1%, but the closed-source pack clusters within a few points of each other. Pick your favorite benchmark and you can crown a different winner each time.
Declaring a "best model" right now is like declaring the fastest car based on qualifying laps. The margins are smaller than the measurement error.
The gap that actually matters
Here's what benchmark convergence hides: the difference between a well-prompted and badly-prompted instance of the same model is far larger than the difference between models.
Take a concrete case. You want an LLM to review a pull request. Two approaches:
Prompt A (lazy):

```text
Review this PR and tell me if there are any issues.

[diff]
```
Prompt B (engineered):

```text
You are a senior backend engineer reviewing a PR for a
high-traffic payments service. Focus on: correctness of
error handling, potential race conditions, API contract
changes that could break consumers. Ignore style nits.

For each issue found, state:
- severity (blocking / warning / nit)
- the specific lines
- a concrete fix suggestion

If you find no blocking issues, say so in one sentence.

[diff]
```
The output quality gap between Prompt A and Prompt B, on the same model, is categorically larger than the gap between running Prompt B on Opus versus Sonnet. Or Gemini. Or GPT-5.4. You could swap any frontier model under Prompt B and get roughly equivalent value. Prompt A on any of them gives you generic, unfocused feedback that misses the things you actually care about.
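The discipline in Prompt B is mechanical enough to templatize, which is what keeps engineered prompts from drifting back to Prompt A over time. A minimal sketch of such a builder (the function and parameter names are illustrative, not from any real library):

```python
# Sketch of a reusable PR-review prompt builder.
# All names here are illustrative, not a real API.

def build_review_prompt(diff: str, service: str, focus: list[str]) -> str:
    """Assemble a structured review prompt in the shape of Prompt B."""
    focus_lines = "\n".join(f"- {item}" for item in focus)
    return (
        f"You are a senior backend engineer reviewing a PR for a {service}.\n"
        f"Focus on:\n{focus_lines}\n"
        "Ignore style nits.\n\n"
        "For each issue found, state:\n"
        "- severity (blocking / warning / nit)\n"
        "- the specific lines\n"
        "- a concrete fix suggestion\n\n"
        "If you find no blocking issues, say so in one sentence.\n\n"
        f"{diff}"
    )

prompt = build_review_prompt(
    diff="[diff]",
    service="high-traffic payments service",
    focus=[
        "correctness of error handling",
        "potential race conditions",
        "API contract changes that could break consumers",
    ],
)
```

The point of the template is not cleverness; it is that the role, focus areas, and output format are now versioned code instead of prose someone typed once into a chat box.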
Stanford and SambaNova's ACE paper, presented at ICLR 2026, put numbers behind this intuition. Their term is "context engineering" — not just the prompt text, but the full input window: system instructions, few-shot examples, retrieved documents, tool definitions, conversation history. Engineering the context alone drove 15-30% performance improvements across reasoning and coding tasks, without changing the model or its weights.
That's a bigger swing than switching from Sonnet to Opus. Or from Opus to GPT-5.4. Or from any model to any other model.
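The "full input window" framing becomes concrete if you model the context as explicit, inspectable parts rather than one concatenated string. A sketch under that assumption (the `Context` class and message shape are illustrative, loosely following the common role-based chat format, not any specific vendor SDK):

```python
# Sketch: context engineering treats the whole input window as a
# structured object. Class and field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Context:
    system: str                                            # system instructions
    few_shot: list[tuple[str, str]] = field(default_factory=list)  # (input, output)
    retrieved_docs: list[str] = field(default_factory=list)        # RAG material
    history: list[dict] = field(default_factory=list)              # prior turns

    def to_messages(self, user_prompt: str) -> list[dict]:
        """Flatten the structured context into a role-based message list."""
        messages = [{"role": "system", "content": self.system}]
        for question, answer in self.few_shot:
            messages.append({"role": "user", "content": question})
            messages.append({"role": "assistant", "content": answer})
        if self.retrieved_docs:
            docs = "\n\n".join(self.retrieved_docs)
            messages.append({"role": "user", "content": f"Reference material:\n{docs}"})
        messages.extend(self.history)
        messages.append({"role": "user", "content": user_prompt})
        return messages

ctx = Context(
    system="You review PRs for a high-traffic payments service.",
    few_shot=[("Is retry() safe here?", "blocking: retry() is not idempotent.")],
)
msgs = ctx.to_messages("Review this diff: [diff]")
```

Once the context is a typed object, each component (examples, retrieved docs, history) can be tuned and measured independently, which is exactly the lever the 15-30% figure is pulling on.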
Open-source makes the argument sharper
GLM-5.1 reportedly achieves 94.6% of Opus-level coding performance at roughly $3/month of compute versus $100-200/month in API spend. DeepSeek V3.2 delivers about 90% of GPT-5.4's output quality at 1/50th the token cost.
When an open-weight model on commodity GPUs gets you 95% of the way there, the remaining 5% from a closed API isn't "intelligence." It's infrastructure — latency, uptime, safety features, managed tooling. All legitimate reasons to pay. But they're ops arguments, not capability arguments.
Teams that have already built robust prompt infrastructure — typed tool schemas, structured context windows, adaptive few-shot selection — can swap their backend model like changing a database engine. Painful, but doable. Teams that haven't are locked in. Not by contracts, but by dependence on the model compensating for prompts that never got past the prototype stage.
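What "swap the backend like a database engine" looks like in practice is a thin interface between prompt infrastructure and model provider. A minimal sketch, assuming placeholder backends (neither class below is a real SDK client):

```python
# Sketch: model-agnostic prompt infrastructure behind a thin interface.
# Both backend classes are placeholders, not real vendor SDKs.
from typing import Protocol

class ChatBackend(Protocol):
    def complete(self, messages: list[dict]) -> str: ...

class OpenWeightBackend:
    """Stand-in for a self-hosted open-weight model server."""
    def complete(self, messages: list[dict]) -> str:
        return f"[open-weight reply to {len(messages)} messages]"

class ClosedAPIBackend:
    """Stand-in for a closed-model API client."""
    def complete(self, messages: list[dict]) -> str:
        return f"[closed-API reply to {len(messages)} messages]"

def review_pr(backend: ChatBackend, diff: str) -> str:
    """All prompt logic lives here, independent of the provider."""
    messages = [
        {"role": "system", "content": "You are a senior backend engineer."},
        {"role": "user", "content": f"Review this PR:\n{diff}"},
    ]
    return backend.complete(messages)

# The calling code never changes when the backend does:
open_result = review_pr(OpenWeightBackend(), "[diff]")
closed_result = review_pr(ClosedAPIBackend(), "[diff]")
```

The prompt logic, the valuable asset, lives entirely above the interface; swapping providers touches one constructor, not the prompts.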
The uncomfortable question
If prompting skill determines more of your output quality than model choice, then the highest-ROI investment for most engineering orgs isn't upgrading to GPT-5.4 Pro (xhigh) at 3x the token cost. It's auditing the prompts they already have.
Most production systems I've encountered run prompts written once during a prototype sprint and never revisited. The system prompt is three sentences. The few-shot examples are from six months ago. The tool definitions are copy-pasted from documentation with zero optimization. These systems are leaving more performance on the table through prompting neglect than they'd gain from any model upgrade.
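The neglect described above is detectable mechanically. A sketch of a first-pass prompt audit, with heuristics that are purely illustrative (the thresholds and checks are assumptions, not an established standard):

```python
# Sketch: mechanical first-pass audit of a production prompt.
# Heuristics and thresholds are illustrative, not a standard.
def audit_prompt(system_prompt: str, few_shot: list,
                 min_system_len: int = 200) -> list[str]:
    """Return a list of findings for a system prompt and its examples."""
    findings = []
    if len(system_prompt) < min_system_len:
        findings.append("system prompt is very short; likely underspecified")
    if not few_shot:
        findings.append("no few-shot examples provided")
    if "you are" not in system_prompt.lower():
        findings.append("no role assignment in system prompt")
    return findings

# Auditing the lazy Prompt A from earlier surfaces all three issues:
findings = audit_prompt(
    "Review this PR and tell me if there are any issues.", few_shot=[]
)
```

Checks like these are crude, but they turn "revisit the prompts" from a vague intention into a lintable property of the codebase.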
The benchmark war is over. Everyone won. The prompting war hasn't started for most teams — and the ones who've been fighting it quietly have a lead that no model swap can close.