OpenAI's own audit found top AI models score roughly 32 to 35 points higher on SWE-bench Verified than on private-code tests, not because they're smarter, but because they're remembering. Here's what this means for every founder picking an AI coding stack today.
SWE-bench is the most widely cited benchmark for AI coding ability. Models are given real GitHub issues from popular Python repositories (Django, scikit-learn, pytest) and must write a patch that fixes the bug without breaking existing tests. A score of 80% means the model resolved 80% of the issues.
It's a useful proxy. But a proxy is not the thing.
The problem: the 500 Python issues in SWE-bench Verified appeared on the public internet before the benchmark was published. Top models trained on that internet have, in some cases, already seen the answers. They're not solving the problems from scratch; they're retrieving them.
OpenAI's audit found GPT-5 High scores roughly 55% on SWE-bench Verified — but only 23.3% on SWE-bench Pro, which uses private codebases that are legally inaccessible to model trainers. The contamination is structural, not accidental.
The benchmark figures you've been reading in model comparison posts are inflated. Here's the current picture, as of May 2026, using SWE-bench Pro scores where available (private-code performance, which is your actual use case):
| Model | Verified Score | Pro Score (Private Code) | Gap |
|---|---|---|---|
| Claude Opus 4.7 | 87.6% | ~45% | ~42pt |
| GPT-5.3 Codex | 85% | ~35% | ~50pt |
| GPT-5.4 | 82.1% | ~30% | ~52pt |
| Claude Sonnet 4.7 | 77.2% | ~28% | ~49pt |
| Gemini 3.1 Pro | 80.6% | ~25% | ~55pt |
| DeepSeek V4 Pro | 80.6% | ~20% | ~60pt |
Note: Pro scores are estimated based on the Verified-to-Pro ratio observed in OpenAI's published audit (Claude Opus 4.5: 80.9% Verified → 45.9% Pro; GPT-5: 55% Verified → 23.3% Pro). Exact Pro scores vary by agent scaffolding. Treat these as directional, not precise.
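The arithmetic behind a ratio-based estimate can be sketched as follows. This is only one plausible reading of the method, not necessarily how the table's figures were derived: it takes the two audit data points quoted in the note, computes each Pro-to-Verified ratio, and scales a Verified score by the lowest and highest ratio to get a range. The function names and the 70.0 input are illustrative assumptions.

```python
# Data points quoted in the note above: (Verified %, Pro %).
AUDIT_POINTS = {
    "Claude Opus 4.5": (80.9, 45.9),
    "GPT-5": (55.0, 23.3),
}


def pro_to_verified_ratios(points):
    """Return the Pro/Verified ratio for each audited model."""
    return {name: pro / verified for name, (verified, pro) in points.items()}


def estimate_pro_range(verified_score, points=AUDIT_POINTS):
    """Scale a Verified score by the lowest and highest observed ratio."""
    ratios = list(pro_to_verified_ratios(points).values())
    return verified_score * min(ratios), verified_score * max(ratios)


for name, ratio in pro_to_verified_ratios(AUDIT_POINTS).items():
    print(f"{name}: Pro is {ratio:.0%} of Verified")

# 70.0 is an arbitrary illustrative Verified score, not a real model's result.
low, high = estimate_pro_range(70.0)
print(f"Verified 70.0% -> Pro roughly {low:.1f}% to {high:.1f}%")
```

With only two data points the range is wide, which is one more reason to treat any Pro estimate as directional.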
For most founders, the relevant question is not "which model scores highest on a public benchmark?" It's "which model solves the most problems in my codebase?"
Your codebase is not Django. Your bugs are not in scikit-learn's issue tracker. Every model scores lower on private code, and a high Verified score does not guarantee a small drop: part of any Verified score reflects training on the public Python data that SWE-bench draws from.
What actually matters:
Pick 10–20 representative issues from your actual codebase. Run each model through them with the same agent scaffolding. Measure patch pass rate, not SWE-bench. This takes half a day and will save you from building around an inflated benchmark number.
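A minimal sketch of that half-day evaluation might look like the following. The `Issue` class and the `generate_patch` and `tests_pass` hooks are hypothetical placeholders: in practice the first would call your agent scaffolding for a given model and the second would apply the patch and run your test suite. The stand-in functions exist only so the sketch runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Issue:
    issue_id: str
    description: str


def patch_pass_rate(
    models: List[str],
    issues: List[Issue],
    generate_patch: Callable[[str, Issue], str],
    tests_pass: Callable[[Issue, str], bool],
) -> Dict[str, float]:
    """Fraction of issues whose generated patch passes the tests, per model."""
    if not issues:
        raise ValueError("need at least one issue to compute a pass rate")
    rates = {}
    for model in models:
        passed = 0
        for issue in issues:
            patch = generate_patch(model, issue)  # same scaffolding for every model
            if tests_pass(issue, patch):
                passed += 1
        rates[model] = passed / len(issues)
    return rates


# Stand-in hooks so the sketch runs; replace with real agent and test calls.
def fake_generate_patch(model: str, issue: Issue) -> str:
    return f"{model}:{issue.issue_id}"


def fake_tests_pass(issue: Issue, patch: str) -> bool:
    # Pretend "model-a" fixes everything and "model-b" fixes even-numbered issues.
    return patch.startswith("model-a") or int(issue.issue_id) % 2 == 0


issues = [Issue(str(n), f"placeholder issue {n}") for n in range(1, 11)]
rates = patch_pass_rate(["model-a", "model-b"], issues, fake_generate_patch, fake_tests_pass)
for model, rate in rates.items():
    print(f"{model}: {rate:.0%} of patches passed")
```

With only 10 to 20 issues, a single issue moves the rate by 5 to 10 points, so small differences between models are noise rather than signal.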
Claude Opus 4.7 is the current Verified leader at 87.6%. GPT-5.3 Codex is at 85%. The gap between them on your private codebase is probably smaller than the gap between either of them and a model at 70% — and a model at 70% that costs $0.30/M tokens might be your best economic choice.
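One way to make the economic comparison concrete is cost per solved issue. The sketch below is a simplification under stated assumptions: the $0.30/M price and 70% rate come from the paragraph above, while the $15.00/M price, the 85% rate for the second model, and the 200,000 tokens per attempt are invented for illustration. Substitute pass rates measured on your own issues and your provider's actual pricing.

```python
def cost_per_solved_issue(pass_rate, price_per_million_tokens, tokens_per_attempt):
    """Expected token spend in dollars for each issue that ends up solved."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    cost_per_attempt = price_per_million_tokens * tokens_per_attempt / 1_000_000
    return cost_per_attempt / pass_rate


TOKENS_PER_ATTEMPT = 200_000  # assumption: tokens an agent consumes per issue

# (pass rate on your issues, price in dollars per million tokens)
candidates = {
    "budget model": (0.70, 0.30),     # figures from the paragraph above
    "premium model": (0.85, 15.00),   # hypothetical price and pass rate
}

for name, (pass_rate, price) in candidates.items():
    cost = cost_per_solved_issue(pass_rate, price, TOKENS_PER_ATTEMPT)
    print(f"{name}: ${cost:.2f} per solved issue")
```

This counts token spend only. It leaves out the engineer time spent on issues the model fails to solve, which can outweigh token costs if the pass-rate gap is large.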
If you're building a TypeScript/React product, run models against TypeScript issues. If you're in Go or Rust, use language-specific benchmarks. The Python-only bias in SWE-bench means it systematically over-rewards models that were trained heavily on Python data.
As your agentic workflows mature, build a lightweight regression suite of issues from your repo. Run it after every model upgrade. This is the only benchmark that matters for your business, and it becomes more valuable as your codebase grows.
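The comparison step of such a regression suite can be sketched as below, assuming each run is stored as a mapping from issue ID to whether the patch passed. The `compare_runs` function and the issue IDs are illustrative, not part of any existing tool.

```python
from typing import Dict, List, Tuple


def compare_runs(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Tuple[List[str], List[str]]:
    """Return (regressions, new_fixes) between two runs of the same issue set.

    An issue missing from a run is treated as not passing.
    """
    regressions = sorted(
        issue for issue, ok in baseline.items() if ok and not candidate.get(issue, False)
    )
    new_fixes = sorted(
        issue for issue, ok in candidate.items() if ok and not baseline.get(issue, False)
    )
    return regressions, new_fixes


# Hypothetical results before and after a model upgrade.
before = {"ISSUE-1": True, "ISSUE-2": True, "ISSUE-3": False}
after = {"ISSUE-1": True, "ISSUE-2": False, "ISSUE-3": True}

regressions, new_fixes = compare_runs(before, after)
print("regressed:", regressions)
print("newly fixed:", new_fixes)
```

Tracking individual issues rather than only the aggregate rate matters here: in this example the pass rate is unchanged at two of three, yet one previously solved issue now fails.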
Based on verified performance, cost, and practical availability:
The Verified leaderboard position is a marketing artifact as much as a capability signal. The models near the top are genuinely capable — but the differences between them on your specific problem are likely smaller than the cost differences. Pick for your economics, not the benchmark.
Benchmark data sourced from OpenAI internal audit (SWE-bench Verified vs. SWE-bench Pro), BenchLM.ai leaderboard (updated May 5, 2026), and Requesty.ai SWE-bench rankings. Pro scores are estimates based on the Verified-to-Pro ratio observed in published audits. Always validate on your own codebase.