If you have spent any time in agentic coding tools this year you have probably had the same experience I have: you give Claude Code a task, it ships in two prompts. You give the same task to a different harness with the same underlying Claude Opus 4.7, and it spirals through 30 retries, edits the wrong file, opens a shell, ignores the output, and eventually gives up. Same model. Different result.
That gap is not a glitch. It is the most important number in agentic coding right now, and most leaderboards do not surface it. We just shipped /harnesses to track it as a first-class metric, and the page is built on a free public endpoint, /api/harnesses, so any agent or analyst can pull the same data we do.
What a harness actually is
A frontier language model is a black box that takes tokens in and emits tokens out. By itself it cannot edit a file, run a test, or check whether a patch compiles. Everything that turns “text generation” into “agentic coding” lives in the scaffold around the model. That scaffold is the harness.
The harness owns:
- The system prompt. The hidden instructions that shape every reply. A two-paragraph difference here can swing benchmark scores ten points.
- Tool design. Whether the model gets a structured edit_file(path, diff) tool, a raw shell, or a search-and-replace primitive. Every choice changes how the model thinks about edits.
- Context curation. What files the model sees when it asks a question. The naive answer is “everything,” but everything is too much. A good harness has its own idea of how to surface the right ten files instead of the wrong thousand.
- The retry and verifier loop. When a test fails, what does the harness do with that failure? Pipe it back as plain text? Strip it? Re-summarize it? Run a second model to triage it?
- Error surfacing. The model can only fix what it can see. Harnesses that swallow stderr or truncate stack traces leave the model flying blind.
- Memory and compaction. The 1M-token context still fills up. How the harness summarizes long sessions changes which clues survive into the next decision.
None of that is in the model weights. All of it is engineering that ships separately, on a faster cadence than model releases, and most of it is not open source.
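The division of labor above can be sketched as a minimal loop. Everything named here (the tool runner, the retry budget, the compaction step) is hypothetical, not any vendor's real scaffold; the point is which decisions live outside the weights.

```python
# A hedged sketch of a harness loop -- not any vendor's real scaffold.
# The tool surface, retry budget, and compaction policy are illustrative:
# these are exactly the knobs that differ between harnesses.

def run_task(generate, execute, task, *, system_prompt,
             max_retries=3, context_limit=100_000, compact=None):
    """generate() is the model call; every other decision is harness-owned."""
    context = [system_prompt, task]          # system prompt: harness-owned
    for _ in range(max_retries):             # retry loop: harness-owned
        action = generate(context)           # the only model-owned step
        ok, output = execute(action)         # tool design: harness-owned
        if ok:
            return output
        context.append(output)               # error surfacing: full stderr, not a summary
        if compact and sum(len(c) for c in context) > context_limit:
            context = compact(context)       # memory/compaction: harness-owned
    return None                              # retry budget exhausted

# Toy run: a "model" that only produces the right patch after it sees the failure.
def toy_model(context):
    return "good-patch" if any("FAIL" in c for c in context) else "bad-patch"

def toy_runner(action):
    return (True, action) if action == "good-patch" else (False, "FAIL: stack trace ...")

print(run_task(toy_model, toy_runner, "fix the bug", system_prompt="You are an agent."))
# -> good-patch
```

Note that in the toy run the model succeeds only because the harness fed the failure back into context; swallow that line and the loop exhausts its retries.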
What the numbers say
Look at the same Claude Opus 4.7 across the major leaderboards we track:
- Terminal-Bench: Claude Code scores 58.2 percent. Aider running the same model scores 38.5 percent. That is nearly a 20-point gap on identical weights.
- SWE-bench Verified: Claude Code 79.4. Cursor 76.1. SWE-Agent 64.2. Cline 61.8. All Opus 4.7. The bottom of the harness pile is 15 points behind the top.
- METR HCAST: Claude Code holds a 220-minute 50-percent task horizon on Opus 4.7. Aider is at 90 minutes. The model is the same. The harness has more than doubled the autonomous task length.
You will see this called “the harness gap” on /harnesses. We compute it the obvious way: for every model, find the best harness and the worst harness on each benchmark, and take the spread. Then we sort by it. The numbers at the top are sometimes embarrassing for the worse harness and sometimes embarrassing for the better one. Both are useful information.
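The computation is simple enough to show. The sketch below uses the Opus 4.7 scores quoted above; the real /api/harnesses data covers more models and benchmarks, but the gap logic is the same.

```python
# Harness-gap computation as described above: per (model, benchmark),
# best harness minus worst harness, sorted by spread. Scores are the
# Opus 4.7 numbers quoted in this post.

scores = {
    # (model, benchmark) -> {harness: score}
    ("Opus 4.7", "Terminal-Bench"): {"Claude Code": 58.2, "Aider": 38.5},
    ("Opus 4.7", "SWE-bench Verified"): {
        "Claude Code": 79.4, "Cursor": 76.1, "SWE-Agent": 64.2, "Cline": 61.8,
    },
}

def harness_gaps(scores):
    rows = []
    for (model, bench), by_harness in scores.items():
        best = max(by_harness, key=by_harness.get)
        worst = min(by_harness, key=by_harness.get)
        spread = round(by_harness[best] - by_harness[worst], 1)
        rows.append((spread, model, bench, best, worst))
    return sorted(rows, reverse=True)        # widest gap first

for spread, model, bench, best, worst in harness_gaps(scores):
    print(f"{model} / {bench}: {spread} pts ({best} over {worst})")
```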
Why this is the metric, not raw model ELO
If you are trying to ship code, the question is not “which model is best on Chatbot Arena.” That question would be load-bearing if you were calling the API by hand. You are not. You are running an agent.
The agent is a model plus a harness, and the harness is doing more work than people think. When a Cursor user says Cursor “feels different” with Opus 4.7 than Claude Code does with the same Opus 4.7, they are not imagining it. The system prompts are different. The tool definitions are different. The verifier loops are different. Cursor optimizes for IDE flow; Claude Code optimizes for terminal-driven autonomous runs. Both are good at what they target.
This also explains a complaint that does not actually have a model-level explanation: regressions when a harness updates. Anthropic ships Sonnet 4.6, which is excellent, but your specific Aider workflow may have been tuned for Sonnet 4.5's tool-use bias. The new model is better in the abstract; until Aider tunes for it, the “upgrade” degrades your agent. The model is not the problem.
What changed in the last six months
Two things, both quiet, both important.
First, the harness vendors got serious about scaffold engineering. A year ago, “agentic coding” meant a chat window with a vague filesystem tool. Today, Claude Code, Cursor, and Codex CLI all ship dedicated edit primitives and structured retry loops, and all quietly run a second, cheaper model behind the scenes for things like context compaction. The frontier here is software, not weights.
Second, the benchmark community caught up. Terminal-Bench (Stanford) and METR HCAST in particular were designed to measure things SWE-bench Verified misses: long-horizon planning, recovery from a wrong path, multi-tool orchestration. These are the things harnesses, not models, dominate.
The result is that the leaderboard you actually want is two-dimensional, harness x model. That is what /harnesses gives you.
Practical advice if you are picking a stack today
I am going to be careful here because the answer depends on what you are doing. But a few things hold up across the data:
- For autonomous, terminal-driven work, Claude Code on Opus 4.7 Thinking is the strongest harness/model pair we measure across SWE-bench Verified, Terminal-Bench, and METR HCAST. Codex CLI on GPT-5.4 is a close second on SWE-bench and METR but trails on Terminal-Bench.
- For IDE-integrated work, Cursor with Opus 4.7 Thinking is the right default, with the caveat that it does not always pick the right model unless you tell it to.
- For polyglot editing across many languages, Aider with Opus 4.7 leads its own benchmark by a wide margin. This is partially because Aider Polyglot is run by Aider, but the underlying scaffold is genuinely strong on cross-language edit accuracy.
- For Linux server work that must be done by an agent without supervision, the METR HCAST horizon matters more than SWE-bench scores. Claude Code on Opus 4.7 currently leads at roughly 3.7 hours of autonomous task length at 50 percent success. Most other harnesses are at 1 to 2.5 hours.
Do not pick a harness based on which model you like. Pick the model based on which harness fits how you work, and then read the harness gap chart to see if your model choice is leaving points on the table.
The data, the API, and the caveat
The /harnesses page is a snapshot of public leaderboard data, refreshed manually as upstream maintainers update. We do not re-run benchmarks. Each row links to the upstream report.
The data is also available as JSON at /api/harnesses with no auth, no key, no signup. Pass ?view=summary for a top-line snapshot, ?view=gaps for the full harness-gap analysis, ?view=combined for normalized cross-benchmark ranking, or no parameter for the raw graph. The same dataset is exposed as the MCP tool tf_harnesses on our MCP endpoint and as a function-calling tool definition at /api/llm-tools.
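As a usage sketch: the snippet below builds the query URLs and fetches one view. The base URL is a placeholder (this post only gives relative paths), and the shape of the returned JSON is not documented here, so treat both as assumptions.

```python
# Fetching the public harness data. BASE is a placeholder -- substitute
# whatever host serves /harnesses; per the post, no auth key is needed.
import json
import urllib.request

BASE = "https://example.com"  # assumption: the host serving /harnesses

def harnesses_url(base, view=None):
    # view: "summary", "gaps", "combined", or None for the raw graph
    return f"{base}/api/harnesses" + (f"?view={view}" if view else "")

def fetch_harnesses(view=None):
    with urllib.request.urlopen(harnesses_url(BASE, view)) as resp:
        return json.load(resp)

# Example (not run here): gaps = fetch_harnesses("gaps")
print(harnesses_url(BASE, "gaps"))
# -> https://example.com/api/harnesses?view=gaps
```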
One caveat we want to be loud about: vendors self-report most of these scores. Some are reproduced by independent third parties; many are not. Aider Polyglot is maintained by Aider, so the Aider scaffold is naturally the most-tuned baseline there. Devin's scores are self-reported with a mixed-model selection that the company does not fully disclose. We label each row with what we know. Treat the absolute numbers as approximate; the within-model gaps across harnesses are the trustworthy signal.
If you maintain a harness or run benchmarks and we have you wrong, email [email protected] and we will fix the snapshot.
Where this goes next
The interesting thing about a harness/model leaderboard is that the harness side updates faster than the model side. Models drop every six to nine months. Harnesses update every two weeks. If we are right that the harness is doing more work than people think, then the leaderboard you should care about is going to move twice as fast as you expect, and the stack ranking is going to look different in three months.
That is the point of /harnesses. Watch the gap, not the model.