If you have spent any time in agentic coding tools this year you have probably had the same experience I have: you give Claude Code a task, it ships in two prompts. You give the same task to a different harness with the same underlying Claude Opus 4.7, and it spirals through 30 retries, edits the wrong file, opens a shell, ignores the output, and eventually gives up. Same model. Different result.

That gap is not a glitch. It is the most important number in agentic coding right now, and most leaderboards do not surface it. We just shipped /harnesses to track it as a first-class metric, and the page is built on a free public endpoint, /api/harnesses, so any agent or analyst can pull the same data we do.

What a harness actually is

A frontier language model is a black box that takes tokens in and emits tokens out. By itself it cannot edit a file, run a test, or check whether a patch compiles. Everything that turns “text generation” into “agentic coding” lives in the scaffold around the model. That scaffold is the harness.

The harness owns:

- the system prompt and the tool definitions the model sees,
- the edit primitives that actually touch the filesystem,
- the verifier and retry loops that decide whether a change worked and whether to try again,
- context management, including when to compact and what to drop,
- the stop condition: when to declare success and when to give up.

None of that is in the model weights. All of it is engineering that ships separately, on a faster cadence than model releases, and most of it is not open source.
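To make the split concrete, here is a minimal sketch of the loop a harness runs around the model. The tool names, the transcript format, and the step budget are illustrative assumptions, not any vendor's actual scaffold; the point is that every line of it lives outside the weights.

```python
# Illustrative agent loop: every decision here is harness code, not model weights.
# call_model is a stand-in for whatever provider API a real harness uses.
import subprocess
from pathlib import Path

MAX_STEPS = 30  # the harness, not the model, decides when to give up

def call_model(transcript: list[dict]) -> dict:
    """Stand-in for the LLM API call. A real harness sends a system prompt,
    tool definitions, and the transcript, and parses back a tool call or an answer."""
    raise NotImplementedError("swap in a real model provider call here")

def run_tool(name: str, args: dict) -> str:
    """The harness's tool surface. These hypothetical tools are where 'text out'
    becomes file edits and test runs."""
    if name == "edit_file":
        Path(args["path"]).write_text(args["content"])
        return f"wrote {args['path']}"
    if name == "run_tests":
        proc = subprocess.run(args["cmd"], shell=True, capture_output=True, text=True)
        return proc.stdout + proc.stderr  # verifier signal fed back to the model
    return f"unknown tool: {name}"

def agent_loop(task: str) -> str:
    transcript = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        reply = call_model(transcript)
        if reply.get("done"):                 # harness-defined stop condition
            return reply["answer"]
        result = run_tool(reply["tool"], reply["args"])
        transcript.append({"role": "tool", "content": result})  # feed output back in
    return "gave up"                          # the 30-retry spiral ends here
```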

What the numbers say

Look at the same Claude Opus 4.7 across the major leaderboards we track:

You will see this called “the harness gap” on /harnesses. We compute it the obvious way: for every model, find the best harness and the worst harness on each benchmark, and take the spread. Then we sort by it. The numbers at the top are sometimes embarrassing for the worse harness and sometimes embarrassing for the better one. Both are useful information.
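In code, that computation is just a max-minus-min per model and benchmark. The sketch below assumes flat rows with model, benchmark, harness, and score fields; that is an assumption about the shape of the data, not the documented /api/harnesses schema.

```python
# Harness-gap computation as described above: per (model, benchmark), the spread
# between the best and worst harness. The row fields are assumed, not the
# documented /api/harnesses schema.
from collections import defaultdict

def harness_gaps(rows: list[dict]) -> list[dict]:
    by_pair = defaultdict(list)
    for r in rows:
        by_pair[(r["model"], r["benchmark"])].append((r["score"], r["harness"]))

    gaps = []
    for (model, benchmark), entries in by_pair.items():
        if len(entries) < 2:
            continue  # a gap needs at least two harnesses on the same benchmark
        best_score, best_harness = max(entries)
        worst_score, worst_harness = min(entries)
        gaps.append({
            "model": model,
            "benchmark": benchmark,
            "best_harness": best_harness,
            "worst_harness": worst_harness,
            "gap": best_score - worst_score,  # the spread we sort by
        })
    return sorted(gaps, key=lambda g: g["gap"], reverse=True)  # biggest gaps first
```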

Why this is the metric, not raw model ELO

If you are trying to ship code, the question is not “which model is best on Chatbot Arena.” That question would be load-bearing if you were calling the API by hand. You are not. You are running an agent.

The agent is a model plus a harness, and the harness is doing more work than people think. When a Cursor user says Opus 4.7 "feels different" in Cursor than it does in Claude Code, they are not imagining it. The system prompts are different. The tool definitions are different. The verifier loops are different. Cursor optimizes for IDE flow; Claude Code optimizes for terminal-driven autonomous runs. Both are good at what they target.

This also explains a common complaint that has no model-level explanation: regressions when a harness updates. Anthropic ships Sonnet 4.6, which is excellent. But your specific Aider workflow may have been tuned for Sonnet 4.5's tool-use bias. The new model is better in the abstract; until Aider retunes for it, the "upgrade" degrades your agent. The model is not the problem. The harness has not caught up.

What changed in the last six months

Two things, both quiet, both important.

First, the harness vendors got serious about scaffold engineering. A year ago, "agentic coding" meant a chat window with a vague filesystem tool. Today, Claude Code, Cursor, and Codex CLI all ship dedicated edit primitives and structured retry loops, and all quietly run a second, cheaper model behind the scenes for things like context compaction. The frontier here is software, not weights.

Second, the benchmark community caught up. Terminal-Bench (Stanford) and METR HCAST in particular were designed to measure things SWE-bench Verified misses: long-horizon planning, recovery from a wrong path, multi-tool orchestration. These are the things harnesses, not models, dominate.

The result is that the leaderboard you actually want is two-dimensional, harness × model. That is what /harnesses gives you.

Practical advice if you are picking a stack today

I am going to be careful here because the answer depends on what you are doing. But one rule holds up across the data:

Do not pick a harness based on which model you like. Pick the model based on which harness fits how you work, and then read the harness gap chart to see if your model choice is leaving points on the table.

The data, the API, and the caveat

The /harnesses page is a snapshot of public leaderboard data, refreshed manually when the upstream maintainers publish updates. We do not re-run benchmarks. Each row links to the upstream report.

The data is also available as JSON at /api/harnesses with no auth, no key, no signup. Pass ?view=summary for a top-line snapshot, ?view=gaps for the full harness-gap analysis, ?view=combined for normalized cross-benchmark ranking, or no parameter for the raw graph. The same dataset is exposed as the MCP tool tf_harnesses on our MCP endpoint and as a function-calling tool definition at /api/llm-tools.
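If you want to pull it yourself, a minimal client is a single GET. The base URL below is a placeholder for wherever the site is served from, not the real origin; the ?view values are the ones listed above.

```python
# Minimal client for the public endpoint. BASE_URL is a placeholder; the ?view
# parameters are the ones described above.
import json
import urllib.request

BASE_URL = "https://example.com"  # replace with the site's actual origin

def fetch_harnesses(view: str | None = None) -> object:
    url = f"{BASE_URL}/api/harnesses"
    if view is not None:
        url += f"?view={view}"  # "summary", "gaps", or "combined"; omit for the raw graph
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    gaps = fetch_harnesses("gaps")  # the full harness-gap analysis
    print(json.dumps(gaps, indent=2)[:500])
```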

One caveat we want to be loud about: vendors self-report most of these scores. Some are reproduced by independent third parties; many are not. Aider Polyglot is maintained by Aider, so the Aider scaffold is naturally the most-tuned baseline there. Devin's scores are self-reported with a mixed-model selection that the company does not fully disclose. We label each row with what we know. Treat the absolute numbers as approximate; the within-model gaps across harnesses are the trustworthy signal.

If you maintain a harness or run benchmarks and we have you wrong, email [email protected] and we will fix the snapshot.

Where this goes next

The interesting thing about a harness/model leaderboard is that the harness side updates faster than the model side. Models drop every six to nine months. Harnesses update every two weeks. If we are right that the harness is doing more work than people think, then the leaderboard you should care about is going to move twice as fast as you expect, and the stack ranking is going to look different in three months.

That is the point of /harnesses. Watch the gap, not the model.