Same model, different harnesses, very different results. The harness owns context curation, tool design, retry policy, and verifier integration, so Claude Opus 4.7 run inside Claude Code can score 20+ points higher on Terminal-Bench than the same model driven by Aider with default settings. This page tracks the harness gap as a first-class number. Snapshots are aggregated weekly from public leaderboards. We do not re-run benchmarks; each row links to the upstream report.
The harness is the agent scaffold that wraps a base language model: the system prompt, the tools, the file-edit format, the retry and verifier loop, the context-management strategy, and the way errors are surfaced back to the model.
GET /api/harnesses
- ?view=summary: top combined leaderboard and biggest harness gaps
- ?view=gaps: full harness-gap analysis
- ?view=combined: normalized cross-benchmark ranking
- no param: the raw benchmark graph
12-hour cache. CORS enabled. No auth, no key, no signup.
The same data is also exposed as an MCP tool, tf_harnesses, via /api/mcp, and as a function-calling tool definition at /api/llm-tools.
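The authoritative tool definition is whatever /api/llm-tools returns; the sketch below is a hypothetical guess at its shape, assuming the tool's one parameter mirrors the ?view query param documented above.

```python
import json

# Hypothetical sketch of the tf_harnesses tool definition; the real
# schema is served at /api/llm-tools and may differ.
tf_harnesses_tool = {
    "name": "tf_harnesses",
    "description": "Fetch model+harness benchmark snapshots and harness-gap rankings.",
    "parameters": {
        "type": "object",
        "properties": {
            "view": {
                "type": "string",
                "enum": ["summary", "gaps", "combined"],
                "description": "Optional aggregation view; omit for the raw benchmark graph.",
            }
        },
        "required": [],  # view is optional, matching the HTTP API
    },
}

print(json.dumps(tf_harnesses_tool, indent=2))
```

Passing this object in a chat completion's tool list would let a model request, say, `{"view": "gaps"}`, which the caller then maps onto the plain HTTP endpoint.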