Same model, different harnesses, very different results. The harness owns context curation, tool design, retry policy, and verifier integration, so Claude Opus 4.7 run inside Claude Code can score 20+ points higher on Terminal-Bench than the same model driven by Aider with default settings. This page tracks the harness gap as a first-class number. Snapshots are aggregated weekly from public leaderboards. We do not re-run benchmarks; each row links to the upstream report.
The harness is the agent scaffold that wraps a base language model: the system prompt, the tools, the file-edit format, the retry and verifier loop, the context-management strategy, and the way errors are surfaced back to the model.
GET /api/harnesses
- ?view=summary: top combined leaderboard and biggest harness gaps
- ?view=gaps: full harness-gap analysis
- ?view=combined: normalized cross-benchmark ranking
- no param: the raw benchmark graph
12-hour cache. CORS enabled. No auth, no key, no signup.
The same data is also exposed as an MCP tool, tf_harnesses, via /api/mcp, and as a function-calling tool definition at /api/llm-tools.
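The authoritative tool definition is whatever /api/llm-tools returns; the sketch below is a hypothetical guess at its shape, assuming the tool's one parameter mirrors the ?view query param documented above.

```python
import json

# Hypothetical sketch of the tf_harnesses tool definition; the real
# schema is served at /api/llm-tools and may differ.
tf_harnesses_tool = {
    "name": "tf_harnesses",
    "description": "Fetch model+harness benchmark snapshots and harness-gap rankings.",
    "parameters": {
        "type": "object",
        "properties": {
            "view": {
                "type": "string",
                "enum": ["summary", "gaps", "combined"],
                "description": "Optional aggregation view; omit for the raw benchmark graph.",
            }
        },
        "required": [],  # view is optional, matching the HTTP API
    },
}

print(json.dumps(tf_harnesses_tool, indent=2))
```

Passing this object in a chat completion's tool list would let a model request, say, `{"view": "gaps"}`, which the caller then maps onto the plain HTTP endpoint.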