Claude Opus 4.8 landed on May 28. The launch coverage did what launch coverage does: led with a benchmark number, called it state of the art, and moved on. The number is real and it is good. It is also not the interesting part. The interesting part is where the gains are moving, and they are moving off the model and into the harness.
Let me walk the numbers first, because that is the job, and then tell you what they mean for anyone actually shipping with an agent.
What actually shipped
The headline figures Anthropic published with the release:
- SWE-bench Verified: 88.6 percent, up from 87.6 on Opus 4.7. A one-point gain.
- SWE-bench Pro: 69.2 percent, up from 64.3. A near-five-point gain on the harder, less saturated benchmark.
- Terminal-Bench 2.1: 74.6 percent. Strong, but not the top. GPT-5.5 holds this one at 78.2.
- Pricing: 5 dollars per million input tokens, 25 per million output, with an optional faster mode that Anthropic prices lower for latency-sensitive and high-volume agent loops.
There are alignment and honesty gains too, which matter for agent reliability in ways benchmarks do not capture well, and a set of new platform features I will come back to because they are the actual story.
The one-point bump is not the headline you think it is
SWE-bench Verified is close to saturation. When a benchmark sits at 88 percent, a one-point gain is not the same animal as a one-point gain at 60 percent. The remaining problems are the hard tail: ambiguous issues, multi-file refactors, tests that encode assumptions the patch cannot infer. Moving the needle there is harder than the flat number suggests, and it tells you less about day-to-day performance than the marketing implies.
The more honest signal is SWE-bench Pro, the harder cut, where 4.8 gained close to five points. That is where there is still room to move, and movement there is more believable as real capability rather than benchmark fit.
And the clean-sweep narrative does not hold. On Terminal-Bench 2.1, GPT-5.5 is still ahead. We say this not to score points but because the whole value of a data desk is refusing to round a mixed result up to a win. Opus 4.8 is the strongest coding model we track on most axes. It does not lead every axis.
The real story is the harness, again
We wrote a whole piece on why the harness gap matters, and Opus 4.8 is the cleanest evidence for it yet. Look at what shipped alongside the weights: parallel subagent workflows in Claude Code, and mid-task system messages on the Messages API. Those are not model capabilities. Those are scaffold capabilities. They change how the agent plans, delegates, and recovers, and they would improve your results on the exact same weights.
That is the trend line. The frontier labs are increasingly shipping the harness and the model together, because they have worked out that the scaffold is doing more of the work than the leaderboard gives it credit for. The model got a point better. The harness around it got meaningfully better. Those are different products on different cadences, and conflating them is how you end up disappointed.
Here is the practical consequence, and it is counterintuitive: your agent will not get better the day you switch to 4.8. If you run Claude Code, you inherit the new scaffold and you probably will feel it. If you run a third-party harness that has not yet tuned for 4.8, you may feel nothing, or briefly feel a regression, while that harness catches up. The model is ahead of the tooling. The tooling catches up on its own clock.
What we changed on TerminalFeed
Our harness board now reflects Opus 4.8 on the Claude Code rows, marked provisional where a clean re-benchmark has not published yet, with the headline figures noted inline. The third-party harnesses still show 4.7, because they have not published a 4.8 run, and inventing one would defeat the point of the board.
We also did something we should have done earlier. The board now checks itself against a live model catalog and raises a flag the moment a newer flagship ships that it does not yet cover. This article exists partly because the old version of that board would have sat on 4.7 until a human noticed. It will not do that again. The same self-flagging now runs on the AI Leaderboard tile. If you ever see a ranking on this site that trails a release, that is a bug we want reported, and increasingly it is a bug the site will report on itself.
Practical advice if you ship with an agent
- Do not re-tool in a panic. The model is better. Your workflow is tuned for your harness, not for an abstract benchmark. Upgrade the model, keep the harness you know, and watch your own task outcomes rather than the launch chart.
- The cheaper fast mode is the more actionable change for most people than the one-point accuracy gain. If you run high-volume or latency-sensitive agent loops, the cost and speed delta will show up in your bill and your wall-clock before the accuracy delta shows up in your output.
- If you are on a third-party harness, give it two weeks. The harness vendors tune for new models fast now. The first few days after a model drop are the worst time to judge it through a scaffold that has not adapted.
- Watch the gap, not the model. The thing that determines whether you ship working code is still the harness wrapped around the weights.
The caveat we always make loud
Most of these numbers are vendor reported. Some are reproduced independently; many are not. Benchmark versions differ, so a 4.8 figure on Terminal-Bench 2.1 is not directly comparable to an older Terminal-Bench row, and we label that where it matters. Treat the absolute numbers as approximate. The trustworthy signal is the direction and the within-comparison gaps, not the third decimal place.
If we have a number wrong, email [email protected] and we will fix the snapshot. The same data is free and unauthenticated at /api/harnesses and /api/ai-leaderboard if you would rather check it yourself.