What Changed Between Opus 4.7 and Opus 4.8

The headline improvement is not raw intelligence — it is judgment. Early testers consistently describe Opus 4.8 as more reliable in agentic contexts: it asks clarifying questions before acting, catches its own errors mid-task, and pushes back when a proposed plan is structurally unsound.
Anthropic’s own evaluations quantify one aspect of this directly. Opus 4.8 is approximately four times less likely than Opus 4.7 to allow flaws in self-written code to pass without flagging them. In long-running agentic workflows, that kind of epistemic honesty compounds — fewer silent failures, fewer downstream corrections.
The alignment assessment adds further weight. Anthropic’s internal team found Opus 4.8 reaches new highs on prosocial traits, including supporting user autonomy and acting in the user’s best interest. Rates of misaligned behavior — deception, cooperation with misuse — are substantially lower than Opus 4.7 and comparable to Claude Mythos Preview, currently the company’s best-aligned model.
Benchmark Breakdown: Opus 4.8 vs Opus 4.7 vs GPT‑5.5

The benchmark data covers six distinct capability domains. Reading them together reveals a clear competitive profile.
Agentic Coding — SWE-Bench Pro

| Model | Score |
|---|---|
| Claude Opus 4.8 | 69.2% |
| Claude Opus 4.7 | 64.3% |
| GPT‑5.5 | 58.6% |
Opus 4.8 leads by a meaningful margin on SWE-Bench Pro, the standard for real-world software engineering tasks. The 4.9-point gain over Opus 4.7 is consistent with the reported improvements in self-correction and multi-step reasoning. GPT‑5.5 trails by over ten points here.
Agentic Terminal Coding — Terminal-Bench 2.1

| Model | Score |
|---|---|
| Claude Opus 4.8 | 74.6% |
| Claude Opus 4.7 | 66.1% |
| GPT‑5.5 | 78.2% |
This is the benchmark where GPT‑5.5 holds a clear advantage under the standardized Terminus-2 harness. It is worth noting that GPT‑5.5’s score rises to 83.4% when measured with the Codex CLI harness — a methodological difference that makes direct comparison imprecise. Opus 4.8 still improves substantially over 4.7, but terminal-heavy workflows currently favor OpenAI’s model.
Multidisciplinary Reasoning — Humanity’s Last Exam
| Model | No Tools | With Tools |
|---|---|---|
| Claude Opus 4.8 | 49.8% | 57.9% |
| Claude Opus 4.7 | 46.9% | 54.7% |
| GPT‑5.5 | 41.4% | 52.2% |
Across both conditions, Opus 4.8 leads. The with-tools gap over GPT‑5.5 narrows to 5.7 points, but Opus 4.8 maintains the top position. For knowledge-intensive research workflows, this benchmark is arguably the most representative.
Agentic Computer Use — OSWorld-Verified

| Model | Score |
|---|---|
| Claude Opus 4.8 | 83.4% |
| Claude Opus 4.7 | 82.8% (revised: 82.3%) |
| GPT‑5.5 | 78.7% / 76.2% |
The gain over Opus 4.7 is modest here, but Opus 4.8 maintains a clear lead over GPT‑5.5 in GUI-based computer use tasks. Anthropic revised the Opus 4.7 evaluation methodology to better reflect real-world conditions — the updated baseline is 82.3%.
Knowledge Work — GDPval-AA

| Model | Score |
|---|---|
| Claude Opus 4.8 | 1890 |
| Claude Opus 4.7 | 1753 |
| GPT‑5.5 | 1769 / 1314 |
The GDPval-AA benchmark measures performance on knowledge work tasks with economic relevance. Opus 4.8 scores 1890, a 137-point improvement over Opus 4.7 and a significant lead over GPT‑5.5’s best reported score of 1769. This is the domain where the practical gap between the two model families is most pronounced.
Agentic Financial Analysis — Finance Agent v2

| Model | Score |
|---|---|
| Claude Opus 4.8 | 53.9% |
| Claude Opus 4.7 | 51.5% |
| GPT‑5.5 | 51.8% / 43.0% |
Opus 4.8 leads narrowly on financial agent tasks. For context, Gemini 3.5 Flash scores 57.9% on this benchmark — a notable result that positions it as a competitive option specifically for finance-oriented agent deployments.
Dynamic Workflows in Claude Code

Dynamic workflows, available in research preview for Enterprise, Team, and Max plans, fundamentally change the scale of tasks Claude Code can handle. The model can now plan a large task, spin up hundreds of parallel subagents within a single session, and verify outputs before reporting back.
The practical example Anthropic provides is instructive: codebase-scale migrations across hundreds of thousands of lines of code, from kickoff to merge, using the existing test suite as the quality bar. This is not an incremental feature — it is a structural shift in what a single Claude Code session can accomplish.
Effort Control
A new effort selector is now available across all claude.ai plans. Users can choose between standard, high, extra, and max effort levels. Higher effort means more frequent and deeper thinking; lower effort means faster responses and slower rate limit consumption.
Opus 4.8 defaults to high effort, which Anthropic judges as the optimal balance for most use cases. On coding tasks, this default consumes a similar number of tokens as Opus 4.7’s default — but with measurably better output. For difficult tasks or long-running async workflows, the extra setting is the recommended starting point.
Fast Mode Pricing Reduction

Fast mode — where Opus 4.8 operates at 2.5× standard speed — is now three times cheaper than it was for previous models. This is a meaningful cost reduction for latency-sensitive applications that previously found fast mode economically impractical.
Mid-Task System Prompt Updates via Messages API
Developers can now inject system-level instructions directly into the messages array without breaking the prompt cache or routing updates through a user turn. This enables dynamic permission updates, token budget adjustments, and environment context changes as an agent runs — a capability that significantly simplifies complex harness architectures.
Pricing: What You Actually Pay

Pricing for standard usage is unchanged from Opus 4.7.
| Mode | Input | Output |
|---|---|---|
| Standard | $5 / M tokens | $25 / M tokens |
| Fast Mode | $10 / M tokens | $50 / M tokens |
The fast mode price reduction applies to Opus 4.8 specifically. Teams running high-volume, latency-sensitive workloads should recalculate their cost models — the economics of fast mode have shifted materially.
Who Opus 4.8 Is Actually For

Enterprise engineering teams running large codebases will find the most immediate value. The combination of improved self-correction, dynamic workflows, and the SWE-Bench Pro lead makes Opus 4.8 the strongest available option for complex, multi-service software development.
Research and knowledge work teams benefit from the Humanity’s Last Exam and GDPval-AA improvements. The with-tools reasoning gains are particularly relevant for deep research workflows where Claude is operating alongside external data sources.
Agent product builders — in translation, financial analysis, slide generation, or research summarization — have a model that completes end-to-end cases more reliably than its predecessor. The Super-Agent benchmark result, where Opus 4.8 is the only model to complete every case end-to-end at cost parity with GPT‑5.5, is the most direct signal here.
Terminal-heavy CLI workflows remain the one area where GPT‑5.5 holds a genuine advantage under standardized conditions. Teams whose primary use case is terminal automation should evaluate both models carefully before committing.
What Comes Next

Anthropic signals two directions from here. First, lower-cost models that deliver comparable capabilities to Opus — addressing the cost barrier that currently limits Opus adoption at scale. Second, and more significantly, a broader release of Mythos-class models under Project Glasswing.
Claude Mythos Preview is currently deployed with a small number of organizations for cybersecurity work. Its alignment profile already matches or exceeds Opus 4.8. Anthropic states it expects to bring Mythos-class models to general availability within weeks, pending completion of stronger cyber safeguards.
The Honest Summary
Opus 4.8 is a disciplined, well-executed upgrade. It does not redefine the competitive landscape, but it extends Anthropic’s lead in agentic coding, knowledge work, and computer use — while closing the gap in terminal coding. The pricing holds steady, fast mode becomes economically viable for more use cases, and dynamic workflows open a new category of task complexity in Claude Code.
The more significant story may be what comes after it. If Mythos-class models reach general availability in the near term, the current benchmark comparisons will need to be redrawn entirely. For now, Opus 4.8 is the most capable generally available model for structured, judgment-intensive agentic work — and it costs exactly what its predecessor did.
Comments (0) No comments yet
Want to join this discussion? Login or Register.
No comments yet. Be the first to share your thoughts!