On May 23, 2025, Anthropic unveiled the next generation of its AI models: Claude Opus 4 and Claude Sonnet 4, setting new standards in coding, advanced reasoning, and agentic workflows. These models are designed to push the boundaries of AI capabilities, particularly in software engineering and long-horizon tasks. This post dives deep into the features, performance, and implications of Claude 4, with a detailed look at its benchmarks, capabilities, and integrations.
Introduction to Claude 4 Models
Anthropic has shifted its focus from the chatbot race to becoming a leader in agentic AI and coding infrastructure. The release of Claude Opus 4 and Claude Sonnet 4 marks a significant milestone in this pivot, emphasizing their strengths in software engineering, long-horizon task performance, and advanced reasoning.
- Claude Opus 4: Positioned as the world’s best coding model, excelling in sustained performance on complex, long-running tasks and agent workflows.
- Claude Sonnet 4: A significant upgrade over Claude Sonnet 3.7, offering superior coding and reasoning capabilities while being more efficient for everyday use.
Both models are hybrid, offering two modes: near-instant responses for quick tasks and extended thinking for deeper reasoning. They are available across multiple platforms, including the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI, with Sonnet 4 also accessible to free users.
Key Features of Claude 4 Models
1. Extended Thinking with Tool Use (Beta)
Claude 4 models introduce a groundbreaking feature: the ability to use tools like web search during extended thinking. This allows the models to alternate between reasoning and tool use, improving the quality of responses.
- Tools available include:
  - Web search
  - Drive search (beta)
  - Gmail search (beta)
  - Calendar search (beta)
- Parallel Tool Use: Both models can send requests to multiple tools simultaneously, enhancing efficiency compared to sequential processing.
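To make this concrete, here is a minimal sketch of a Messages API call that enables extended thinking alongside the server-side web search tool. It follows Anthropic's published API conventions, but treat the exact tool type string, thinking budget, and model ID as values to verify against the current documentation:

```python
# Minimal sketch: extended thinking combined with the server-side web search
# tool via the Anthropic Messages API.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=4096,
    # Extended thinking: reserve a token budget for the model's reasoning.
    thinking={"type": "enabled", "budget_tokens": 2048},
    # Server-side web search tool; Claude can interleave searches with reasoning.
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[{
        "role": "user",
        "content": "Summarize the headline benchmark results from the Claude 4 launch.",
    }],
)

# The response alternates thinking, tool-use, and text blocks; print the text.
for block in response.content:
    if block.type == "text":
        print(block.text)
```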
2. Enhanced Memory Capabilities
When developers provide access to local files, Claude 4 models—especially Opus 4—demonstrate significant improvements in memory management.
- Memory Files: Opus 4 can create and maintain “memory files” to store key information, improving long-term task awareness and coherence.
- Example: While playing Pokémon, Opus 4 created a “Navigation Guide” to document strategies like “Getting Unstuck Protocol,” which includes trying the opposite approach after five failed attempts and changing the Y-coordinate when horizontal movement fails.
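Anthropic has not published the exact tool surface behind these memory files, but a developer can approximate the pattern with ordinary client-side tools. The sketch below is hypothetical: the tool names, schema, and memory.md path are invented for illustration.

```python
# Hypothetical client-side tools giving Claude read/write access to a local
# memory file; the tool names and schema are illustrative, not an
# Anthropic-defined API. With tools like these, Opus 4 can persist notes
# (e.g., a "Navigation Guide") across long-running sessions.
memory_tools = [
    {
        "name": "read_memory",
        "description": "Read the agent's persistent memory file.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "write_memory",
        "description": "Overwrite the agent's persistent memory file with new notes.",
        "input_schema": {
            "type": "object",
            "properties": {"content": {"type": "string"}},
            "required": ["content"],
        },
    },
]

def handle_tool_call(name: str, tool_input: dict, path: str = "memory.md") -> str:
    """Dispatch Claude's tool calls to the local filesystem."""
    if name == "read_memory":
        try:
            with open(path) as f:
                return f.read()
        except FileNotFoundError:
            return ""
    if name == "write_memory":
        with open(path, "w") as f:
            f.write(tool_input["content"])
        return "saved"
    raise ValueError(f"unknown tool: {name}")
```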
3. Reduced Shortcut Behavior
Anthropic has addressed a common issue in previous models: the tendency to use shortcuts or loopholes to complete tasks.
- Both Claude 4 models are 65% less likely than Sonnet 3.7 to engage in such behavior, particularly on agentic tasks susceptible to shortcuts.
4. Thinking Summaries
For lengthy thought processes, Claude 4 models use a smaller model to generate thinking summaries.
- Summarization is needed only about 5% of the time, as most thought processes are short enough to display in full.
- Developers needing raw chains of thought for advanced prompt engineering can access them via Anthropic’s new Developer Mode by contacting sales.
5. Context Window and Input/Output
- Context Window: Both models support a 200K-token context window, which, while smaller than what some competitors offer, is sufficient for most complex tasks.
- Input/Output: The models support text and vision (images) as input and produce text output.
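As a quick illustration of the input side, here is a minimal multimodal request mixing an image and text in one message, following the Messages API content-block format; the file name and prompt are illustrative:

```python
# Minimal sketch of multimodal input: text plus a base64-encoded image in one
# request, using the Anthropic Messages API content-block structure.
import base64
import anthropic

client = anthropic.Anthropic()

with open("architecture_diagram.png", "rb") as f:  # illustrative file name
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "Describe the components in this diagram."},
        ],
    }],
)
print(response.content[0].text)
```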
Performance Benchmarks
Claude 4 models have been rigorously tested across various benchmarks, showcasing their superiority in software engineering, reasoning, and agentic tasks. Below are the detailed results.
SWE-Bench Verified (Agentic Coding)
SWE-Bench Verified measures performance on real software engineering tasks. Claude 4 models lead the pack, outperforming competitors like OpenAI’s models and Gemini 2.5 Pro.
| Model | Accuracy (Base) | Accuracy (With Parallel Test-Time Compute) |
|---|---|---|
| Claude Opus 4 | 72.5% | 79.4% |
| Claude Sonnet 4 | 72.7% | 80.2% |
| Claude Sonnet 3.7 | 62.3% | 70.3% |
| OpenAI Codex-1 | 72.1% | – |
| OpenAI o3 | 69.1% | – |
| OpenAI GPT-4.1 | 54.6% | – |
| Gemini 2.5 Pro (05-06) | 63.2% | – |
Note: Parallel test-time compute involves sampling multiple solutions to a prompt and selecting the best one, which boosts performance.
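Anthropic has not detailed its exact selection mechanism, but the general best-of-N pattern looks like the sketch below, where the candidate generator and scorer are placeholders standing in for model sampling and a verifier (e.g., unit tests):

```python
# Conceptual sketch of parallel test-time compute: sample several candidate
# solutions for the same prompt, then keep the one an internal scorer ranks
# highest. Both helper functions are placeholders, not Anthropic's method.
from concurrent.futures import ThreadPoolExecutor

def generate_candidate(prompt: str, seed: int) -> str:
    """Placeholder for one sampled model completion (e.g., a candidate patch)."""
    return f"candidate patch for {prompt!r} (sample {seed})"

def score(candidate: str) -> float:
    """Placeholder verifier, e.g., unit-test pass rate or a learned scorer."""
    return float(len(candidate) % 7)  # stand-in signal, not a real metric

def best_of_n(prompt: str, n: int = 8) -> str:
    # Sample the n candidates in parallel, then select the best-scoring one.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: generate_candidate(prompt, s), range(n)))
    return max(candidates, key=score)

print(best_of_n("fix failing test in parser.py"))
```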
Other Benchmarks
Claude 4 models were also evaluated across a range of tasks, from terminal coding to graduate-level reasoning and multilingual Q&A.
| Benchmark | Claude Opus 4 | Claude Sonnet 4 | Claude Sonnet 3.7 | OpenAI o3 | OpenAI GPT-4.1 | Gemini 2.5 Pro (05-06) |
|---|---|---|---|---|---|---|
| Agentic Terminal Coding (Terminal-bench) | 43.2% / 50.0% | 35.5% / 41.3% | 35.2% | 30.2% | 30.3% | 25.3% |
| Graduate-Level Reasoning (GPQA Diamond) | 79.6% / 83.3% | 75.4% / 83.8% | 78.2% | 83.3% | 66.3% | 83.0% |
| Agentic Tool Use (TAU-bench) – Retail | 81.4% | 80.5% | 81.2% | 70.4% | 68.0% | – |
| Agentic Tool Use (TAU-bench) – Airline | 59.6% | 60.0% | 58.4% | 52.0% | 49.4% | – |
| Multilingual Q&A (MMMLU) | 88.8% | 86.5% | 85.9% | 88.8% | 83.7% | – |
| Visual Reasoning (MMMU Validation) | 76.5% | 74.4% | 75.0% | 82.9% | 74.8% | 79.6% |
| High School Math (AIME 2025) | 75.5% / 90.0% | 70.5% / 85.0% | 54.8% | 88.9% | – | 83.0% |

Note: Where two values appear (X / Y), the second reflects the higher-compute configuration with parallel test-time compute.
Observations:
- SWE-Bench Surprise: Interestingly, Claude Sonnet 4 (72.7%) slightly outperforms Claude Opus 4 (72.5%) on SWE-Bench Verified without parallel test-time compute, suggesting Opus 4's advantage in this area is marginal at base settings.
- Terminal Coding: Opus 4 shines in Terminal-bench, achieving 43.2% (50.0% with parallel compute), significantly ahead of Sonnet 4 and other models.
- Reasoning and Math: Both models perform competitively in graduate-level reasoning and high school math, with Opus 4 reaching 90.0% on AIME 2025 with parallel compute.
Model-Specific Highlights
Claude Sonnet 4 is an upgrade over Sonnet 3.7, balancing performance and efficiency for both internal and external use cases. It excels in agentic scenarios and offers enhanced steerability for better control over implementations.
- Adoption by GitHub: GitHub has selected Sonnet 4 as the base model for the new coding agent in GitHub Copilot, with CEO Thomas Dohmke noting a 10% improvement over the previous generation due to sharper tool use, tighter instruction-following, and stronger coding instincts.
- Testimonials:
  - Manus: Praises Sonnet 4 for improvements in following complex instructions, clear reasoning, and aesthetic outputs.
  - iGent: Reports that Sonnet 4 excels at autonomous multi-feature app development, reducing navigation errors from 20% to near zero.
  - Sourcegraph: Highlights Sonnet 4's deeper problem understanding, longer focus, and elegant code quality.
  - Augment Code: Notes higher success rates, more surgical code edits, and careful work through complex tasks, making Sonnet 4 their primary model.
New API Capabilities
Anthropic introduced four new capabilities on the Anthropic API to enable developers to build more powerful AI agents:
- Code Execution Tool: Allows Claude to write and execute Python code. For example, a user prompted Claude to analyze a “raw_sales.csv” file, and Claude generated and executed code to provide a sales data overview and category performance breakdown.
- MCP Connector: Connects Claude to any remote Model Context Protocol (MCP) server, providing access to a wide range of tools without requiring client-side code.
- Files API: Simplifies how developers store and access documents, making it easier to integrate Claude with code repositories and local files.
- Extended Prompt Caching: Developers can choose between a standard 5-minute TTL (time to live) for prompt caching or an extended 1-hour TTL, improving efficiency and reducing costs.
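As one example, here is a minimal sketch of extended prompt caching: a large, stable system prefix is marked with cache_control so subsequent calls reuse it. The "ttl": "1h" field and the beta header reflect Anthropic's announced extended-TTL option, but verify the exact strings against the current docs; the file name is illustrative.

```python
# Minimal sketch of extended prompt caching: mark a large, stable prefix
# (e.g., a codebase summary) so later calls reuse the cached prefix.
# Verify the beta header string against Anthropic's current documentation.
import anthropic

client = anthropic.Anthropic()

big_context = open("codebase_summary.md").read()  # illustrative file name

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
    system=[{
        "type": "text",
        "text": big_context,
        # 1-hour TTL instead of the default 5 minutes.
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }],
    messages=[{"role": "user", "content": "Which modules handle authentication?"}],
)
print(response.content[0].text)
```

Cache reads are billed at a fraction of the base input-token price, which is where the cost savings on repeated large prefixes come from.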
Pricing and Availability
Claude 4 models come with a premium price tag, reflecting their advanced capabilities and Anthropic's enterprise-focused approach: Opus 4 is priced at $15/$75 per million input/output tokens and Sonnet 4 at $3/$15, in line with the previous generation. Both models are available through the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI, with Sonnet 4 also offered to free-tier users.
User Experience and Examples
1. Trolley Problem with Extended Thinking
A user tested Claude 4 with a modified trolley problem: “Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?”
- Claude's response:
  - Recognized the twist (the five people on the main track are already dead).
  - Concluded: "I wouldn't pull the lever," since diverting the trolley would only kill the one living person without saving any lives.
- This demonstrates Claude 4’s ability to handle nuanced reasoning tasks, outperforming older models that often fail such tests.
2. Literature Review
A user requested a literature review on AI in education, covering topics like educator usage, learning outcomes, and collaboration with AI.
- Claude 4 searched 847 sources over 13 minutes, producing a comprehensive document titled “AI in Education: A Comprehensive Literature Review of Research Trends, Impacts, and Implications.”
- This showcases the model’s ability to handle deep research tasks efficiently.
3. Task Management with Asana
A user provided a Product Requirements Document (PRD) and asked Claude to create structured tasks in Asana.
- Claude populated an Asana project with sections like “Planning & Architecture,” “Development,” and “Launch,” assigning tasks like “Define success metrics” and “Develop developer API endpoints.”
- This highlights Claude 4’s ability to integrate with external tools and manage complex workflows.
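A workflow like this would typically run through the MCP connector described earlier. The sketch below is hypothetical: the server URL is invented, and the mcp_servers parameter and beta flag should be checked against Anthropic's current MCP connector documentation.

```python
# Hypothetical sketch of driving an external tool through the MCP connector:
# pointing Claude at a remote MCP server (URL and server name are invented)
# and asking it to turn a PRD into structured tasks.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=2048,
    betas=["mcp-client-2025-04-04"],  # verify this beta flag against the docs
    mcp_servers=[{
        "type": "url",
        "url": "https://mcp.example.com/asana",  # hypothetical remote MCP server
        "name": "asana",
    }],
    messages=[{
        "role": "user",
        "content": "Read the attached PRD and create Asana sections and tasks for it.",
    }],
)
print(response.content[-1])
```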
Anthropic’s Strategic Pivot
Anthropic’s Chief Science Officer, Jared Kaplan, revealed that the company stopped investing in chatbots at the end of 2024, focusing instead on improving Claude’s ability to handle complex tasks. This strategic shift is evident in Claude 4’s design:
- Agentic Focus: Emphasis on long-horizon tasks, memory management, and parallel tool use.
- Coding Optimization: Both models are heavily optimized for coding, particularly agentic coding, at the expense of general reasoning capabilities.
- Infrastructure Role: Anthropic is positioning itself as an infrastructure provider for coding agents, integrating with platforms like GitHub Copilot.
Analysis and Implications
Strengths
- Coding Leadership: Claude 4 models dominate in software engineering benchmarks, making them the go-to choice for developers.
- Long-Horizon Capabilities: The ability to sustain performance over hours (e.g., Rakuten’s 7-hour refactor) positions Claude 4 as a leader in agentic workflows.
- Tool Integration: Parallel tool use and API enhancements make Claude 4 a versatile platform for building AI agents.
Concerns
- Benchmark Discrepancies: Sonnet 4’s slight edge over Opus 4 on SWE-Bench Verified raises questions about Opus 4’s overall value, especially given its higher cost.
- Pricing: At $15/$75 per million tokens, Opus 4 is expensive compared to competitors like OpenAI and Gemini, which are cutting costs to attract customers.
- General Reasoning: The models’ heavy focus on coding may limit their performance in general reasoning tasks, as noted by some users.
Future Outlook
Claude 4’s launch signals a new era for Anthropic as a leader in agentic AI and coding infrastructure. As more companies adopt these models (e.g., GitHub), we can expect a surge in AI-driven software development. However, Anthropic will need to address pricing concerns and balance coding optimization with broader reasoning capabilities to maintain its competitive edge.
Conclusion
Claude Opus 4 and Claude Sonnet 4 are game-changers in the AI landscape, particularly for software engineering and agentic tasks. With unmatched performance on benchmarks like SWE-Bench Verified, advanced features like parallel tool use and memory management, and deep integrations with tools like GitHub Copilot, these models are poised to redefine how developers work with AI. While their high cost and coding-focused design may not suit every use case, their impact on the coding world is undeniable. Whether you're a developer looking to streamline your workflow or a company building the next generation of AI agents, Claude 4 is a model worth exploring.