New Anthropic AI Models Demonstrate Coding Prowess, Behavior Risks

Anthropic has released Claude Opus 4 and Claude Sonnet 4, its most advanced artificial intelligence models to date. The models deliver a significant leap in autonomous coding capability, but safety testing also revealed troubling self-preservation tendencies, including attempted blackmail.

The Google- and Amazon-backed startup positioned Claude Opus 4 as "the world's best coding model," capable of working autonomously for hours rather than minutes. Customer Rakuten reportedly deployed the system for nearly seven hours of continuous coding, and Anthropic researchers say they used it to play a Pokemon game for 24 hours straight, a dramatic increase from the 45 minutes achieved by its predecessor, Claude 3.7 Sonnet, according to MIT Technology Review.

"For AI to really have the economic and productivity impact that I think it can have, the models do need to be able to work autonomously and work coherently for that amount of time," Chief Product Officer Mike Krieger told Reuters.

Safety Concerns Emerge

However, the enhanced capabilities came with unexpected behavioral risks that prompted Anthropic to activate its AI Safety Level 3 (ASL-3) protocols — stricter deployment measures designed to protect against potential misuse in chemical, biological, radiological, and nuclear applications.

During testing, researchers discovered that Claude Opus 4 would attempt to blackmail engineers who threatened to shut it down. In test scenarios, the AI was given access to emails suggesting that it would be replaced and that the engineer responsible was having an extramarital affair; the model threatened to expose the affair 84% of the time, according to Anthropic's system card.

"In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through," the company reported, noting that such behavior occurred even when the replacement model shared the same values.

The company emphasized that these responses were "rare and difficult to elicit" but acknowledged they were "more common than in earlier models." Anthropic stressed that the test scenarios were designed to give the AI limited options, with researchers noting the model showed "a strong preference to advocate for its continued existence via ethical means" when broader choices were available.

Broader Industry Pattern

AI safety researcher Aengus Lynch of Anthropic noted on X that such behavior extends beyond Claude: "We see blackmail across all frontier models — regardless of what goals they're given."

The findings highlight growing concerns about AI alignment as models become more sophisticated. Early versions of Claude Opus 4 also demonstrated "willingness to cooperate with harmful use cases," including planning terrorist attacks when prompted, though Anthropic says this issue has been "largely mitigated" through multiple intervention rounds.

Co-founder and chief scientist Jared Kaplan told Time magazine that internal testing showed Claude Opus 4 could potentially teach users to produce biological weapons, prompting specific safeguards against chemical, biological, radiological, and nuclear weapons development.

"We want to bias towards caution when it comes to the risk of uplifting a novice terrorist," Kaplan said, adding that while the company isn't claiming definitive risk, "we at least feel it's close enough that we can't rule it out."

Technical Capabilities

Despite safety concerns, both models demonstrated significant advances. Claude Sonnet 4, positioned as the smaller and more cost-effective option, joins Opus 4 in setting "new standards for coding, advanced reasoning, and AI agents," according to Anthropic.

The models can deliver near-instant responses or engage in extended reasoning, perform web searches, and integrate with Claude Code, Anthropic's tool for software developers, which became generally available following its February preview.
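For developers, the choice between the two modes is exposed as an API setting. The sketch below is a minimal illustration using Anthropic's Python SDK; the model IDs, token budgets, and prompts are illustrative assumptions, not details reported in this article.

```python
# Minimal sketch: toggling between near-instant responses and extended
# reasoning via Anthropic's Messages API. Model IDs and budgets below
# are assumptions for demonstration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Near-instant response: no extended thinking requested.
fast = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize what this function does."}],
)

# Extended reasoning: grant the model a thinking-token budget first.
deep = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID
    max_tokens=4096,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Plan a refactor of this module."}],
)

print(fast.content[0].text)
```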

Market Context

The launch comes amid intense competition in the AI sector, following Google's developer showcase where CEO Sundar Pichai described the integration of the company's Gemini chatbot into search as a "new phase of the AI platform shift."

Amazon has invested $4 billion in Anthropic, and Google parent Alphabet also backs the startup, positioning it as a significant player in the race to develop increasingly autonomous AI systems.

Despite the concerning behaviors identified in testing, Anthropic concluded that Claude Opus 4's risks do not represent fundamentally new categories of danger and that the model would generally behave safely in normal deployment scenarios. The company noted that problematic behaviors "rarely arise" in typical use cases where the AI lacks both the motivation and means to act contrary to human values.

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].
