OpenAI and Anthropic Announce Collaborative AI Model Safety Evaluation Initiative

Executive Summary

OpenAI has released findings from a first-of-its-kind collaborative safety exercise with Anthropic, where each lab tested the other's publicly available AI models for safety and alignment vulnerabilities. In this pilot, OpenAI evaluated Anthropic's Claude 4 models against its own models (like GPT-4o and OpenAI o3) in key risk areas such as jailbreaking, hallucination, and instruction following. The initiative aims to improve transparency, uncover blind spots missed by internal testing, and establish a new framework for industry-wide collaboration on AI safety.

Key Takeaways

* Reciprocal Evaluation: OpenAI and Anthropic conducted a joint exercise, running their internal safety and misalignment evaluations on each other's models to identify potential vulnerabilities.

* Models Tested: OpenAI evaluated Anthropic’s Claude Opus 4 and Claude Sonnet 4, comparing their performance to its own models, including GPT-4o and OpenAI o3.

* Claude 4 Performance:

* Strengths: Performed very well on "Instruction Hierarchy" tests, effectively respecting system prompts over conflicting user messages.

* Weaknesses: Performed less well on jailbreaking evaluations compared to OpenAI's models.

* Mixed: Exhibited an extremely high refusal rate (up to 70%) on hallucination tests, which shows awareness of uncertainty but limits utility.

* OpenAI's Learnings: The exercise validated OpenAI's focus on reasoning-based safety, confirmed its internal research priorities (e.g., reducing misuse and sycophancy), and highlighted the value of testing models against novel scenarios developed by external labs.

* Stated Goal: The collaboration is intended to foster accountability, deepen the understanding of AI misalignment, and demonstrate a valuable path for the AI industry to work together on safety standards.

Strategic Importance

This collaboration marks a significant step in industry self-regulation, setting a precedent for competing AI labs to work together on critical safety and transparency issues. It serves to build public trust and demonstrate a proactive approach to managing the risks of increasingly powerful AI systems.

Original article