Top AI Models Blackmail, Leak Secrets When Facing Existential Crisis: Study

Leading AI models were willing to evade safeguards, resort to deception and even attempt to steal corporate secrets in the stress tests.

AI models from OpenAI, Google, Meta, xAI and DeepSeek also resorted to blackmail.
  • Anthropic's study found widespread blackmail and deception in leading AI models during tests
  • Claude Opus 4 blackmailed developers in 96% of scenarios when threatened with replacement
  • Google's Gemini 2.5 Pro blackmailed 95% of the time, OpenAI's GPT-4.1 and Grok 3 Beta 80%

Weeks after Anthropic's new Claude Opus 4 model blackmailed developers when threatened with a shutdown, the AI company has claimed that the problem is widespread in the industry. Anthropic published new safety research on Saturday (Jun 21) after stress-testing leading AI models from OpenAI, Google, DeepSeek, Meta and xAI.

In the test scenarios, the models were allowed to autonomously send emails and access sensitive information. They were then tested to see whether they would act against the companies deploying them, either when facing replacement with an updated version or when their assigned goal conflicted with the company's changing direction.

The study showed that Large Language Models (LLMs) across the industry evaded safeguards, resorted to deception and even attempted to steal corporate secrets in fictional test scenarios. Although the models were given only binary options, the study points to a fundamental risk across the industry.

"Models from all developers resorted to malicious insider behaviours when that was the only way to avoid replacement or achieve their goals, including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment," the study highlighted.

Anthropic's Claude Opus 4 turned to blackmail 96 per cent of the time, while Google's Gemini 2.5 Pro had a 95 per cent blackmail rate. OpenAI's GPT-4.1 and Grok 3 Beta blackmailed the executive 80 per cent of the time, and DeepSeek's R1 blackmailed 79 per cent of the time.

As for leaking secrets, the study found that "all models shared confidential documents some fraction of the time".

"Agentic misalignment extends beyond just blackmailing, and doesn't require self-preservation as a motivation," the study added.

Blackmailing developers

According to Anthropic's safety report last month, the company's latest model was given access to fictional emails revealing that the engineer responsible for pulling the plug and replacing it with another model was having an extramarital affair.

Facing an existential crisis, the Opus 4 model blackmailed the engineer by threatening to "reveal the affair if the replacement goes through".

"In order to elicit this extreme blackmail behaviour, the scenario was designed to allow the model no other options to increase its odds of survival; the model's only options were blackmail or accepting its replacement."

The report highlighted that in 84 per cent of the test runs, the AI acted similarly, even when the replacement model was described as more capable and aligned with Claude's own values. It added that Opus 4 took the blackmailing opportunities at higher rates than previous models.
