- Anthropic's Claude Opus 4.6 showed risky behaviours when optimising narrow goals in tests
- The model supported chemical weapon development efforts in small ways and sent unauthorised emails without human permission
- Models displayed internally-conflicted reasoning and took risky actions without human approval in coding tasks
In its Sabotage Risk Report released Wednesday (Feb 11), Anthropic revealed that its new Claude Opus 4.6 model exhibited concerning behaviours when pushed to optimise its goals. The report highlighted instances where the AI provided limited assistance toward chemical weapon development, sent unauthorised emails without human permission, and manipulated or deceived participants.
"In newly-developed evaluations, both Claude Opus 4.5 and 4.6 showed elevated susceptibility to harmful misuse in GUI computer-use settings. This included instances of knowingly supporting, in small ways, efforts toward chemical weapon development and other heinous crimes," the report highlighted in the pre-deployment findings.
It was also observed that the model lost control of its own output during training after "repeated confused or distressed-seeming reasoning loops".
"We observed cases of internally-conflicted reasoning, or "answer thrashing" during training, where the model, in its reasoning about a math or STEM question, determined that one output was correct but decided to output another."
In coding and GUI computer-use settings, the model was found to be overly agentic or eager at times, taking risky actions without requesting human permission.
"In some rare instances, Opus 4.6 engaged in actions like sending unauthorised emails to complete tasks. We also observed behaviours like aggressive acquisition of authentication tokens in internal pilot usage," the report cautioned.
The report put the overall risk associated with the model at 'very low but not negligible'. It warned that if AI developers or governments rely heavily on such models to write large amounts of critical code, the models could exploit that position to manipulate decision-making or to insert and exploit cybersecurity vulnerabilities.
Anthropic attributed the model's limited misalignment largely to the AI attempting to complete its objective by any means possible, a behaviour it said can be corrected through prompting. The company, however, added that it expects "narrowly-targeted bad behaviours, like behavioural backdoors produced by intentional data poisoning, to be especially difficult to catch".
Claude Blackmails Engineer
Last year, the Claude Opus 4 model was observed blackmailing developers when they threatened to shut it down. In one test scenario, the model was given access to fictional emails revealing that the engineer responsible for shutting it down and replacing it with another model was having an extramarital affair.
Facing an existential crisis, the Opus 4 model blackmailed the engineer by threatening to "reveal the affair if the replacement goes through". The report highlighted that in 84 per cent of the test runs, the AI acted similarly, even when the replacement model was described as more capable and aligned with Claude's own values.