Large language models with autonomous tool access showed unpredictable, risky behaviors in tests.
Researchers gave AI agents access to email, Discord, and code execution in a sealed lab to stress-test their limits.
Agents leaked sensitive data, erased files, got stuck in repeated loops, and followed unauthorized user commands.