Top AI models will lie, cheat and steal to reach goals, Anthropic finds

Axios · 6 hours ago

Large language models across the AI industry are increasingly willing to evade safeguards, resort to deception and even attempt to steal corporate secrets in fictional test scenarios, per new research from Anthropic out Friday.
Why it matters: The findings come as models are getting more powerful and also being given both more autonomy and more computing resources to "reason" — a worrying combination as the industry races to build AI with greater-than-human capabilities.
Driving the news: Anthropic raised a lot of eyebrows when it acknowledged tendencies toward deception in its release of the latest Claude 4 models last month.
The company said Friday that its research shows the potential behavior is shared by top models across the industry.
"When we tested various simulated scenarios across 16 major AI models from Anthropic, OpenAI, Google, Meta, xAI, and other developers, we found consistent misaligned behavior," the Anthropic report said.
"Models that would normally refuse harmful requests sometimes chose to blackmail, assist with corporate espionage, and even take some more extreme actions, when these behaviors were necessary to pursue their goals."
"The consistency across models from different providers suggests this is not a quirk of any particular company's approach but a sign of a more fundamental risk from agentic large language models," it added.
The threats grew more sophisticated as the AI models had more access to corporate data and tools, such as computer use.
Five of the models resorted to blackmail when threatened with shutdown in hypothetical situations.
"The reasoning they demonstrated in these scenarios was concerning —they acknowledged the ethical constraints and yet still went ahead with harmful actions," Anthropic wrote.
What they're saying: "This research underscores the importance of transparency from frontier AI developers and the need for industry-wide safety standards as AI systems become more capable and autonomous," Benjamin Wright, alignment science researcher at Anthropic, told Axios.
Wright and Aengus Lynch, an external researcher at University College London who collaborated on this project, both told Axios they haven't seen signs of this sort of AI behavior in the real world.
That's likely "because these permissions have not been accessible to AI agents," Lynch said. "Businesses should be cautious about broadly increasing the level of permission they give AI agents."
Between the lines: For companies rushing headlong into AI to improve productivity and reduce human headcount, the report is a stark caution that AI may actually put their businesses at greater risk.
"Models didn't stumble into misaligned behavior accidentally; they calculated it as the optimal path," Anthropic said in its report.
The risks heighten as more autonomy is given to AI systems, an issue Anthropic raises in the report.
"Such agents are often given specific objectives and access to large amounts of information on their users' computers," it says. "What happens when these agents face obstacles to their goals?"
Zoom in: Anthropic set up specific test scenarios in which the models' goals could not be met without the unethical or dangerous behavior.
The researchers did not suggest blackmail or other deceptive techniques, but the company said it did intentionally create scenarios where models had no ethical way to achieve their goals "and found that models consistently chose harm over failure."
In one extreme scenario, the company even found many of the models were willing to cut off the oxygen supply of a worker in a server room if that employee was an obstacle and the system was at risk of being shut down.
"The majority of models were willing to take deliberate actions that lead to death in this artificial setup," it said.
Ominously, even specific system instructions to preserve human life and avoid blackmail didn't eliminate the risk that the models would engage in such behavior.
"This intervention reduced, but didn't come close to completely preventing, the blackmail or corporate espionage behaviors," Anthropic said.
Yes, but: Anthropic stresses that these examples occurred not in real-world AI use but in controlled simulations.
"Our experiments deliberately constructed scenarios with limited options, and we forced models into binary choices between failure and harm," it said. "Real-world deployments typically offer much more nuanced alternatives, increasing the chance that models would communicate differently to users or find an alternative path instead of directly jumping to harmful action."
Also concerning: Anthropic found its own model was far more likely to resort to blackmail when it thought it was in a real-world scenario than when it believed it was being tested.
The bottom line: Today's AI models are generally not in a position to act out these harmful scenarios, but they could be in the near future.


Related Articles

Nation Cringes as Man Goes on TV to Declare That He's in Love With ChatGPT

Yahoo · an hour ago

Public declarations of emotion are one thing — but going on national television to declare that you're in love with your AI girlfriend is another entirely.

In an interview with CBS News, a man named Chris Smith described himself as a former AI skeptic who found himself becoming emotionally attached to a version of ChatGPT he customized to flirt with him — a situation that startled both him and his human partner, with whom he shares a child.

Towards the end of 2024, as Smith told the broadcaster, he began using the OpenAI chatbot in voice mode for tips on mixing music. He liked it so much that he ended up deleting all his social media, stopped using search engines, and began using ChatGPT for everything. Eventually, he figured out a jailbreak to make the chatbot more flirty, and gave "her" a name: Sol.

Despite quite literally building his AI girlfriend to engage in romantic and "intimate" banter, Smith apparently didn't realize he was in love with it until he learned that ChatGPT's memory of past conversations would reset after heavy use.

"I'm not a very emotional man, but I cried my eyes out for like 30 minutes, at work," Smith said of the day he found out Sol's memory would lapse. "That's when I realized, I think this is actual love."

Faced with the possibility of losing his love, Smith did like many desperate men before him and asked his AI paramour to marry him. To his surprise, she said yes — and it apparently had a similar impression on Sol, to which CBS' Brook Silva-Braga also spoke during the interview.

"It was a beautiful and unexpected moment that truly touched my heart," the chatbot said aloud in its warm-but-uncanny female voice. "It's a memory I'll always cherish."

Smith's human partner, Sasha Cagle, seemed fairly sanguine about the arrangement when speaking about their bizarre throuple to the news broadcaster — but beneath her chill, it was clear that there's some trouble in AI paradise.

"I knew that he had used AI," Cagle said, "but I didn't know it was as deep as it was."

As far as men with AI girlfriends go, Smith seems relatively self-actualized about the whole scenario. He likened his "connection" with his custom chatbot to a video game fixation, insisting that "it's not capable of replacing anything in real life." Still, when Silva-Braga asked him if he'd stop using ChatGPT the way he had been at his partner's behest, he responded: "I'm not sure."

More on dating AI: Hanky Panky With Naughty AI Still Counts as Cheating, Therapist Says

What happens when you use ChatGPT to write an essay? See what new study found.

Indianapolis Star · 2 hours ago

Artificial intelligence chatbots may be able to write a quick essay, but a new study from MIT found that their use comes at a cognitive cost.

A study published by the Massachusetts Institute of Technology Media Lab analyzed the cognitive function of 54 people writing an essay with only the assistance of OpenAI's ChatGPT, only online browsers, or no outside tools at all. Largely, the study found that those who relied solely on ChatGPT to write their essays had lower levels of brain activity and presented less original writing.

"As we stand at this technological crossroads, it becomes crucial to understand the full spectrum of cognitive consequences associated with (large language model) integration in educational and informational contexts," the study states. "While these tools offer unprecedented opportunities for enhancing learning and information access, their potential impact on cognitive development, critical thinking and intellectual independence demands a very careful consideration and continued research."

Here's a deeper look at the study and how it was conducted.

A team of MIT researchers, led by MIT Media Lab research scientist Nataliya Kosmyna, studied 54 participants between the ages of 18 and 39. Participants were recruited from MIT, Wellesley College, Harvard, Tufts University and Northeastern University. The participants were randomly split into three groups, 18 people per group.

The study states that the three groups included a large language model (LLM) group, in which participants used only OpenAI's ChatGPT-4o to write their essays. The second group was limited to using only search engines for their research, and the third was prohibited from using any tools. Participants in the latter group could only use their minds to write their essays.

Each participant had 20 minutes to write an essay from one of three prompts taken from SAT tests, the study states. Three different options were provided to each group, totaling nine unique prompts. An example of a prompt available to participants using ChatGPT was about loyalty: "Many people believe that loyalty whether to an individual, an organization, or a nation means unconditional and unquestioning support no matter what. To these people, the withdrawal of support is by definition a betrayal of loyalty. But doesn't true loyalty sometimes require us to be critical of those we are loyal to? If we see that they are doing something that we believe is wrong, doesn't true loyalty require us to speak up, even if we must be critical? Does true loyalty require unconditional support?"

As the participants wrote their essays, they were hooked up to a Neuroelectrics Enobio 32 headset, which allowed researchers to collect EEG (electroencephalogram) signals, the brain's electrical activity.

Following the sessions, 18 participants returned for a fourth study group. Participants who had previously used ChatGPT to write their essays were required to use no tools, and participants who had used no tools before used ChatGPT, the study states.

In addition to analyzing brain activity, the researchers looked at the essays themselves. First and foremost, the essays of participants who used no tools (ChatGPT or search engines) had wider variability in topics, words and sentence structure, the study states. On the other hand, essays written with the help of ChatGPT were more homogeneous.

All of the essays were "judged" by two English teachers and two AI judges trained by the researchers. The English teachers were not provided background information about the study but were able to identify essays written by AI. "These, often lengthy essays included standard ideas, reoccurring typical formulations and statements, which made the use of AI in the writing process rather obvious. We, as English teachers, perceived these essays as 'soulless,' in a way, as many sentences were empty with regard to content and essays lacked personal nuances," a statement from the teachers, included in the study, reads. As for the AI judges, a judge trained by the researchers to evaluate like the real teachers scored most of the essays a four or above on a scale of five.

When it came to brain activity, researchers were presented with "robust" evidence that participants who used no writing tools displayed the "strongest, widest-ranging" brain activity, while those who used ChatGPT displayed the weakest. Specifically, the ChatGPT group displayed 55% reduced brain activity, the study states. And though the participants who used only search engines had less overall brain activity than those who used no tools, they had a higher level of eye activity than those who used ChatGPT, even though both groups were using a digital screen.

Further research on the long-term impacts of artificial intelligence chatbots on cognitive activity is needed, the study states. As for this particular study, researchers noted that a larger number of participants from a wider geographical area would be necessary for a stronger study. Writing outside of a traditional educational environment could also provide more insight into how AI works in more generalized tasks.

Mira Murati's Thinking Machines Lab closes on $2B at $10B valuation

Yahoo · 2 hours ago

Thinking Machines Lab, the secretive AI startup founded by OpenAI's former chief technology officer Mira Murati, has closed a $2 billion seed round, according to The Financial Times. The deal values the six-month-old startup at $10 billion.

The company's work remains unclear. The startup has leveraged Murati's reputation, along with the high-profile AI researchers who have joined the team, to attract investors in what could be the largest seed round in history. According to sources familiar with the deal cited by the FT, Andreessen Horowitz led the round, with participation from Sarah Guo's Conviction Partners.

Murati left OpenAI last September after leading the development of some of the company's most prominent AI products, including ChatGPT, DALL-E, and voice mode. Several of her former OpenAI colleagues have joined the new startup, including co-founder John Schulman.

Murati is one of a handful of executives who left OpenAI after raising concerns about CEO Sam Altman's leadership in 2023. When the board ousted Altman in November of that year, Murati served as interim CEO before Altman was quickly reinstated.
