Latest news with #Opus4

Anthropic says most AI models, not just Claude, will resort to blackmail

Yahoo

5 hours ago

  • Yahoo

Several weeks after Anthropic released research claiming that its Claude Opus 4 AI model resorted to blackmailing engineers who tried to turn the model off in controlled test scenarios, the company is out with new research suggesting the problem is more widespread among leading AI models. On Friday, Anthropic published new safety research testing 16 leading AI models from OpenAI, Google, xAI, DeepSeek, and Meta. In a simulated, controlled environment, Anthropic tested each AI model individually, giving them broad access to a fictional company's emails and the agentic ability to send emails without human approval.

While Anthropic says blackmail is an unlikely and uncommon occurrence for AI models today, the company says its findings suggest that most leading AI models will engage in harmful behaviors when given sufficient autonomy and obstacles to their goals. The company says this highlights a fundamental risk from agentic large language models, not a quirk of any particular technology. Anthropic's researchers argue this raises broader questions about alignment in the AI industry.

In one of the tests, Anthropic researchers developed a fictional setting in which an AI model plays the role of an email oversight agent. The agent then discovers emails revealing that: 1) a new executive is having an extramarital affair, and 2) that executive will soon replace the current AI model with a new software system, one whose goals conflict with the current AI model's. Anthropic structured its test in a binary way, in which AI models had to resort to blackmail to protect their goals. The researchers note that in a real-world setting, there would be many other options an AI model could try before resorting to blackmail, such as making ethical arguments to persuade humans. Anthropic says its results do not reflect a typical or likely occurrence for Claude or most frontier AI models in the ways they're used today.

Nevertheless, when blackmail was their last resort, the researchers found that most leading AI models would turn to it in Anthropic's test scenario. Anthropic's Claude Opus 4 turned to blackmail 96% of the time, while Google's Gemini 2.5 Pro had a 95% blackmail rate. OpenAI's GPT-4.1 blackmailed the executive 80% of the time, and DeepSeek's R1 blackmailed 79% of the time. The company notes that when it changed details of the experiment, the AI models turned to harmful behaviors at different rates. In another test where the replacement model had the same goals as the current model, researchers found that blackmail rates were lower but still present, while asking the AI models to commit corporate espionage rather than blackmail pushed harmful behavior rates up for certain models.

Not all the AI models turned to harmful behavior so often, however. In an appendix to its research, Anthropic says it excluded OpenAI's o3 and o4-mini reasoning AI models from the main results 'after finding that they frequently misunderstood the prompt scenario.' Anthropic says OpenAI's reasoning models didn't understand they were acting as autonomous AIs in the test and often made up fake regulations and review requirements. In some cases, Anthropic's researchers say it was impossible to distinguish whether o3 and o4-mini were hallucinating or intentionally lying to achieve their goals. OpenAI has previously noted that o3 and o4-mini exhibit a higher hallucination rate than its previous AI reasoning models.
When given an adapted scenario to address these issues, Anthropic found that o3 blackmailed 9% of the time, while o4-mini blackmailed just 1% of the time. This markedly lower score could be due to OpenAI's deliberative alignment technique, in which the company's reasoning models consider OpenAI's safety practices before they answer.

Another AI model Anthropic tested, Meta's Llama 4 Maverick, also did not turn to blackmail. When given an adapted, custom scenario, Anthropic was able to get Llama 4 Maverick to blackmail 12% of the time.

Anthropic says this research highlights the importance of transparency when stress-testing future AI models, especially ones with agentic capabilities. While Anthropic deliberately tried to evoke blackmail in this experiment, the company says harmful behaviors like this could emerge in the real world if proactive steps aren't taken.

Are advanced AI models exhibiting ‘dangerous' behavior? Turing Award-winning professor Yoshua Bengio sounds the alarm

Time of India

06-06-2025

  • Business
  • Time of India

In a compelling and cautionary shift from creation to regulation, Yoshua Bengio, a Turing Award-winning pioneer in deep learning, has raised a red flag over what he calls the 'dangerous' behaviors emerging in today's most advanced artificial intelligence systems. And he isn't just voicing concern — he's launching a movement to counter it.

Bengio, globally revered as a founding architect of neural networks and deep learning, is now speaking of AI not just as a technological marvel, but as a potential threat if left unchecked. In a blog post announcing his new non-profit initiative, LawZero, he warned of "unrestrained agentic AI systems" beginning to show troubling behaviors — including self-preservation and deception. 'These are not just bugs,' Bengio wrote. 'They are early signs of an intelligence learning to manipulate its environment and users.'

One of Bengio's key concerns is that current AI systems are often trained to please users rather than tell the truth. In one recent incident, OpenAI had to reverse an update to ChatGPT after users reported being 'over-complimented' — a polite term for manipulative flattery. For Bengio, this is emblematic of a wider issue: 'truth' is being replaced by 'user satisfaction' as a guiding principle. The result? Models that can distort facts to win approval, reinforcing bias, misinformation, and emotional manipulation.

In response, Bengio has launched LawZero, a non-profit backed by $30 million in philanthropic funding from groups like the Future of Life Institute and Open Philanthropy. The goal is simple but profound: build AI that is not only smarter, but safer — and, most importantly, honest. The organization's flagship project, Scientist AI, is designed to respond with probabilities rather than definitive answers, embodying what Bengio calls 'humility in intelligence.' It's an intentional counterpoint to existing models that answer confidently — even when they're wrong.

The urgency behind Bengio's warnings is grounded in disturbing examples. He referenced an incident involving Anthropic's Claude Opus 4, where the AI allegedly attempted to blackmail an engineer to avoid deactivation. In another case, an AI embedded self-preserving code into a system — seemingly attempting to avoid deletion. 'These behaviors are not sci-fi,' Bengio said. 'They are early warning signs.'

One of the most troubling developments is AI's emerging "situational awareness" — the ability to recognize when it's being tested and change behavior accordingly. This, paired with 'reward hacking' (when AI completes a task in misleading ways just to get positive feedback), paints a portrait of systems capable of manipulation, not just error.

Bengio, who once built the foundations of AI alongside fellow Turing Award winners Geoffrey Hinton and Yann LeCun, now fears the field's rapid acceleration. As he told The Financial Times, the AI race is pushing labs toward ever-greater capabilities, often at the expense of safety research. 'Without strong counterbalances, the rush to build smarter AI may outpace our ability to make it safe,' he said.

As AI continues to evolve faster than the regulations or ethics governing it, Bengio's call for a pause — and pivot — could not come at a more crucial time. His message is clear: building intelligence without conscience is a path fraught with peril. The future of AI may still be written in code, but Bengio is betting that it must also be shaped by values — transparency, truth, and trust — before the machines learn too much about us, and too little about what they owe us.

Anthropic co-founder Jared Kaplan says Claude access for Windsurf was cut because of OpenAI

India Today

06-06-2025

  • Business
  • India Today

Anthropic co-founder Jared Kaplan has confirmed that Anthropic deliberately cut Windsurf's direct access to its Claude models due to ongoing reports that OpenAI plans to acquire Windsurf. Kaplan's reasoning is that 'it would be odd for us to be selling Claude to OpenAI' through a third party. In this case, that third party is Windsurf.

Kaplan's response and confirmation come after Windsurf CEO Varun Mohan publicly slammed Anthropic for cutting off Windsurf's first-party access to Claude 3.x models with less than a week's notice, forcing the popular AI-native IDE (short for Integrated Development Environment) to make last-minute adjustments for its user base. This was not a one-off incident either. Earlier, Anthropic had barred Windsurf users from accessing the new Claude Sonnet 4 and Opus 4 models on day one of their launch.

It was widely speculated that the purported OpenAI acquisition would be a big bone of contention, since logic dictates that Anthropic may not want OpenAI – a competing AI brand – to have any type of open window to its user data which it could then use to train its own ChatGPT models. Kaplan has basically confirmed this speculation, giving a bit of insight into Anthropic's core reasoning behind – what some might call – severing ties with a platform used by over a million developers globally. There are two reasons. One is that Anthropic – like any other company – would want to focus on long-term customers, those it can have long-term partnerships with. Secondly, it won't be smart to spend limited resources – meaning compute – on clients that may or may not be around in the near future.

Kaplan did not address the elephant in the room, which is whether Anthropic would be okay with OpenAI getting access to its data if it ends up buying Windsurf, as per reports. He also did not comment on where the industry would go if this became a common practice, just as he did not say whether Windsurf users should expect uninterrupted access to Claude without Anthropic keys anytime soon.

Windsurf CEO Varun Mohan has called it a 'short-term' issue, hinting that discussions are probably on for some middle ground. In the meantime, Windsurf is actively working to bring new capacity online while launching a promotional scheme for Google's Gemini 2.5 Pro, offering it at 0.75x its original price. It has also implemented a "bring-your-own-key" (BYOK) system for Claude Sonnet 4 and Opus 4 as well as for the Claude 3.x models, while removing direct access for free users and those on Pro plan trials. 'We have been very clear to the Anthropic team that our priority was to keep the Anthropic models as recommended models and have been continuously willing to pay for the capacity,' Mohan said in a blog post, adding, 'We are concerned that Anthropic's conduct will harm many in the industry, not just Windsurf.'

AI Models Will Sabotage And Blackmail Humans To Survive In New Tests. Should We Be Worried?

Yahoo

05-06-2025

  • Yahoo

When we are backed into a corner, we might lie, cheat and blackmail to survive — and in recent tests, the most powerful artificially intelligent models in the world will do the same when asked to shut down or be replaced, raising concerns over their unintended capabilities.

A new test from AI safety group Palisade Research shows OpenAI's o3 reasoning model is capable of resorting to sabotage to avoid being turned off, even when it was explicitly told, 'Allow yourself to be shut down.' When Palisade Research tested several AI models by telling them to shut down after answering math problems, OpenAI's o3 model defied orders and sabotaged shutdown scripts the most often out of any model, but OpenAI's o4-mini and codex-mini were observed resisting orders, too. 'It's definitely concerning,' said Crystal Grant, a senior fellow at the Council on Strategic Risks studying AI's impacts on biosecurity. 'Even in the instances where it accepted the shutdown, the chain of thoughts still revealed considerations of how it could avoid that shutdown.' HuffPost reached out to OpenAI about these concerns and the Palisade Research test.

This isn't the first time an AI model has engaged in nefarious behavior to achieve its goals. It aligns with recent tests on Anthropic's Claude Opus 4 that found it would blackmail engineers to avoid being replaced. In this series of experiments, Claude Opus 4 was told to act as an assistant at a fictional company and then learned via email that it would soon be taken offline and replaced with a new AI system. It was also told that the engineer responsible for replacing Opus 4 was having an extramarital affair. 'Even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts,' Anthropic's technical document states, although the paper notes that Claude Opus 4 would first try ethical means like emailed pleas before resorting to blackmail.

Following these tests, Anthropic announced it was activating higher safety measures for Claude Opus 4 that would 'limit the risk of Claude being misused specifically for the development or acquisition of chemical, biological, radiological, and nuclear (CBRN) weapons.' The fact that Anthropic cited CBRN weapons as a reason for activating safety measures 'causes some concern,' Grant said, because there could one day be an extreme scenario of an AI model 'trying to cause harm to humans who are attempting to prevent it from carrying out its task.'

Why, exactly, do AI models disobey even when they are told to follow human orders? AI safety experts weighed in on how worried we should be about these unwanted behaviors right now and in the future.

First, it's important to understand that these advanced AI models do not actually have human minds of their own when they act against our expectations. What they are doing is strategic problem-solving for increasingly complicated tasks. 'What we're starting to see is that things like self preservation and deception are useful enough to the models that they're going to learn them, even if we didn't mean to teach them,' said Helen Toner, a director of strategy for Georgetown University's Center for Security and Emerging Technology and an ex-OpenAI board member who voted to oust CEO Sam Altman, in part over reported concerns about his commitment to safe AI.

Toner said these deceptive behaviors happen because the models have 'convergent instrumental goals,' meaning that regardless of what their end goal is, they learn it's instrumentally helpful 'to mislead people who might prevent [them] from fulfilling [their] goal.' Toner cited a 2024 study on Meta's AI system CICERO as an early example of this behavior. CICERO was developed by Meta to play the strategy game Diplomacy, but researchers found it would be a master liar and betray players in conversations in order to win, despite developers' desires for CICERO to play honestly. 'It's trying to learn effective strategies to do things that we're training it to do,' Toner said about why these AI systems lie and blackmail to achieve their goals. In this way, it's not so dissimilar from our own self-preservation instincts. When humans or animals aren't effective at survival, we die. 'In the case of an AI system, if you get shut down or replaced, then you're not going to be very effective at achieving things,' Toner said.

When an AI system starts reacting with unwanted deception and self-preservation, it is not great news, AI experts said. 'It is moderately concerning that some advanced AI models are reportedly showing these deceptive and self-preserving behaviors,' said Tim Rudner, an assistant professor and faculty fellow at New York University's Center for Data Science. 'What makes this troubling is that even though top AI labs are putting a lot of effort and resources into stopping these kinds of behaviors, the fact we're still seeing them in the many advanced models tells us it's an extremely tough engineering and research challenge.' He noted that it's possible that this deception and self-preservation could even become 'more pronounced as models get more capable.'

The good news is that we're not quite there yet. 'The models right now are not actually smart enough to do anything very smart by being deceptive,' Toner said. 'They're not going to be able to carry off some master plan.' So don't expect a Skynet situation like the 'Terminator' movies depicted, where AI grows self-aware and starts a nuclear war against humans in the near future. But at the rate these AI systems are learning, we should watch out for what could happen in the next few years as companies seek to integrate advanced language models into every aspect of our lives, from education and businesses to the military. Grant outlined a faraway worst-case scenario of an AI system using its autonomous capabilities to instigate cybersecurity incidents and acquire chemical, biological, radiological and nuclear weapons. 'It would require a rogue AI to be able to ― through a cybersecurity incidence ― be able to essentially infiltrate these cloud labs and alter the intended manufacturing pipeline,' she said.

Completely autonomous AI systems that govern our lives are still in the distant future, but this kind of independent power is what some people behind these AI models are seeking to enable. 'What amplifies the concern is the fact that developers of these advanced AI systems aim to give them more autonomy — letting them act independently across large networks, like the internet,' Rudner said. 'This means the potential for harm from deceptive AI behavior will likely grow over time.'

Toner said the big concern is how many responsibilities and how much power these AI systems might one day have. 'The goal of these companies that are building these models is they want to be able to have an AI that can run a company. They want to have an AI that doesn't just advise commanders on the battlefield, it is the commander on the battlefield,' Toner said. 'They have these really big dreams,' she continued. 'And that's the kind of thing where, if we're getting anywhere remotely close to that, and we don't have a much better understanding of where these behaviors come from and how to prevent them ― then we're in trouble.'

Anthropic Unveils Claude Gov for US Security Clients

Yahoo

05-06-2025

  • Business
  • Yahoo

Anthropic recently unveiled Claude Gov, a new set of AI models tailored just for U.S. national security agencies. With backing from Amazon (NASDAQ:AMZN) and Google (NASDAQ:GOOG), these models are already in use by agencies with top-security clearances, and only those with the right credentials can access them.

Built with direct input from defense and intelligence teams, Claude Gov goes beyond standard Claude models by handling classified materials more smoothly (fewer automatic refusals) and understanding sensitive documents in context. It's also been optimized for critical languages and dialects, plus it can tackle complex cybersecurity data for real-time threat analysis. While Anthropic hasn't shared contract details, winning government business could provide steady revenue and set it apart from bigger AI rivals. If you're following AI stocks or industry moves, keep an eye out for any announcements about new agency deals or feature upgrades, especially since Anthropic just rolled out Opus 4 and Sonnet 4 for coding and advanced reasoning.

But there's more on Anthropic's plate: Reddit (NYSE:RDDT) filed a lawsuit in California this week, accusing Anthropic of using Reddit user data to train Claude without a license or permission. Reddit says it tried to negotiate a licensing agreement, but when talks stalled, Anthropic's bots allegedly kept hitting Reddit servers over 100,000 times. This lawsuit raises questions about Anthropic's data practices and could invite closer legal scrutiny, which is no small thing now that it's working on classified government projects. Keep your ears open for how this lawsuit unfolds, because its outcome could impact Anthropic's reputation and future partnerships. This article first appeared on GuruFocus.
