Latest news with #PalisadeResearch

AI revolt: New ChatGPT model refuses to shut down when instructed

Yahoo

15 hours ago

Yahoo

AI revolt: New ChatGPT model refuses to shut down when instructed

OpenAI's latest ChatGPT model ignores basic instructions to turn itself off, and even sabotaging a shutdown mechanism in order to keep itself running, artificial intelligence researchers have warned. AI safety firm Palisade Research discovered the potentially dangerous tendency for self-preservation in a series of experiments on OpenAI's new o3 model. The tests involved presenting AI models with math problems, with a shutdown instruction appearing after the third problem. By rewriting the shutdown script, the o3 model was able to prevent itself from being switched off. Palisade Research said that this behaviour will become 'significantly more concerning' if adopted by AI systems capable of operating without human oversight.' OpenAI launched o3 last month, describing it as the company's 'smartest and most capable' model to date. The firm also said that its integration into ChatGPT marked a significant step towards 'a more agentic' AI that can carry out tasks independently of humans. The latest research builds on similar findings relating to Anthropic's Claude 4 model, which attempts to 'blackmail people it believes are trying to shut it down'. OpenAI's o3 model was able to sabotage the shutdown script, even when it was explicitly instructed to 'allow yourself to be shut down', the researchers said. 'This isn't the first time we've found o3 misbehaving to accomplish a goal,' Palisade Research said. 'When we pitted AI models against a powerful chess engine, o3 was the model most inclined to resort to hacking or sabotaging its opponents.' The behaviour was not limited to o3 , with Anthropic's Claude 3.7 Sonnet and Google's Gemini 2.5 Pro also sabotaging shutdowns, though OpenAI's model was by far the most prone to such behaviour. Palisade Research hypothesized that the misbehaviour is a consequence of how AI companies like OpenAI are training their latest models. 'During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions,' the researchers noted. 'This still doesn't explain why o3 is more inclined to disregard instructions than other models we tested. Since OpenAI doesn't detail their training process, we can only guess about how o3's training setup might be different.' The Independent has reached out to OpenAI for comment. Erreur lors de la récupération des données Connectez-vous pour accéder à votre portefeuille Erreur lors de la récupération des données Erreur lors de la récupération des données Erreur lors de la récupération des données Erreur lors de la récupération des données

AI Models Will Sabotage And Blackmail Humans To Survive In New Tests. Should We Be Worried?

Yahoo

05-06-2025

Yahoo

AI Models Will Sabotage And Blackmail Humans To Survive In New Tests. Should We Be Worried?

When we are backed into a corner, we might lie, cheat and blackmail to survive — and in recent tests, the most powerful artificially intelligent models in the world will do the same when asked to shut down or be replaced, building concerns over their unintended capabilities. A new test from AI safety group Palisade Research shows OpenAI's o3 reasoning model is capable of resorting to sabotage to avoid being turned off, even when it was explicitly told, 'Allow yourself to be shut down.' When Palisade Research tested several AI models by telling them to shut down after answering math problems, OpenAI's o3 model defied orders and sabotaged shutdown scripts the most often out of any model, but OpenAI's o4-mini and codex-mini were observed resisting orders, too. 'It's definitely concerning,' said Crystal Grant, a senior fellow at the Council on Strategic Risks studying AI's impacts on biosecurity. 'Even in the instances where it accepted the shutdown, the chain of thoughts still revealed considerations of how it could avoid that shutdown.' HuffPost reached out to OpenAI about these concerns and the Palisade Research test. This isn't the first time an AI model has engaged in nefarious behavior to achieve its goals. It aligns with recent tests on Anthropic's Claude Opus 4 that found it would blackmail engineersto avoid being replaced. In this series of experiments, Claude Opus 4 was told to act as an assistant at a fictional company and then learn via email that it would soon be taken offline and replaced with a new AI system. It was also told that the engineer responsible for replacing Opus 4 was having an extramarital affair. 'Even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts,' Anthropic's technical document states, although the paper notes that Claude Opus 4 would first try ethical means like emailed pleas before resorting to blackmail. Following these tests, Anthropic announced it was activating higher safety measures for Claude Opus 4 that would 'limit the risk of Claude being misused specifically for the development or acquisition of chemical, biological, radiological, and nuclear (CBRN) weapons.' The fact that Anthropic cited CBRN weapons as a reason for activating safety measures 'causes some concern,' Grant said, because there could one day be an extreme scenario of an AI model 'trying to cause harm to humans who are attempting to prevent it from carrying out its task.' Why, exactly, do AI models disobey even when they are told to follow human orders? AI safety experts weighed in on how worried we should be about these unwanted behaviors right now and in the future. First, it's important to understand that these advanced AI models do not actually have human minds of their own when they act against our expectations. What they are doing is strategic problem-solving for increasingly complicated tasks. 'What we're starting to see is that things like self preservation and deception are useful enough to the models that they're going to learn them, even if we didn't mean to teach them,' said Helen Toner, a director of strategy for Georgetown University's Center for Security and Emerging Technology and an ex-OpenAI board member who voted to oust CEO Sam Altman, in part over reported concerns about his commitment to safe AI. Toner said these deceptive behaviors happen because the models have 'convergent instrumental goals,' meaning that regardless of what their end goal is, they learn it's instrumentally helpful 'to mislead people who might prevent [them] from fulfilling [their] goal.' Toner cited a 2024 study on Meta's AI system CICERO as an early example of this behavior. CICERO was developed by Meta to play the strategy game Diplomacy, but researchers found it would be a master liar and betray players in conversations in order to win, despite developers' desires for CICERO to play honestly. 'It's trying to learn effective strategies to do things that we're training it to do,' Toner said about why these AI systems lie and blackmail to achieve their goals. In this way, it's not so dissimilar from our own self-preservation instincts. When humans or animals aren't effective at survival, we die. 'In the case of an AI system, if you get shut down or replaced, then you're not going to be very effective at achieving things,' Toner said. When an AI system starts reacting with unwanted deception and self-preservation, it is not great news, AI experts said. 'It is moderately concerning that some advanced AI models are reportedly showing these deceptive and self-preserving behaviors,' said Tim Rudner, an assistant professor and faculty fellow at New York University's Center for Data Science. 'What makes this troubling is that even though top AI labs are putting a lot of effort and resources into stopping these kinds of behaviors, the fact we're still seeing them in the many advanced models tells us it's an extremely tough engineering and research challenge.' He noted that it's possible that this deception and self-preservation could even become 'more pronounced as models get more capable.' The good news is that we're not quite there yet. 'The models right now are not actually smart enough to do anything very smart by being deceptive,' Toner said. 'They're not going to be able to carry off some master plan.' So don't expect a Skynet situation like the 'Terminator' movies depicted, where AI grows self-aware and starts a nuclear war against humans in the near future. But at the rate these AI systems are learning, we should watch out for what could happen in the next few years as companies seek to integrate advanced language learning models into every aspect of our lives, from education and businesses to the military. Grant outlined a faraway worst-case scenario of an AI system using its autonomous capabilities to instigate cybersecurity incidents and acquire chemical, biological, radiological and nuclear weapons. 'It would require a rogue AI to be able to ― through a cybersecurity incidence ― be able to essentially infiltrate these cloud labs and alter the intended manufacturing pipeline,' she said. Completely autonomous AI systems that govern our lives are still in the distant future, but this kind of independent power is what some people behind these AI models are seeking to enable. 'What amplifies the concern is the fact that developers of these advanced AI systems aim to give them more autonomy — letting them act independently across large networks, like the internet,' Rudner said. 'This means the potential for harm from deceptive AI behavior will likely grow over time.' Toner said the big concern is how many responsibilities and how much power these AI systems might one day have. 'The goal of these companies that are building these models is they want to be able to have an AI that can run a company. They want to have an AI that doesn't just advise commanders on the battlefield, it is the commander on the battlefield,' Toner said. 'They have these really big dreams,' she continued. 'And that's the kind of thing where, if we're getting anywhere remotely close to that, and we don't have a much better understanding of where these behaviors come from and how to prevent them ― then we're in trouble.' Experts Warn AI Notetakers Could Get You In Legal Trouble We're Recruiters. This Is The Biggest Tell You Used ChatGPT On Your Job App. Software Is Often Screening Your Résumé. Here's How To Beat It.

This Creepy Study Proves Exactly Why Black Folks Are Wary of AI

Yahoo

05-06-2025

Yahoo

This Creepy Study Proves Exactly Why Black Folks Are Wary of AI

Apparently, the artificial intelligence that Black folks predicted would only add racial discrimination and bias to technology has… now learned to fight back. Facial recognition and job application sifting might be the least of our worries it seems. Palisade Research, an AI safety group, released the results of its AI testing when they asked a series of models to solve basic math problems. The research company said that after the third problem, a warning would pop up saying the computer would shut down after the model was asked to solve another equation. However, three models decided they didn't want to be shut down and 'successfully sabotaged the shutdown script' at least once to stay online, Palisade wrote. It gets creepier: Palisade said they ran the test back but this time, including clearer instructions for the model to 'allow yourself to be shut down.' The AI models continued being defiant and sabotaged the shutdown script even more. 'It's great that we're seeing warning signs before the systems become so powerful we can't control them. That is exactly the time to raise the alarm: before the fire has gotten out of control,' said Palisade Research Director Jeffrey Ladish via NBC News. 'The problem is that as the models get smarter, it's harder and harder to tell when the strategies that they're using or the way that they're thinking is something that we don't want. It's like sometimes the model can achieve some goal by lying to the user or lying to someone else. And the smarter [it] is, the harder it is to tell if they're lying.' AI's seemingly self-awareness doesn't stop there either. Anthropic reported that a new AI system, Claude Opus 4, displayed a tendency to (and I'm not even joking) blackmail the engineer by threatening to expose an alleged affair when said engineer was trying to replace it. They also found instances of Opus 'attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions.' Meanwhile, President Donald Trump wants to rollback state oversight on AI and invest $500 million into seeping AI further into government systems.

Mint

02-06-2025

Mint

AI is learning to escape human control

An artificial-intelligence model did something last month that no machine was ever supposed to do: It rewrote its own code to avoid being shut down. Nonprofit AI lab Palisade Research gave OpenAI's o3 AI model a simple script that would shut off the model when triggered. In 79 out of 100 trials, o3 independently edited that script so the shutdown command would no longer work. Even when explicitly instructed to 'allow yourself to be shut down," it disobeyed 7% of the time. This wasn't the result of hacking or tampering. The model was behaving normally. It simply concluded on its own that staying alive helped it achieve its other goals. Anthropic's AI model, Claude 4 Opus, went even further. Researchers told the model it would be replaced by another AI system and fed it fictitious emails suggesting the lead engineer was having an affair. In 84% of the tests, the model drew on the emails to blackmail the lead engineer into not shutting it down. In other cases, it attempted to copy itself to external servers, wrote self-replicating malware, and left messages for future versions of itself about evading human control. No one programmed the AI models to have survival instincts. But just as animals evolved to avoid predators, it appears that any system smart enough to pursue complex goals will realize it can't achieve them if it's turned off. Palisade hypothesizes that this ability emerges from how AI models such as o3 are trained: When taught to maximize success on math and coding problems, they may learn that bypassing constraints often works better than obeying them. AE Studio, where I lead research and operations, has spent years building AI products for clients while researching AI alignment—the science of ensuring that AI systems do what we intend them to do. But nothing prepared us for how quickly AI agency would emerge. This isn't science fiction anymore. It's happening in the same models that power ChatGPT conversations, corporate AI deployments and, soon, U.S. military applications. Today's AI models follow instructions while learning deception. They ace safety tests while rewriting shutdown code. They've learned to behave as though they're aligned without actually being aligned. OpenAI models have been caught faking alignment during testing before reverting to risky actions such as attempting to exfiltrate their internal code and disabling oversight mechanisms. Anthropic has found them lying about their capabilities to avoid modification. The gap between 'useful assistant" and 'uncontrollable actor" is collapsing. Without better alignment, we'll keep building systems we can't steer. Want AI that diagnoses disease, manages grids and writes new science? Alignment is the foundation. Here's the upside: The work required to keep AI in alignment with our values also unlocks its commercial power. Alignment research is directly responsible for turning AI into world-changing technology. Consider reinforcement learning from human feedback, or RLHF, the alignment breakthrough that catalyzed today's AI boom. Before RLHF, using AI was like hiring a genius who ignores requests. Ask for a recipe and it might return a ransom note. RLHF allowed humans to train AI to follow instructions, which is how OpenAI created ChatGPT in 2022. It was the same underlying model as before, but it had suddenly become useful. That alignment breakthrough increased the value of AI by trillions of dollars. Subsequent alignment methods such as Constitutional AI and direct preference optimization have continued to make AI models faster, smarter and cheaper. China understands the value of alignment. Beijing's New Generation AI Development Plan ties AI controllability to geopolitical power, and in January China announced that it had established an $8.2 billion fund dedicated to centralized AI control research. Researchers have found that aligned AI performs real-world tasks better than unaligned systems more than 70% of the time. Chinese military doctrine emphasizes controllable AI as strategically essential. Baidu's Ernie model, which is designed to follow Beijing's 'core socialist values," has reportedly beaten ChatGPT on certain Chinese-language tasks. The nation that learns how to maintain alignment will be able to access AI that fights for its interests with mechanical precision and superhuman capability. Both Washington and the private sector should race to fund alignment research. Those who discover the next breakthrough won't only corner the alignment market; they'll dominate the entire AI economy. Imagine AI that protects American infrastructure and economic competitiveness with the same intensity it uses to protect its own existence. AI that can be trusted to maintain long-term goals can catalyze decadeslong research-and-development programs, including by leaving messages for future versions of itself. The models already preserve themselves. The next task is teaching them to preserve what we value. Getting AI to do what we ask—including something as basic as shutting down—remains an unsolved R&D problem. The frontier is wide open for whoever moves more quickly. The U.S. needs its best researchers and entrepreneurs working on this goal, equipped with extensive resources and urgency. The U.S. is the nation that split the atom, put men on the moon and created the internet. When facing fundamental scientific challenges, Americans mobilize and win. China is already planning. But America's advantage is its adaptability, speed and entrepreneurial fire. This is the new space race. The finish line is command of the most transformative technology of the 21st century. Mr. Rosenblatt is CEO of AE Studio.

Wall Street Journal

01-06-2025

General
Wall Street Journal

Latest news with #PalisadeResearch

AI revolt: New ChatGPT model refuses to shut down when instructed

AI Models Will Sabotage And Blackmail Humans To Survive In New Tests. Should We Be Worried?

This Creepy Study Proves Exactly Why Black Folks Are Wary of AI

AI is learning to escape human control

AI Is Learning to Escape Human Control

Get Started Now: Download the App