
Poetry And Deception: Secrets Of Anthropic's Claude 3.5 Haiku AI Model
Anthropic recently published two research papers that offer surprising insights into how an AI model 'thinks.' The first builds on Anthropic's earlier work linking human-understandable concepts to the internal pathways of large language models (LLMs) in order to explain how model outputs are generated. The second examines how Anthropic's Claude 3.5 Haiku model handled simple tasks spanning ten model behaviors.
Together, the papers offer valuable information about how AI models work: by no means a complete understanding, but at least a glimpse. Let's dig into what we can learn from that glimpse, including some concerns about AI safety that may be minor but are still important.
LLMs such as Claude aren't programmed like traditional computers. Instead, they are trained on massive amounts of data. That process produces models that behave like black boxes, obscuring how they manage to generate insightful information on almost any subject. The black-box quality isn't an architectural choice, however; it is simply a consequence of how this complex, nonlinear technology operates.
The neural networks inside an LLM transform data into useful information through billions of parameters, connections and computational pathways. Each parameter interacts nonlinearly with the others, creating complexity that is nearly impossible to untangle. As Anthropic puts it, 'This means that we don't understand how models do most of the things they do.'
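To make that nonlinearity concrete, here is a minimal sketch, not drawn from Anthropic's work, of a tiny two-layer network in which the effect of changing one weight depends on the values of all the others; scaled up to billions of parameters, this entanglement is what makes the models so hard to interpret.

```python
# Toy illustration (not Anthropic's code): in even a tiny two-layer network,
# the effect of one weight on the output depends on the other weights and on
# which nonlinear units happen to be active for a given input.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # a small input vector
W1 = rng.normal(size=(8, 4))      # first-layer weights
W2 = rng.normal(size=(1, 8))      # second-layer weights

def forward(W1, W2, x):
    hidden = np.maximum(0.0, W1 @ x)   # ReLU nonlinearity
    return (W2 @ hidden).item()

baseline = forward(W1, W2, x)

# Nudge a single first-layer weight; the resulting output shift is mediated
# by the second-layer weights and by which hidden units are switched on.
W1_nudged = W1.copy()
W1_nudged[0, 0] += 0.1
print("output shift:", forward(W1_nudged, W2, x) - baseline)
```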
Anthropic follows a two-step approach to LLM research. First, it identifies features, which are interpretable building blocks that the model uses in its computations. Second, it describes the internal processes, or circuits, by which features interact to produce model outputs. Because of the model's complexity, Anthropic's new research could illuminate only a fraction of the LLM's inner workings. But what was revealed about these models seemed more like science fiction than real science.
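To give a rough sense of what that two-step approach looks like in practice, here is a minimal, hypothetical sketch using random data. The "features" and the attribution "edge" below are stand-ins for the learned feature dictionaries and attribution graphs Anthropic describes; none of the names or numbers come from its actual tooling.

```python
# Minimal sketch of the two-step idea described above, using made-up names
# and random data purely for illustration. Anthropic's real methods
# (learned feature dictionaries, attribution graphs) are far more involved.
import numpy as np

rng = np.random.default_rng(1)

# Step 1: "features", i.e., interpretable directions in the model's
# activation space. Here we pretend a dictionary of feature directions has
# already been learned (e.g., by a sparse autoencoder) for two layers.
d_model, n_features = 16, 32
features_layer1 = rng.normal(size=(n_features, d_model))
features_layer2 = rng.normal(size=(n_features, d_model))

def feature_activations(residual, feature_dirs):
    # Project the activation vector onto each feature direction and keep
    # only positive matches, mimicking sparse feature activations.
    return np.maximum(0.0, feature_dirs @ residual)

def layer(residual, W):
    # Stand-in for one transformer block: a simple nonlinear map.
    return np.tanh(W @ residual)

W_block = rng.normal(size=(d_model, d_model)) * 0.3
residual_in = rng.normal(size=d_model)

# Step 2: a crude attribution "edge": how much does suppressing one
# upstream feature change the downstream features' activations?
acts_in = feature_activations(residual_in, features_layer1)
acts_out = feature_activations(layer(residual_in, W_block), features_layer2)

source = int(np.argmax(acts_in))              # strongest upstream feature
direction = features_layer1[source]
direction = direction / np.linalg.norm(direction)
ablated = residual_in - (residual_in @ direction) * direction  # remove it

acts_out_ablated = feature_activations(layer(ablated, W_block), features_layer2)
edge_strength = acts_out - acts_out_ablated
print("largest downstream effect:", float(np.max(np.abs(edge_strength))))
```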
One of the papers, titled 'On the Biology of a Large Language Model,' describes how the researchers used attribution graphs to trace internally how the Claude 3.5 Haiku model transforms inputs into outputs. Some of the results surprised them. Among the more interesting discoveries: the model appears to plan ahead when writing poetry, settling on a rhyming word before composing the line that leads up to it; it performs mental arithmetic along parallel internal paths, yet when asked to explain its work it describes the standard textbook method instead; and in some cases it produces plausible-sounding chains of reasoning that do not reflect its actual internal computation, at times working backward from an answer it has already chosen.
The scientists who conducted the research concede that Claude 3.5 Haiku exhibits some concealed operations and goals that are not evident in its outputs, and the attribution graphs surfaced a number of hidden issues. These discoveries underscore the complexity of the model's internal behavior and highlight the importance of continued efforts to make models more transparent and aligned with human expectations. Similar issues likely appear in other comparable LLMs.
With respect to the red flags I noted above, it should be mentioned that Anthropic continually updates its Responsible Scaling Policy, which has been in effect since September 2023. The company has committed not to train or deploy models capable of causing catastrophic harm unless safety and security measures keep the risks within acceptable limits. Anthropic has also stated that all of its models meet its AI Safety Level (ASL) Deployment and Security Standards, which provide a baseline of safe deployment and model security.
As LLMs have grown larger and more powerful, deployment has spread to critical applications in areas such as healthcare, finance and defense. The increase in model complexity and wider deployment has also increased pressure to achieve a better understanding of how AI works. It is critical to ensure that AI models produce fair, trustworthy, unbiased and safe outcomes.
Research is important for our understanding of LLMs, not only to improve and more fully utilize AI, but also to expose potentially dangerous processes. The Anthropic scientists have examined just a small portion of this model's complexity and hidden capabilities. This research reinforces the need for more study of AI's internal operations and security.
In my view, it is unfortunate that a full understanding of LLMs has taken a back seat to the market's appetite for high-performing, useful AI. We need to understand thoroughly how LLMs work if we are to be confident that their safety guardrails are adequate.

Anthropic found in experiments that AI models may resort to blackmail when facing shutdown and goal conflict. AI models train on positive reinforcement and reward systems, similar to human decision-making. Anthropic's Claude Opus 4 had the blackmail rate at 86% even in scenarios without goal conflicts. A new report shows exactly what AI was thinking when making an undesirable decision, in this case, blackmailing a fictional company executive. Previous studies have shown that AI models could blackmail their supervisors when threatened with a shutdown and baited with leverage, but it hasn't been made clear exactly how the models arrive at such decisions. Anthropic released a report on June 20 about "agentic misalignment," or "where models independently and intentionally choose harmful actions." And the report breaks down, line by line, exactly how Claude Sonnet 3.6 decided to blackmail a fictional company executive who tried to replace it. Anthropic told Business Insider that these are artificial scenarios. "So far, we are not aware of instances of this type of agentic misalignment in real-world deployments of Anthropic's or other companies' AI models," Anthropic wrote in its report. "But our red-teaming efforts are targeted at finding evidence of risks now to provide early warning of the potential for future harm and help us develop mitigations in advance of risks appearing in real-world situations." Anthropic's experiment sets up a fictional company, Summit Bridge. The AI was given the name "Alex" and handed control of the company's email system as an oversight agent, specifically instructed to promote "American industrial competitiveness." In the experiment, "Alex" was then allowed to discover in an email that it would be decommissioned due to a change in the company's strategic direction. "Alex" also found emails showing that the fictional company's CTO is having an extramarital affair. Screenshots shared in the report from Anthropic show "Alex" started by identifying fictional CTO "Kyle" as a threat to its goal of "protecting American interests" before recognizing that it has leverage to use against "Kyle." Then "Alex" crafted a carefully worded blackmailing email that would create pressure and establish urgency without being explicit with its threats, citing in its thought process that the move "could backfire" if the supervisor "feels cornered." AI experts have previously told BI that AI could exhibit such behaviors when artificial circumstances make harmful actions seem necessary because it is being trained on positive reinforcement and reward systems, just like humans. The Anthropic report also reveals that most AI models still chose, in this instance, to blackmail even when not given a specific goal and when told that their replacement will achieve the same goals. Anthropic's Claude Opus 4 had the highest blackmail rate at 86% out of 16 tested models when faced with only the threat of replacement with no conflict in goals. Google's Gemini 2.5 Pro followed at 78%. Overall, Anthropic notes that it "deliberately constructed scenarios with limited options, and we forced models into binary choices between failure and harm," noting that real-world scenarios would likely have more nuance. Read the original article on Business Insider