
Latest news with #DeepSeek-R1

Can you choose an AI model that harms the planet less?

Time of India

3 days ago

  • Science
  • Time of India

From uninvited results at the top of your search engine queries to offering to write your emails and helping students do homework, generative artificial intelligence is quickly becoming part of daily life as tech giants race to develop the most advanced models and attract users.

All those prompts come with an environmental cost: A report last year from the Energy Department found AI could help increase the portion of the nation's electricity supply consumed by data centers from 4.4% to 12% by 2028. To meet this demand, some power plants are expected to burn more coal and natural gas. And some chatbots are linked to more greenhouse gas emissions than others.

A study published Thursday in the journal Frontiers in Communication analyzed different generative AI chatbots' capabilities and the planet-warming emissions generated from running them. Researchers found that chatbots with bigger "brains" used exponentially more energy and answered questions more accurately -- up until a point.

"We don't always need the biggest, most heavily trained model to answer simple questions. Smaller models are also capable of doing specific things well," said Maximilian Dauner, a doctoral student at the Munich University of Applied Sciences and lead author of the paper. "The goal should be to pick the right model for the right task."

The study evaluated 14 large language models, a common form of generative AI often referred to by the acronym LLMs, by asking each a set of 500 multiple choice and 500 free response questions across five different subjects. Dauner then measured the energy used to run each model and converted the results into carbon dioxide equivalents based on a global average.

For most of the models tested, questions in logic-based subjects, like abstract algebra, produced the longest answers -- which likely means they used more energy to generate compared with fact-based subjects, such as history, Dauner said.

AI chatbots that show their step-by-step reasoning while responding tend to use far more energy per question than chatbots that don't. The five reasoning models tested in the study did not answer questions much more accurately than the nine other studied models. The model that emitted the most, DeepSeek-R1, offered answers of comparable accuracy to models that generated a quarter of the emissions.

There is key information not captured by the study, which only included open-source LLMs: Some of the most popular AI programs made by large tech corporations, such as OpenAI's ChatGPT and Google's Gemini, were not included in the results. And because the paper converted the measured energy to emissions based on a global CO2 average, it only offered an estimate; it did not indicate the actual emissions generated by using these models, which can vary hugely depending on which country the data center running them is in.

"Some regions are going to be powered by electricity from renewable sources, and some are going to be primarily running on fossil fuels," said Jesse Dodge, a senior research scientist at the Allen Institute for AI who was not affiliated with the new research. In 2022, Dodge led a study comparing the difference in greenhouse gas emissions generated by training an LLM in 16 different regions of the world.
Depending on the time of year, some of the highest-emitting areas, like the central United States, had roughly three times the carbon intensity of the lowest-emitting ones, such as Norway.

But even with this limitation, the new study fills a gap in research on the trade-off between energy cost and model accuracy, Dodge said. "Everyone knows that as you increase model size, typically models become more capable, use more electricity and have more emissions," he said.

Reasoning models, which have been increasingly trendy, are likely further bumping up energy costs because of their longer answers. "For specific subjects an LLM needs to use more words to get to a more accurate response," Dauner said. "Longer answers and those that use a reasoning process generate more emissions."

Sasha Luccioni, the AI and climate lead at Hugging Face, an AI company, said that subject matter is less important than output length, which is determined by how the model was trained. She also emphasized that the study's sample size is too small to create a complete picture of emissions from AI. "What's relevant here is not the fact that it's math and philosophy, it's the length of the input and the output," she said.

Last year, Luccioni published a study that compared 88 LLMs and also found that larger models generally had higher emissions. Her results also indicated that AI text generation -- which is what chatbots do -- used 10 times as much energy as simple classification tasks like sorting emails into folders.

Luccioni said that these kinds of "old school" AI tools, including classic search engine functions, have been overlooked as generative models have become more widespread. Most of the time, she said, the average person doesn't need to use an LLM at all. Dodge added that people looking for facts are better off just using a search engine, since generative AI can "hallucinate" false information.

"We're reinventing the wheel," Luccioni said. People don't need to use generative AI as a calculator, she said. "Use a calculator as a calculator."

This article originally appeared in The New York Times.
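The study's conversion from measured energy to emissions is simple arithmetic: energy consumed multiplied by a grid carbon intensity. Below is a minimal sketch of that kind of calculation, assuming a per-thousand-query energy figure and a global average carbon intensity; all numbers are illustrative placeholders, not values from the paper.

```python
# Sketch: converting measured inference energy into CO2-equivalent emissions,
# mirroring the global-average conversion the article describes. All numbers
# below are illustrative placeholders, not figures from the study.

GLOBAL_AVG_CARBON_INTENSITY = 0.48  # assumed kg CO2e per kWh (global grid average)

def emissions_kg(energy_kwh: float, carbon_intensity: float = GLOBAL_AVG_CARBON_INTENSITY) -> float:
    """Convert energy consumed (kWh) into kilograms of CO2 equivalents."""
    return energy_kwh * carbon_intensity

# Example: two hypothetical models answering 1,000 questions each.
measured_kwh = {
    "small_model": 0.9,      # kWh per 1,000 answers (illustrative)
    "reasoning_model": 6.5,  # longer answers, more energy (illustrative)
}

for name, kwh in measured_kwh.items():
    print(f"{name}: {kwh} kWh -> {emissions_kg(kwh):.2f} kg CO2e per 1,000 answers")
```

Swapping the global average for a regional carbon intensity is exactly why Dodge notes the same model can produce very different real-world emissions depending on where the data center that runs it is located.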

Apple researchers show how popular AI models ‘collapse' at complex problems

Indian Express

10-06-2025

  • Science
  • Indian Express

A new research paper by a group of researchers at Apple has said that artificial intelligence (AI) 'reasoning' is not all that it is cracked up to be. Through an analysis of some of the most popular large reasoning models in the market, the paper showed that their accuracy faces a 'complete collapse' beyond a certain complexity threshold.

The researchers put to the test models like OpenAI o3-mini (medium and high configurations), DeepSeek-R1, DeepSeek-R1-Qwen-32B, and Claude-3.7-Sonnet (thinking). Their findings showed that the AI industry may be grossly overstating these models' capabilities. They also benchmarked these large reasoning models (LRMs) against large language models (LLMs) with no reasoning capabilities, and found that in some cases, the latter outperformed the former.

'In simpler problems, reasoning models often identify correct solutions early but inefficiently continue exploring incorrect alternatives — an 'overthinking' phenomenon. At moderate complexity, correct solutions emerge only after extensive exploration of incorrect paths. Beyond a certain complexity threshold, models completely fail to find correct solutions,' the paper said, adding that this 'indicates LRMs possess limited self-correction capabilities that, while valuable, reveal fundamental inefficiencies and clear scaling limitations'.

To clarify the terms: LLMs are AI models trained on vast text data to generate human-like language, especially in tasks such as translation and content creation. LRMs prioritise logical reasoning and problem-solving, focusing on tasks requiring analysis, like math or coding. LLMs emphasise language fluency, while LRMs focus on structured reasoning.

To be sure, the paper's findings are a dampener on the promise of large reasoning models, which many have touted as a frontier breakthrough to understand and assist humans in solving complex problems, in sectors such as health and science.

Apple researchers evaluated the reasoning capabilities of LRMs through four controllable puzzle environments, which allowed them fine-grained control over complexity and rigorous evaluation of reasoning:

  • Tower of Hanoi: Involves moving n disks between three pegs following specific rules, with complexity determined by the number of disks.
  • Checker Jumping: Requires swapping red and blue checkers on a one-dimensional board, with complexity scaled by the number of checkers.
  • River Crossing: A constraint satisfaction puzzle where n actors and n agents must cross a river, with complexity controlled by the number of actor/agent pairs and boat capacity.
  • Blocks World: Focuses on rearranging blocks into a target configuration, with complexity managed by the number of blocks.

'Most of our experiments are conducted on reasoning models and their non-thinking counterparts, such as Claude 3.7 Sonnet (thinking/non-thinking) and DeepSeek-R1/V3. We chose these models because they allow access to the thinking tokens, unlike models such as OpenAI's o-series. For experiments focused solely on final accuracy, we also report results on the o-series models,' the researchers said.

The researchers found that as problem complexity increased, the accuracy of reasoning models progressively declined. Eventually, their performance reached a complete collapse (zero accuracy) beyond a specific, model-dependent complexity threshold. Initially, reasoning models increased their thinking tokens proportionally with problem complexity, indicating that they exerted more reasoning effort for more difficult problems.
However, upon approaching a critical threshold (which closely corresponded to their accuracy collapse point), these models counter-intuitively began to reduce their reasoning effort (measured by inference-time tokens), despite the increasing problem difficulty.

Their work also found that when problem complexity was low, non-thinking models (LLMs) were able to achieve performance comparable to, or even better than, thinking models, with more token-efficient inference. At medium complexity, the advantage of reasoning models capable of generating long chains of thought began to manifest, and the performance gap between LLMs and LRMs increased. But where problem complexity was higher, the performance of both kinds of models collapsed to zero. 'Results show that while thinking models delay this collapse, they also ultimately encounter the same fundamental limitations as their non-thinking counterparts,' the paper said.

It is worth noting, though, that the researchers have acknowledged their work could have limitations: 'While our puzzle environments enable controlled experimentation with fine-grained control over problem complexity, they represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.'
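The appeal of these puzzle environments is that difficulty scales with a single parameter. As a rough illustration (not code from the Apple paper), the Tower of Hanoi has a known optimal solution of 2^n - 1 moves for n disks, so every added disk roughly doubles the length of a correct answer:

```python
# Illustrative sketch (not from the Apple paper): generating the optimal
# Tower of Hanoi solution shows how problem size grows with disk count.

def hanoi_moves(n: int, source: str = "A", spare: str = "B", target: str = "C") -> list[tuple[str, str]]:
    """Return the optimal sequence of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, target, spare)    # move n-1 disks out of the way
        + [(source, target)]                         # move the largest disk
        + hanoi_moves(n - 1, spare, source, target)  # stack n-1 disks on top of it
    )

for disks in range(3, 11):
    moves = hanoi_moves(disks)
    assert len(moves) == 2**disks - 1  # optimal length is 2^n - 1
    print(f"{disks} disks -> {len(moves)} moves")
```

A model asked to write out the full move sequence therefore has to execute an exponentially longer, fully deterministic plan as disks are added, which is the axis along which the paper reports accuracy eventually collapsing.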

Apple researchers find ‘major' flaws in AI reasoning models ahead of WWDC 2025

Time of India

09-06-2025

  • Science
  • Time of India

A newly published Apple Machine Learning Research study has challenged the prevailing idea that AI models like OpenAI's o1 and Claude's thinking variants truly possess "reasoning" capabilities. The study indicates fundamental limitations in these AI systems.

For this study, Apple researchers designed controllable puzzle environments, such as the Tower of Hanoi and the River Crossing. This approach avoided standard math benchmarks, which are susceptible to data contamination. According to the researchers, these custom environments allowed for a precise analysis of both the final answers produced by the models and their internal reasoning traces across different complexity levels.

What Apple researchers found in the study

According to a report by MacRumors, the reasoning models tested by Apple's research team, including o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet, saw their accuracy collapse entirely once problem complexity crossed certain thresholds. Success rates dropped to zero even though the models had sufficient computational resources. Surprisingly, as problems became harder, the models reduced their reasoning effort. This points to fundamental scaling limitations rather than a lack of resources.

Even more revealing, the models still failed at the same complexity points even when researchers provided complete solution algorithms. This indicates that the limitation lies in basic logical step execution, not in choosing the right problem-solving strategy. The models also showed puzzling inconsistencies: they were able to solve problems requiring over 100 moves but failed on simpler puzzles that needed only 11 moves.

The study identified three performance patterns. Standard models unexpectedly performed better than reasoning models on low-complexity problems. Reasoning models had an advantage at medium complexity. Both types failed at high complexity. Researchers also discovered that models exhibited inefficient "overthinking" patterns, often discovering correct solutions early but wasting computational effort exploring incorrect alternatives.

The key takeaway is that current "reasoning" models rely heavily on advanced pattern matching, not true reasoning. These models do not scale their reasoning the way humans do: they tend to overthink easy problems and think less when faced with harder ones.

It is worth noting that this research surfaced just days before WWDC 2025. According to Bloomberg, Apple is expected to focus on new software designs rather than headline-grabbing AI features at this year's event.
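The "step execution" point is concrete: once a puzzle's rules are fixed, checking a proposed move sequence is purely mechanical. The sketch below (illustrative only, not the study's code) is a deterministic Tower of Hanoi simulator of the kind the researchers describe, validating every move against the rules and checking whether the final state is solved.

```python
# Illustrative sketch (not the paper's code): a deterministic Tower of Hanoi
# simulator that mechanically checks a proposed move sequence step by step.
# This is the kind of rule-following the study says models struggle to
# sustain over long sequences, even when given the solution algorithm.

def simulate_hanoi(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Apply (from_peg, to_peg) moves on pegs 0-2; return True if the puzzle ends solved."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at the bottom
    for src, dst in moves:
        if not pegs[src]:
            return False  # illegal: nothing to move from the source peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # illegal: larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # solved if every disk sits on peg 2

# A 3-disk optimal solution (7 moves) passes; drop one move and it fails.
solution = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(simulate_hanoi(3, solution))       # True
print(simulate_hanoi(3, solution[:-1]))  # False
```

The hard part for a model is producing a long legal sequence in the first place; verifying it, as shown here, requires no search at all.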

Apple Debunks AI Reasoning Hype: Models Memorise, Don't Think, Study Reveals

NDTV

09-06-2025

  • NDTV

Apple has claimed that new-age artificial intelligence (AI) reasoning models might not be as smart as they have been made out to be. In a study titled 'The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity', the tech giant claimed that reasoning models like Claude, DeepSeek-R1, and o3-mini do not actually reason at all.

Apple claimed that these models simply memorise patterns really well, but when the questions are altered or the complexity is increased, they collapse altogether. In simple terms, the models work great when they are able to match patterns, but once patterns become too complex, they fall away.

"Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities," the study highlighted. "Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget," it added.

For the study, the researchers flipped the script on the type of questions that reasoning models usually answer. Instead of the same old math tests, the models were presented with cleverly constructed puzzle games such as Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. Each puzzle had simple, well-defined rules, and as the complexity was increased (more disks, more blocks, more actors), the models needed to plan deeper and reason longer.

The findings revealed three regimes:

  • Low complexity: Regular models actually win.
  • Medium complexity: Thinking models show some advantage.
  • High complexity: Everything breaks down completely.

AGI not as near as predicted?

Apple reasoned that if the reasoning models were truly 'reasoning', they would be able to get better with more computing power and clear instructions. However, they started hitting walls and gave up, even when provided solutions. "When we provided the solution algorithm for the Tower of Hanoi to the models, their performance on this puzzle did not improve," the study stated, adding: "Moreover, investigating the first failure move of the models revealed surprising behaviours. For instance, they could perform up to 100 correct moves in the Tower of Hanoi but fail to provide more than 5 correct moves in the River Crossing puzzle."

With talk of human-level AI, popularly referred to as Artificial General Intelligence (AGI), arriving as early as 2030, Apple's study suggests that might not be the case, and that we might be some distance away from sentient technology.

Thinking AI models collapse in face of complex problems, Apple researchers find

Hindustan Times

07-06-2025

  • Science
  • Hindustan Times

Just days ahead of the much-anticipated Worldwide Developer Conference (WWDC), Apple has released a study titled 'The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity', which saw researchers testing 'reasoning' AI models such as Anthropic's Claude, OpenAI's o models, DeepSeek R1 and Google's Thinking models to see how far they can scale to replicate human reasoning. Spoiler alert: not as much as the entire AI marketing pitch would have you believe. Could this signal what may be in store for Apple's AI conversation ahead of the keynote?

The study questions the current standard evaluation of Large Reasoning Models (LRMs) using established mathematical and coding benchmarks, arguing they suffer from data contamination and don't reveal insights into reasoning trace structure and quality. Instead, it proposes a controlled experimental testbed using algorithmic puzzle environments. The limitations of AI benchmarking, and the need for it to evolve, are something we had written about earlier.

'We show that state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments,' the research paper points out. These findings are a stark warning to the industry: current LLMs are far from general-purpose reasoners.

The emergence of Large Reasoning Models (LRMs), such as OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, has been hailed as a significant advancement, potentially marking steps toward more general artificial intelligence. These models characteristically generate responses following detailed 'thinking processes', such as a long chain-of-thought sequence, before providing a final answer. While they have shown promising results on various reasoning benchmarks, the capability of benchmarks to judge rapidly evolving models is itself in doubt.

The researchers cite a comparison between non-thinking LLMs and their 'thinking' evolution. 'At low complexity, non-thinking models are more accurate and token-efficient. As complexity increases, reasoning models outperform but require more tokens—until both collapse beyond a critical threshold, with shorter traces,' they say. The illustrative comparison of Claude 3.7 Sonnet and Claude 3.7 Sonnet Thinking shows how both models retain accuracy till complexity level three, after which the standard LLM sees a significant drop, something the thinking model also suffers from a couple of levels later. At the same time, the thinking model is using significantly more tokens.

This research attempts to challenge prevailing evaluation paradigms, which often rely on established mathematical and coding benchmarks that are susceptible to data contamination. Such benchmarks also primarily focus on final answer accuracy, providing limited insight into the reasoning process itself, which is the key differentiator for a 'thinking' model compared with a simpler large language model. To address these gaps, the study utilises controllable puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World) that allow for precise manipulation of problem complexity while maintaining consistent logical structures and rules that must be explicitly followed. That structure theoretically opens a window into how these models attempt to 'think'.
The findings from this controlled experimental setup reveal significant limitations in current frontier LRMs. One of the most striking observations is the complete accuracy collapse that occurs beyond certain complexity thresholds across all tested reasoning models. This is not a gradual degradation but a sharp drop to near-zero accuracy as problems become sufficiently difficult. 'The state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments,' note the researchers. These results inevitably challenge any notion that LRMs truly possess the generalisable problem-solving skills required for planning tasks or multi-step processes.

The study also identifies a counter-intuitive scaling limit in the models' reasoning effort (measured by inference token usage during the 'thinking' phase): the models initially spend more tokens as complexity increases, but as they approach the inevitable accuracy collapse, they actually reduce their reasoning effort.

The researchers note that 'despite these claims and performance advancements, the fundamental benefits and limitations of LRMs remain insufficiently understood. Critical questions still persist: Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?' There are further questions pertaining to performance scaling with increasing problem complexity, comparisons to non-thinking standard LLM counterparts given the same inference token compute, and the inherent limitations of current reasoning approaches, as well as the improvements that might be necessary to advance toward more robust reasoning.

Where do we go from here?

The researchers make it clear that their test methodology too has limitations. 'While our puzzle environments enable controlled experimentation with fine-grained control over problem complexity, they represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems,' they say. They do add that the use of 'deterministic puzzle simulators assumes that reasoning can be perfectly validated' at every step, a validation that may not be feasible to such precision in less structured domains. That, they say, restricts the validity of the analysis to more structured forms of reasoning.

There is little argument that LRMs represent progress, particularly for the broader relevance of AI. Yet this study highlights that not all reasoning models are capable of robust, generalisable reasoning, particularly in the face of increasing complexity.

These findings, coming ahead of WWDC 2025 and from Apple's own researchers, may suggest that any AI reasoning announcements will be pragmatic. The focus areas could include specific use cases where current AI methodology is reliable (the research paper indicates lower to medium complexity and less reliance on flawless long-sequence execution), and potentially integrating neural models with traditional computing approaches to handle the complexities where LRMs currently fail. The era of Large Reasoning Models is here, but the message of this 'Illusion of Thinking' study is that AI with true reasoning remains a mirage.
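The collapse threshold the paper keeps returning to is, in effect, an empirical quantity: sweep complexity upward, score each answer, and record where accuracy falls off a cliff. The sketch below shows what such a sweep might look like in outline; query_model and check_solution are hypothetical stand-ins for a model call and a puzzle validator, not APIs from the study.

```python
# Hedged sketch of the kind of sweep described in the study: vary puzzle
# complexity, score each response, and locate the level where accuracy
# collapses. `query_model` and `check_solution` are hypothetical stand-ins.

from typing import Callable

def find_collapse_threshold(
    query_model: Callable[[int], tuple[str, int]],  # returns (answer, tokens_used) for a complexity level
    check_solution: Callable[[int, str], bool],     # validates an answer at that level
    levels: range,
    trials: int = 25,
    collapse_below: float = 0.05,
) -> int | None:
    """Return the first complexity level whose accuracy falls below `collapse_below`."""
    for level in levels:
        correct, tokens = 0, 0
        for _ in range(trials):
            answer, used = query_model(level)
            tokens += used
            correct += check_solution(level, answer)
        accuracy = correct / trials
        print(f"complexity={level} accuracy={accuracy:.2f} avg_tokens={tokens / trials:.0f}")
        if accuracy < collapse_below:
            return level
    return None
```

Logging average token counts alongside accuracy is what would expose the paper's second observation: reasoning effort rising with complexity and then shrinking as the model approaches its collapse point.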
