Thinking AI models collapse in face of complex problems, Apple researchers find

Hindustan Times | 07-06-2025

Just days ahead of the much-anticipated Worldwide Developer Conference (WWDC), Apple has released a study titled 'The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity', in which researchers tested 'reasoning' AI models such as Anthropic's Claude, OpenAI's o models, DeepSeek R1 and Google's Thinking models to see how far they can scale to replicate human reasoning. Spoiler alert: not as far as the AI marketing pitch would have you believe. Could this signal what may be in store for Apple's AI conversation ahead of the keynote?
The study questions the current standard evaluation of Large Reasoning Models (LRMs) using established mathematical and coding benchmarks, arguing they suffer from data contamination and don't reveal insights into the structure and quality of reasoning traces. Instead, it proposes a controlled experimental testbed using algorithmic puzzle environments. The limitations of AI benchmarking, and the need for it to evolve, are something we had written about earlier.
'We show that state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments,' the research paper points out. These findings are a stark warning to the industry: current LLMs are far from general-purpose reasoners.
The emergence of Large Reasoning Models (LRMs), such as OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, has been hailed as a significant advancement, potentially marking steps toward more general artificial intelligence. These models characteristically generate responses following detailed 'thinking processes', such as a long Chain-of-Thought sequence, before providing a final answer. While they have shown promising results on various reasoning benchmarks, the ability of those benchmarks to judge rapidly evolving models is itself in doubt.
The researchers cite a comparison between non-thinking LLMs and their 'thinking' evolutions. 'At low complexity, non-thinking models are more accurate and token-efficient. As complexity increases, reasoning models outperform but require more tokens—until both collapse beyond a critical threshold, with shorter traces,' they say. The example of Claude 3.7 Sonnet and Claude 3.7 Sonnet Thinking shows both models retaining accuracy up to complexity level three, after which the standard LLM sees a significant drop; the thinking model suffers the same fate a couple of levels later, while using significantly more tokens.
This research challenges prevailing evaluation paradigms, which often rely on established mathematical and coding benchmarks that are susceptible to data contamination. Such benchmarks also primarily focus on final-answer accuracy, providing limited insight into the reasoning process itself, which is the key differentiator for a 'thinking' model compared with a simpler large language model. To address these gaps, the study uses controllable puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World) that allow precise manipulation of problem complexity while maintaining consistent logical structures and rules that must be explicitly followed. That structure theoretically opens a window into how these models attempt to 'think'.
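To make that setup concrete, here is a minimal Python sketch (an illustration only, not the paper's actual evaluation harness) of a deterministic Tower of Hanoi simulator that checks a model's proposed move sequence step by step, with problem complexity controlled simply by the number of disks.

# Illustrative sketch: a deterministic Tower of Hanoi simulator that validates
# a model's proposed move list move by move. Complexity is controlled by the
# number of disks; the paper's actual harness may differ.
def validate_hanoi(num_disks, moves):
    """Return (solved, index_of_first_invalid_move_or_None)."""
    # Pegs 0, 1, 2; disk 1 is the smallest. All disks start on peg 0.
    pegs = [list(range(num_disks, 0, -1)), [], []]
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, i                      # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i                      # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    solved = len(pegs[2]) == num_disks           # success: all disks on the last peg
    return solved, None

# Example: the optimal 7-move solution for 3 disks passes validation.
optimal_3 = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(validate_hanoi(3, optimal_3))   # (True, None)

Because every move is checked against the rules, a simulator like this can pinpoint exactly where in its trace a model goes wrong, not just whether the final answer is correct.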
The findings from this controlled experimental setup reveal significant limitations in current frontier LRMs. One of the most striking observations is the complete accuracy collapse that occurs beyond certain complexity thresholds across all tested reasoning models. This is not a gradual degradation but a sharp drop to near-zero accuracy as problems become sufficiently difficult.
These results inevitably challenge any notion that LRMs truly possess the generalisable problem-solving skills required for planning tasks or multi-step processes. The study also identifies a counter-intuitive scaling limit in the models' reasoning effort (measured by inference token usage during the 'thinking' phase): the models initially spend more tokens as complexity increases, but then actually reduce reasoning effort as they approach the inevitable accuracy collapse.
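In code terms, the kind of measurement this implies looks roughly like the Python sketch below, which sweeps complexity levels and records accuracy alongside thinking-token usage; query_model is a hypothetical placeholder for a real model API call and would need to be wired up to an actual provider.

# Illustrative sketch of a complexity sweep; not the paper's code.
from statistics import mean

def query_model(puzzle_prompt):
    """Hypothetical stand-in for a reasoning-model call; assumed to return the
    proposed solution plus the number of tokens spent in the thinking trace."""
    raise NotImplementedError("wire up a model provider's API here")

def sweep_complexity(make_puzzle, check_solution, levels, trials=10):
    # For each complexity level, run several puzzle instances and average
    # accuracy and thinking-token usage.
    results = {}
    for level in levels:
        scores, thinking_tokens = [], []
        for _ in range(trials):
            prompt, spec = make_puzzle(level)        # e.g. Tower of Hanoi with `level` disks
            solution, tokens = query_model(prompt)
            scores.append(1.0 if check_solution(spec, solution) else 0.0)
            thinking_tokens.append(tokens)
        results[level] = (mean(scores), mean(thinking_tokens))
    return results

The pattern the researchers describe would show up in such a sweep as accuracy falling to near zero beyond some level, with average thinking tokens also declining as that threshold approaches.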
'Despite these claims and performance advancements, the fundamental benefits and limitations of LRMs remain insufficiently understood. Critical questions still persist: Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?' the researchers ask. There are further questions about how performance scales with increasing problem complexity, how these models compare with their non-thinking standard LLM counterparts when given the same inference token compute, and about the inherent limitations of current reasoning approaches, as well as the improvements that might be necessary to advance toward more robust reasoning.
Where do we go from here?
The researchers make it clear that their test methodology, too, has limitations. 'While our puzzle environments enable controlled experimentation with fine-grained control over problem complexity, they represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge intensive reasoning problems,' they say. They add that the use of 'deterministic puzzle simulators assumes that reasoning can be perfectly validated' at every step, a level of validation that may not be feasible in less structured domains. That, they say, limits how far this analysis can be extended to broader reasoning tasks.
There is little argument that LRMs represent progress. Yet this study highlights that not all reasoning models are capable of robust, generalisable reasoning, particularly in the face of increasing complexity. These findings, arriving ahead of WWDC 2025 and from Apple's own researchers, suggest that any AI reasoning announcements will likely be pragmatic. The focus areas could include specific use cases where current AI methodology is reliable (the paper points to lower-to-medium complexity tasks with less reliance on flawless long-sequence execution) and potentially integrating neural models with traditional computing approaches to handle the complexities where LRMs currently fail. The era of Large Reasoning Models is here, but the message of this 'Illusion of Thinking' study is that AI with true reasoning remains a mirage.

