Latest news with #OpenAIo3-mini


Indian Express
10-06-2025
- Science
Apple researchers show how popular AI models ‘collapse' at complex problems
A new research paper by a group of researchers at Apple argues that artificial intelligence (AI) 'reasoning' is not all it is cracked up to be. Through an analysis of some of the most popular large reasoning models on the market, the paper showed that their accuracy faces a 'complete collapse' beyond a certain complexity threshold. The researchers put to the test models like OpenAI o3-mini (medium and high configurations), DeepSeek-R1, DeepSeek-R1-Distill-Qwen-32B, and Claude-3.7-Sonnet (thinking). Their findings showed that the AI industry may be grossly overstating these models' capabilities. They also benchmarked these large reasoning models (LRMs) against large language models (LLMs) with no reasoning capabilities, and found that in some cases the latter outperformed the former.

'In simpler problems, reasoning models often identify correct solutions early but inefficiently continue exploring incorrect alternatives — an 'overthinking' phenomenon. At moderate complexity, correct solutions emerge only after extensive exploration of incorrect paths. Beyond a certain complexity threshold, models completely fail to find correct solutions,' the paper said, adding that this 'indicates LRMs possess limited self-correction capabilities that, while valuable, reveal fundamental inefficiencies and clear scaling limitations'.

For clarity on the terms: LLMs are AI models trained on vast text data to generate human-like language, especially in tasks such as translation and content creation. LRMs prioritise logical reasoning and problem-solving, focusing on tasks requiring analysis, like math or coding. In short, LLMs emphasise language fluency, while LRMs focus on structured reasoning.

To be sure, the paper's findings are a dampener on the promise of large reasoning models, which many have touted as a frontier breakthrough for understanding and assisting humans in solving complex problems, in sectors such as health and science.

Apple researchers evaluated the reasoning capabilities of LRMs through four controllable puzzle environments, which gave them fine-grained control over complexity and allowed rigorous evaluation of reasoning:

- Tower of Hanoi: moving n disks between three pegs following specific rules, with complexity determined by the number of disks (see the sketch after this list).
- Checker Jumping: swapping red and blue checkers on a one-dimensional board, with complexity scaled by the number of checkers.
- River Crossing: a constraint-satisfaction puzzle in which n actors and n agents must cross a river, with complexity controlled by the number of actor/agent pairs and the boat capacity.
- Blocks World: rearranging blocks into a target configuration, with complexity managed by the number of blocks.

'Most of our experiments are conducted on reasoning models and their non-thinking counterparts, such as Claude 3.7 Sonnet (thinking/non-thinking) and DeepSeek-R1/V3. We chose these models because they allow access to the thinking tokens, unlike models such as OpenAI's o-series. For experiments focused solely on final accuracy, we also report results on the o-series models,' the researchers said.
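To make the complexity knob concrete: for the Tower of Hanoi, the optimal solution for n disks takes 2^n - 1 moves, so each extra disk roughly doubles the work. The short Python sketch below illustrates that scaling; it is not the paper's evaluation code.

    # Illustration only: the Tower of Hanoi environment scales with a single
    # knob, the number of disks n; the optimal solution needs 2**n - 1 moves,
    # so difficulty grows exponentially.

    def hanoi(n, src="A", aux="B", dst="C", moves=None):
        """Return the optimal move list for n disks from src to dst."""
        if moves is None:
            moves = []
        if n == 1:
            moves.append((src, dst))
            return moves
        hanoi(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
        moves.append((src, dst))             # move the largest disk
        hanoi(n - 1, aux, src, dst, moves)   # restack the n-1 disks on top
        return moves

    for n in (3, 7, 10):
        print(n, "disks ->", len(hanoi(n)), "moves")  # prints 7, 127, 1023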
The researchers found that as problem complexity increased, the accuracy of reasoning models progressively declined, eventually reaching a complete collapse (zero accuracy) beyond a specific, model-dependent complexity threshold. Initially, reasoning models increased their thinking tokens proportionally with problem complexity, indicating that they exerted more reasoning effort on more difficult problems. However, upon approaching a critical threshold (which closely corresponded to their accuracy collapse point), the models counter-intuitively began to reduce their reasoning effort, measured in inference-time tokens, despite the increasing problem difficulty.

The work also found that at low problem complexity, non-thinking models (LLMs) matched or even beat thinking models while using tokens more efficiently at inference. At medium complexity, the advantage of reasoning models capable of generating long chains of thought began to show, and the performance gap between LLMs and LRMs widened. But at high problem complexity, the performance of both kinds of models collapsed to zero. 'Results show that while thinking models delay this collapse, they also ultimately encounter the same fundamental limitations as their non-thinking counterparts,' the paper said.

It is worth noting, though, that the researchers acknowledged their work could have limitations: 'While our puzzle environments enable controlled experimentation with fine-grained control over problem complexity, they represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.'
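As for how 'reasoning effort' is counted, the sketch below shows one way to tabulate thinking tokens against problem size. It is illustrative only: query_model and prompt_for are hypothetical stand-ins for an LRM API that exposes its chain-of-thought text and for a puzzle-prompt generator.

    # Illustrative sketch of measuring reasoning effort in inference-time
    # tokens as puzzle size grows. `query_model` and `prompt_for` are
    # hypothetical stand-ins, not a real API.

    def count_tokens(text: str) -> int:
        # Crude whitespace count; a real study would use the model's tokenizer.
        return len(text.split())

    def measure_effort(query_model, prompt_for, sizes):
        """Map each problem size n to the model's thinking-token count."""
        effort = {}
        for n in sizes:
            thinking, _answer = query_model(prompt_for(n))  # hypothetical call
            effort[n] = count_tokens(thinking)
        return effort

    # Per the paper, the resulting curve first rises with n, then falls
    # sharply near the model's accuracy-collapse point.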


Asia Times
02-05-2025
- Business
Alibaba's AI model Qwen3: A smart kid prone to hallucinations
Alibaba Group's newly released large language model Qwen3 has shown stronger mathematical problem-solving and code-writing abilities than the company's previous models and some American peers, putting it at the top of benchmark charts.

Qwen3 offers two mixture-of-experts (MoE) models (Qwen3-235B-A22B and Qwen3-30B-A3B) and six dense models. A MoE, an architecture also used by OpenAI's ChatGPT and Anthropic's Claude, routes each query to specialized 'expert' sub-models suited to the topic, so only a fraction of the network runs at a time (a sketch of the idea appears below). A dense model activates all of its parameters and can perform a wide range of tasks, such as image classification and natural language processing, by learning complex patterns in data.

Alibaba, a Hangzhou-based company, used 36 trillion tokens to train Qwen3, double the number used to train the Qwen2.5 model. DeepSeek, another Hangzhou-based firm, used 14.8 trillion tokens to train its R1 model. Broadly, the more tokens used in training, the more knowledgeable an AI model becomes. At the same time, Qwen3 has a lower deployment threshold than DeepSeek V3, meaning users can deploy it at lower operating cost and with reduced energy consumption: Qwen3-235B-A22B has 235 billion parameters but needs to activate only 22 billion of them at a time, while DeepSeek R1 has 671 billion parameters and activates 37 billion. Fewer active parameters mean lower operating costs.
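The arithmetic behind those 'active parameter' figures comes from top-k expert routing: each token touches only a few experts, so compute tracks the 22 billion active parameters rather than all 235 billion. Below is a toy NumPy sketch of the routing idea; real MoE layers such as Qwen3's are far more elaborate, so treat this purely as an illustration.

    # Toy sketch of top-k mixture-of-experts routing, illustration only.
    import numpy as np

    rng = np.random.default_rng(0)

    N_EXPERTS, TOP_K, D = 8, 2, 16          # 8 experts, 2 active per token
    experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]
    gate_w = rng.standard_normal((D, N_EXPERTS))

    def moe_forward(x):
        """Route token vector x to its top-k experts and mix their outputs."""
        logits = x @ gate_w
        top = np.argsort(logits)[-TOP_K:]        # pick the k best experts
        weights = np.exp(logits[top])
        weights /= weights.sum()                 # softmax over chosen experts
        return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

    token = rng.standard_normal(D)
    out = moe_forward(token)
    # Only TOP_K of N_EXPERTS expert blocks run per token, which is why a
    # 235B-parameter model can activate only ~22B parameters per step.
    print(out.shape, f"active expert fraction: {TOP_K}/{N_EXPERTS}")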
The US stock market slumped after DeepSeek launched its R1 model on January 20, with AI stock investors shocked by R1's high performance and low training costs. Media reports said DeepSeek would unveil its R2 model in May, and some AI fans expected R2 to have greater reasoning ability than R1 and to catch up with OpenAI o4-mini.

Since Alibaba released Qwen3 early on the morning of April 29, AI fans have run various tests to check its performance. The Yangtze Evening News reported that Qwen3 scored 70.7 on LiveCodeBench v5, which tests AI models' code-writing ability, beating DeepSeek R1 (64.3), OpenAI o3-mini (66.3), Gemini 2.5 Pro (70.4), and Grok 3 Beta (70.6). On AIME'24, which tests mathematical problem-solving, Qwen3 scored 85.7, better than DeepSeek R1 (79.8), OpenAI o3-mini (79.6), and Grok 3 Beta (83.9), though it lagged behind Gemini 2.5 Pro, which scored 92.

The newspaper's reporter found, however, that Qwen3 fails at complex reasoning tasks and lacks knowledge in some areas, resulting in 'hallucinations', the typical situation in which an AI model provides false information. 'We asked Qwen3 to write some stories in Chinese. We feel that the stories are more delicate and fluent than those written by previous AI models, but their flows and scenes are illogical,' the reporter said. 'The AI model seems to be putting everything together without thinking.'

In scientific reasoning, Qwen3 scored 70%, lagging behind Gemini 2.5 Pro (84%), OpenAI o3-mini (83%), Grok 3 mini (79%), and DeepSeek R1 (71%), according to Artificial Analysis, an independent AI benchmarking and analysis company. On Humanity's Last Exam, a test of reasoning and knowledge, Qwen3 scored 11.7%, beating Grok 3 mini (11.1%), Claude 3.7 (10.3%), and DeepSeek R1 (9.3%), but lagging behind OpenAI o3-mini (20%) and Gemini 2.5 Pro (17.1%).

In February this year, Microsoft chief executive Satya Nadella said that focusing on self-proclaimed milestones, such as achieving artificial general intelligence (AGI), is only a form of 'nonsensical benchmark hacking'. He said an AI model can declare victory only if it helps achieve 10% annual growth in gross domestic product.

While Chinese AI firms need more time to catch up with American players, they face a new challenge: a shortage of AI chips. In early April, Chinese media reported that ByteDance, Alibaba, and Tencent had ordered more than 100,000 H20 chips from Nvidia for 16 billion yuan (US$2.2 billion). On April 15, Nvidia said it had been informed by the US government that it would need a license to ship its H20 AI chips to China; the government cited the risk that Chinese firms would use the H20 chips in supercomputers. The Information reported on May 2 that Nvidia had told some of its biggest Chinese customers that it is tweaking the design of its AI chips so it can continue to ship them to China, with a sample of the new chip available as early as June.

Nvidia has tailored AI chips for the Chinese market several times already. After Washington restricted the export of A100 and H100 chips to China in October 2022, Nvidia designed the A800 and H800 chips. When the US government extended its export controls to cover those in October 2023, Nvidia unveiled the H20. Although the H20 delivers only about 15% of the H100's performance, Chinese firms are still rushing to buy it instead of Huawei's Ascend 910B chip, which is in limited supply due to a low production yield. A Chinese IT columnist said the Ascend 910B is a faster chip than the H20, but the H20's bandwidth is ten times the 910B's, and higher bandwidth in an AI chip, like a better gearbox in a sports car, delivers more stable performance.

The Application of Electronic Technique, a Chinese scientific journal, said China's AI firms could try homegrown chips, such as Cambricon Technologies' Siyuan 590, Hygon Information Technology's DCU series, Moore Threads' MTT S80, Biren Technology's BR104, or Huawei's upcoming Ascend 910C.