✨ NEW Paper from Apple: The Illusion of Thinking in LLMs

Apple researchers discuss the strengths and limitations of reasoning models. Apparently, reasoning models "collapse" beyond certain task complexities. Lots of important insights in this one. (bookmark it!)

Here are my notes:

⭐ Paper Overview

Investigates the capabilities and limitations of frontier Large Reasoning Models (LRMs) like Claude 3.7, DeepSeek-R1, and OpenAI's o-series by systematically analyzing their performance on reasoning tasks as a function of problem complexity. Rather than relying on conventional math benchmarks, which suffer from contamination and lack structural control, the authors evaluate LRMs on four controllable puzzles (Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World) that allow fine-grained complexity scaling and transparent trace analysis.

⭐ Three complexity regimes

The study identifies distinct performance phases. On low-complexity tasks, non-thinking LLMs outperform LRMs thanks to more efficient, direct computation. At medium complexity, reasoning models show an advantage, leveraging longer chains of thought to correct errors. At high complexity, however, all models, regardless of their reasoning scaffolds, collapse to near-zero accuracy.

⭐ Counterintuitive reasoning collapse

Surprisingly, LRMs reduce their reasoning effort (i.e., the number of tokens spent on thoughts) as problem complexity increases beyond a threshold. This points to an internal scaling failure caused not by token limits but by intrinsic model behavior.

⭐ Reasoning trace inefficiencies

LRMs frequently "overthink" simple problems, finding correct answers early but continuing to explore incorrect paths. On moderate tasks, they correct late; on complex ones, they fail to find any valid solution. Position-based accuracy analysis of the thoughts reveals systematic shifts in when correct solutions appear within the trace (a toy sketch of this kind of analysis is at the end of this post).

⭐ Failure to execute explicit algorithms

Even when supplied with correct pseudocode (e.g., the Tower of Hanoi recursion; see the first sketch below), models still failed at similar complexity points. This indicates that LRMs don't just struggle to find solutions; they can't reliably execute logical instructions either.

⭐ Inconsistent behavior across puzzles

Models could perform >100 correct steps in Tower of Hanoi (N=10) yet fail after 4 steps in River Crossing (N=3), suggesting performance correlates more with training-data familiarity than with inherent problem complexity.

Overall, this paper challenges the assumption that LRMs are progressing steadily toward generalizable reasoning. It argues that existing "thinking" enhancements provide local, not scalable, benefits, raising critical questions about inference-time scaling, symbolic reasoning, and the robustness of these emerging systems.
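To make the "explicit algorithm" point concrete, here is the classic recursive Tower of Hanoi solution in Python. This is a minimal sketch of the kind of algorithm the paper describes handing to the models, not the paper's exact prompt; it also shows why complexity ramps so fast, since the optimal solution length doubles with every extra disk (2^N - 1 moves).

```python
# Minimal sketch (not the paper's exact prompt): the classic recursive
# Tower of Hanoi algorithm. Per the paper, models given this recipe
# still failed at similar complexity points.

def hanoi(n, source, target, auxiliary, moves):
    """Append the optimal move sequence for n disks to `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, auxiliary, target, moves)  # clear the top n-1 disks
    moves.append((source, target))                  # move the largest disk
    hanoi(n - 1, auxiliary, target, source, moves)  # restack on top of it

for n in range(1, 11):
    moves = []
    hanoi(n, "A", "C", "B", moves)
    # Solution length doubles with each extra disk: 2^n - 1 moves.
    assert len(moves) == 2**n - 1
    print(f"N={n}: {len(moves)} moves")
```

At N=10 the optimal sequence is already 1,023 moves, so executing it flawlessly step by step is a long-horizon task even with the recipe in hand.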
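And here is a toy illustration of the position-based trace analysis mentioned in the notes. The names and setup are hypothetical (not the paper's code); it just shows the idea of recording where in a thought trace the first correct candidate solution appears.

```python
# Toy sketch of position-based trace analysis (hypothetical helper,
# not from the paper): given candidate solutions extracted from a
# reasoning trace, record *where* the first correct one shows up.

def first_correct_position(candidates, is_correct):
    """Return the relative position (0.0-1.0) of the first correct
    candidate in the trace, or None if none is correct."""
    for i, cand in enumerate(candidates):
        if is_correct(cand):
            return i / max(len(candidates) - 1, 1)
    return None

# On easy problems the correct answer tends to appear early (the model
# then keeps "overthinking"); on harder ones it drifts late or vanishes.
trace = ["wrong", "wrong", "right", "wrong", "wrong"]
pos = first_correct_position(trace, lambda c: c == "right")
print(pos)  # 0.5 -> first correct answer found halfway through the trace
```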