The Illusion of Thinking: A Critical Debate on AI Reasoning Capabilities
Apple's provocative research challenges the reasoning capabilities of Large Reasoning Models, sparking a heated debate that reveals as much about how we evaluate AI as it does about AI itself.
Author: Macaulan Serván-Chiaramonte
In 2025, Apple's Machine Learning Research team, led by Parshin Shojaee with co-authors including Samy Bengio and Mehrdad Farajtabar, released a paper titled "The Illusion of Thinking" that sent shockwaves through the AI community. The research claimed to demonstrate that Large Reasoning Models (LRMs) like OpenAI's o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking aren't truly reasoning; they're merely creating an illusion of thought. This bold assertion sparked immediate controversy and a swift rebuttal that raises fundamental questions about how we evaluate and understand AI capabilities.
Apple's Challenge to AI Reasoning
The Apple research team designed a series of experiments using controllable puzzle environments that allowed them to systematically scale complexity while maintaining consistent logical structures. Their approach centered on testing whether models could demonstrate genuine reasoning capabilities rather than sophisticated pattern matching.
Key Findings from Apple's Research
- Complete Accuracy Collapse: Models that achieved ~90% success on simpler problems dropped to near-zero accuracy with small increases in complexity.
- Counter-intuitive Scaling: As problems became more complex, models paradoxically decreased their reasoning effort despite having adequate token budgets.
- Three Performance Regimes: Low complexity (standard LLMs outperform LRMs), medium complexity (LRMs show advantage), and high complexity (both fail completely).
The researchers used four puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World), with the Tower of Hanoi puzzle serving as their flagship benchmark. In this classic problem, a stack of disks must be moved between three pegs one disk at a time, without ever placing a larger disk on a smaller one. Apple reported that models showed "complete accuracy collapse" when attempting to solve versions with 10 or more disks.
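Some quick arithmetic shows why complexity ramps up so fast: the optimal solution for n disks takes 2^n - 1 moves, so the length of a fully written-out answer roughly doubles with every added disk. A minimal illustration in Python (not Apple's evaluation code):

```python
# The minimal number of moves in an n-disk Tower of Hanoi solution is 2**n - 1,
# so the length of a fully written-out answer grows exponentially with disk count.
for n in (5, 10, 15, 20):
    print(f"{n} disks -> {2**n - 1} moves")
# 5 disks -> 31 moves
# 10 disks -> 1023 moves
# 15 disks -> 32767 moves
# 20 disks -> 1048575 moves
```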
"The 'thinking' displayed by LRMs may be more performative than functional... Models appear to rely on pattern matching rather than true logical reasoning."
The Swift Rebuttal: "The Illusion of the Illusion of Thinking"
In June 2025, C. Opus (Anthropic's Claude Opus model) and A. Lawsen of Open Philanthropy published a comprehensive rebuttal titled "The Illusion of the Illusion of Thinking: A Comment on Shojaee et al. (2025)." This response paper, notable as an academic comment co-authored by an AI critiquing research about AI capabilities, systematically challenged Apple's methodology and conclusions.
Critical Flaws Identified in Apple's Methodology
1. Token Limit Misinterpretation
The rebuttal demonstrated that what Apple interpreted as "reasoning collapse" was actually models hitting output token limits. As the authors noted, "The question isn't whether LRMs can reason, but whether our evaluations can distinguish reasoning from typing." For example:
- A 15-disk Tower of Hanoi solution requires 2^15 - 1 = 32,767 moves to print
- Models explicitly stated constraints like "I'll stop here to avoid making this too long"
- The issue was a practical output limitation, not a reasoning failure, as the rough arithmetic sketched below illustrates
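A back-of-the-envelope sketch makes the token argument concrete. The tokens-per-move figure and the output cap below are illustrative assumptions, not numbers taken from either paper:

```python
# Rough sketch: why merely *printing* a 15-disk solution can exhaust an output
# budget. Both constants below are illustrative assumptions, not measured values.
moves_required = 2**15 - 1        # 32,767 moves in the optimal solution
tokens_per_move = 10              # assumed cost of a line like "Move disk 3 from A to C"
output_cap = 64_000               # assumed output-token limit of a typical model

tokens_needed = moves_required * tokens_per_move
print(tokens_needed)                  # 327,670 tokens just to transcribe the answer
print(tokens_needed > output_cap)     # True: the cap binds long before reasoning does
```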
2. Flawed Evaluation Framework
Apple's automated evaluation system couldn't distinguish between:
- Actual reasoning failures
- Practical output constraints
- Models correctly identifying unsolvable problems
3. Impossible Problem Instances
Apple included mathematically unsolvable River Crossing puzzles (for N≥6 with boat capacity of 3) in their test set, then penalized models for not solving them. As Opus and Lawsen pointed out, "Models receive zero scores not for reasoning failures, but for correctly recognizing unsolvable problems."
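That solvability claim can be checked mechanically. The sketch below brute-forces the reachable state space under one common formalization of the actor/agent rules (the safety constraint is enforced on both banks and in the boat); it is an illustration of the underlying mathematics, not Apple's or the rebuttal's evaluation harness, and the function names are invented for this example:

```python
# Brute-force check: is the River Crossing instance with n actor/agent pairs
# and a k-seat boat solvable at all? (A sketch under one common reading of the rules.)
from collections import deque
from itertools import combinations

def is_safe(group):
    """No actor may be with another pair's agent unless their own agent is present."""
    actors = {i for kind, i in group if kind == "actor"}
    agents = {i for kind, i in group if kind == "agent"}
    return all(i in agents or not (agents - {i}) for i in actors)

def solvable(n_pairs, capacity):
    people = frozenset((kind, i) for kind in ("actor", "agent") for i in range(n_pairs))
    start = (people, 0)               # (everyone on the left bank, boat on the left)
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                  # everyone has crossed: a solution exists
            return True
        bank = left if boat == 0 else people - left
        for size in range(1, capacity + 1):
            for movers in combinations(bank, size):
                movers = frozenset(movers)
                new_left = left - movers if boat == 0 else left | movers
                state = (new_left, 1 - boat)
                # The safety constraint must hold in the boat and on both banks.
                if state not in seen and all(
                    is_safe(g) for g in (movers, new_left, people - new_left)
                ):
                    seen.add(state)
                    queue.append(state)
    return False                      # no goal state anywhere in the reachable space

print(solvable(3, 2))   # True: the classic 3-pair, 2-seat instance is solvable
print(solvable(6, 3))   # False: the N=6, capacity-3 instances Apple scored have no solution
```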
The Proof in the Code
To counter Apple's claims directly, the rebuttal authors took an elegant alternative approach: instead of asking models to list every move, they tested whether models could generate recursive functions that produce the solution algorithmically (a sketch of such a function follows the results below). With the format constraint removed, they demonstrated that models could complete solutions in under 5,000 tokens. The results were striking:
Alternative Testing Results:
- Multiple models successfully generated Tower of Hanoi solutions for N=15 using alternative representations
- This was far beyond the complexity where Apple reported zero success
- When freed from token constraints, models demonstrated effective reasoning capabilities
- The rebuttal showed high accuracy across multiple models when evaluation constraints were relaxed
This simple change in methodology revealed that the models could indeed reason through complex problems. They were simply constrained by output limitations in Apple's testing framework.
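The kind of program that suffices here is tiny. The Python sketch below is a stand-in for the compact recursive functions described above, not the rebuttal's actual prompt or output; the point is that a complete, correct answer fits in roughly a dozen lines even though it encodes tens of thousands of moves:

```python
# A sketch of the compact, recursive style of answer the rebuttal elicited
# (illustrative only; not reproduced from either paper).
def hanoi(n, source, target, auxiliary, moves):
    """Append the optimal move sequence for n disks to `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, auxiliary, target, moves)   # clear the way for the largest disk
    moves.append((source, target))                   # move the largest remaining disk
    hanoi(n - 1, auxiliary, target, source, moves)   # restack the smaller disks on top

moves = []
hanoi(15, "A", "C", "B", moves)
print(len(moves))   # 32767 moves (2**15 - 1), produced by a program of about ten lines
```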
Broader Implications for AI Evaluation
This debate extends far beyond a technical disagreement about puzzle-solving. It highlights critical issues in how we evaluate AI capabilities:
1. The Challenge of Benchmark Design
Creating benchmarks that accurately measure AI reasoning is incredibly difficult. Researchers must carefully consider:
- Practical constraints vs. fundamental limitations
- The difference between computational reasoning and exhaustive enumeration
- How to account for different output modalities (natural language vs. code)
2. The Memorization vs. Reasoning Debate
While Apple's specific methodology was flawed, the paper raised valid questions about whether LLMs truly reason or merely pattern-match against their training data. This remains an open and nuanced question in AI research.
3. The Importance of Experimental Design
This controversy underscores how experimental design can dramatically influence conclusions about AI capabilities. Small details in how we structure tests can lead to vastly different interpretations of what AI can and cannot do.
What This Means for the Future of AI
The "Illusion of Thinking" debate offers several key takeaways for the AI community:
For Researchers
Benchmark design requires extreme care to avoid conflating practical constraints with fundamental limitations. Multi-faceted evaluation approaches that test capabilities through different modalities are essential.
For Practitioners
Understanding the practical constraints of AI systems is as important as understanding their theoretical capabilities. Real-world deployment often involves working around limitations rather than waiting for perfect solutions.
For the Industry
The rapid, public back-and-forth demonstrated here, including AI participation in academic discourse, suggests a new model for accelerated scientific debate and validation in the AI era.
Conclusion: Beyond the Illusion
The "Illusion of Thinking" controversy reveals that the question isn't simply whether AI can reason, but rather how we define and measure reasoning in artificial systems. Apple's research, despite its methodological flaws, contributes to an essential conversation about AI capabilities and limitations. The swift and thorough rebuttal demonstrates the value of rigorous peer review and the importance of questioning bold claims, even when they come from prestigious institutions.
Perhaps most intriguingly, this debate marks a milestone in AI development: an AI system (C. Opus from Anthropic) co-authored an academic response to research questioning AI reasoning capabilities. This meta-level development suggests we're entering an era where AI doesn't just perform tasks but actively participates in scientific discourse about its own nature and capabilities.
As we continue to push the boundaries of AI capabilities, this controversy reminds us that careful experimental design, thoughtful evaluation methods, and open scientific debate remain our best tools for understanding what artificial intelligence can truly achieve. The illusion, it seems, may lie not in AI's thinking but in our methods of observing it.
Further Reading
- "The Illusion of Thinking" - Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). Apple Machine Learning Research
- "The Illusion of the Illusion of Thinking: A Comment on Shojaee et al. (2025)" - Opus, C. (Anthropic) & Lawsen, A. (Open Philanthropy). arXiv:2506.09250v1