Scott Jack

Apple researchers: LLMs have fundamental limitations on reasoning ability

Link: GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models (PDF)

GSM-Symbolic is a new benchmark that the Apple research team is introducing. Rather than using a static set of problems to gauge the mathematical reasoning ability of LLMs, it uses templates to create "diverse question variants", which should increase the reliability of test results. Their early testing shows that state-of-the-art LLMs are not nearly as capable as testing with the static GSM8K questions has suggested.
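The template idea can be sketched in a few lines. This is an invented toy example, not one of the paper's actual GSM-Symbolic templates: one GSM8K-style problem becomes a template whose names and numbers are resampled to yield many variants of the same underlying problem, each with a known ground-truth answer.

```python
import random

# Hypothetical template in the spirit of GSM-Symbolic (not from the paper):
# placeholders for the proper name and the two operands get resampled.
TEMPLATE = (
    "{name} picked {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples did {name} pick in total?"
)

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template; return the question and its ground-truth answer."""
    name = rng.choice(["Sophie", "Liam", "Maya", "Omar"])
    x, y = rng.randint(2, 50), rng.randint(2, 50)
    return TEMPLATE.format(name=name, x=x, y=y), x + y

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = make_variant(rng)
        print(f"{question}  [answer: {answer}]")
```

Because every variant carries its own computed answer, a model's accuracy can be measured across many surface forms of the same problem rather than against one memorizable wording.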

Despite the demonstration of "remarkable capabilities",

the question of whether current LLMs are genuinely capable of true logical reasoning remains an important research focus. While some studies highlight impressive capabilities, a closer examination reveals substantial limitations. Literature suggests that the reasoning process in LLMs is probabilistic pattern-matching rather than formal reasoning (Jiang et al., 2024).

The Apple team introduces a benchmark (GSM-Symbolic) and a dataset (GSM-NoOp) that demonstrate that LLMs "are very sensitive to changes in numerical values" and struggle to distinguish between relevant and irrelevant information:

By adding seemingly relevant but ultimately irrelevant information to problems, we demonstrate substantial performance drops (up to 65%) across all state-of-the-art models (Sec. 4.4). This reveals a critical flaw in the models’ ability to discern relevant information for problem-solving, likely because their reasoning is not formal in the common sense term and is mostly based on pattern matching.
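The distractor idea is easy to see with a made-up example. The problem text, numbers, and `add_noop` helper below are invented for illustration, not taken from GSM-NoOp itself: the appended clause mentions a number, but the correct answer is unchanged.

```python
def add_noop(question: str, distractor: str) -> str:
    """Insert a seemingly relevant but irrelevant clause before the final question."""
    stem, sep, final = question.rpartition(". ")
    return f"{stem}{sep}{distractor} {final}"

base = ("Avery picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Avery have?")

# The size remark changes nothing: the answer is 102 either way.
# A pattern-matcher tempted by "five" might wrongly subtract it.
noop = add_noop(base, "Five of them were a bit smaller than average.")
print(noop)
```

The paper's finding is that clauses like this, which a formal reasoner would simply ignore, cause large accuracy drops across state-of-the-art models.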

They conclude:

We believe further research is essential to develop AI models capable of formal reasoning, moving beyond pattern recognition to achieve more robust and generalizable problem-solving skills. This remains a critical challenge for the field as we strive to create systems with human-like cognitive abilities or general intelligence.

In other words, LLMs are likely limited to probabilistic pattern matching, making them a dead-end on the tech tree. A different approach will be needed to achieve artificial general intelligence (AGI).

#apple #machine learning