Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
Abstract
Large language models increasingly expose reasoning traces, yet their underlying cognitive structure remains difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, and Verify. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models that are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
Community
We extend a cognitive science-inspired episode annotation framework to an automatic, scalable, sentence-level representation that supports large-scale analysis of reasoning traces and conduct a systematic study of reasoning dynamics across a diverse set of LLMs. Moreover, we demonstrate the practical utility of episode-level representations through downstream case studies on correctness and efficiency, illustrating how reasoning dynamics can be analyzed beyond outcome-based metrics.
Key Findings:
When reasoning traces are analyzed at the episode level, a functional progression from abstract reasoning to concrete execution, and finally to evaluative control, consistently emerges. Episodes associated with analysis and exploration use more abstract, conceptual language and decrease steadily as reasoning progresses, while execution-oriented episodes dominate the middle of the trace through sustained concrete operations. In contrast, verification-related episodes are characterized by evaluative and meta-level language and increase toward the end of the reasoning process.
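This positional progression can be quantified by binning each sentence's normalized position within its trace and counting episode labels per bin. Below is a minimal sketch; the episode label strings and the toy trace are illustrative, not taken from the released annotations.

```python
from collections import Counter

def positional_profile(trace, n_bins=3):
    """Count episode labels per normalized-position bin.

    `trace` is an ordered list of per-sentence episode labels.
    Returns {bin_index: Counter(label -> count)}.
    """
    profile = {b: Counter() for b in range(n_bins)}
    n = len(trace)
    for i, label in enumerate(trace):
        # Map sentence index to a bin in [0, n_bins)
        b = min(int(i / n * n_bins), n_bins - 1)
        profile[b][label] += 1
    return profile

# Toy trace mirroring the reported pattern: analysis early,
# sustained implementation in the middle, verification late.
trace = ["Analysis", "Explore", "Implement", "Implement", "Implement", "Verify"]
profile = positional_profile(trace)
```

Aggregating such profiles over many traces yields the early-abstract, mid-concrete, late-evaluative shape described above.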
Comparing reasoning and non-reasoning models shows that the difference lies not merely in how many tokens they generate, but in how reasoning is structured. Non-reasoning models allocate most of their response trace to execution, with episode transitions largely following a feed-forward pattern toward implementation. In contrast, reasoning models distribute effort across analysis, exploration, execution, and verification, and exhibit frequent iterative Explore-Monitor/Verify loops.
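The feed-forward versus looping contrast can be made concrete by counting adjacent episode transitions in each annotated trace. A minimal sketch, with hypothetical label sequences standing in for real annotations:

```python
from collections import Counter

def transition_counts(trace):
    """Count adjacent (from_episode, to_episode) transitions in one trace."""
    return Counter(zip(trace, trace[1:]))

# Hypothetical traces: a non-reasoning model marching straight into
# implementation, and a reasoning model that loops back from
# verification into renewed exploration.
non_reasoning = ["Analysis", "Implement", "Implement", "Implement"]
reasoning = ["Analysis", "Explore", "Implement", "Verify", "Explore", "Verify"]

nr = transition_counts(non_reasoning)
r = transition_counts(reasoning)
```

Summing such counters over a corpus gives the transition matrices from which feed-forward and loop structure can be read off.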
Through our correctness-oriented case study, we find that exploration reflects uncertainty and serves as a critical branching point: correct solutions more often route exploration into monitoring or re-analysis, whereas incorrect solutions tend to continue execution or terminate prematurely after exploration.
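One way to test this branching claim is to condition on each Explore episode and tally which episode immediately follows it, separately for correct and incorrect solutions. A minimal sketch with illustrative label sequences (the "Monitor" label and the toy traces are assumptions for demonstration):

```python
from collections import Counter

def next_after(trace, episode="Explore"):
    """Tally the episode immediately following each occurrence of `episode`."""
    return Counter(b for a, b in zip(trace, trace[1:]) if a == episode)

# Toy traces: the correct one routes exploration into monitoring or
# re-analysis; the incorrect one pushes straight back into execution.
correct = ["Explore", "Monitor", "Explore", "Analysis", "Implement"]
incorrect = ["Explore", "Implement", "Explore", "Implement"]

routing_correct = next_after(correct)
routing_incorrect = next_after(incorrect)
```

Comparing these conditional distributions across outcome groups is what surfaces exploration as a branching point.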
Through our efficiency-oriented case study, we find that different efficient reasoning methods selectively suppress evaluation-oriented episodes and feedback loops, leading to varying degrees of divergence from the reasoning patterns of the base model. Episode-level analysis thus reveals which episodes can be removed to gain efficiency, beyond token-level pruning.
We open-sourced the annotated data at https://github.com/MingLiiii/ThinkARM:
- All responses generated by the models mentioned in our paper, together with their corresponding annotations: a corpus of 410,991 sentences across 150 responses, generated by 15 models solving 100 math problems.
- The human-annotated gold episode labels, covering 7,067 sentences.
- The ThinkARM code, which can be used directly to annotate episodes.
- Some of the analysis code used in our paper.