Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
Abstract
Large language models increasingly expose reasoning traces, yet their underlying cognitive structure remains difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, and Verify. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models that are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
Community
We extend a cognitive science-inspired episode annotation framework to an automatic, scalable, sentence-level representation that supports large-scale analysis of reasoning traces and conduct a systematic study of reasoning dynamics across a diverse set of LLMs. Moreover, we demonstrate the practical utility of episode-level representations through downstream case studies on correctness and efficiency, illustrating how reasoning dynamics can be analyzed beyond outcome-based metrics.
Key Findings:
When reasoning traces are analyzed at the episode level, a functional progression from abstract reasoning to concrete execution, and finally to evaluative control, consistently emerges. Episodes associated with analysis and exploration use more abstract, conceptual language and decrease steadily as reasoning progresses, while execution-oriented episodes dominate the middle of the trace through sustained concrete operations. In contrast, verification-related episodes are characterized by evaluative and meta-level language and increase toward the end of the reasoning process.
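This positional progression can be quantified by binning each sentence's normalized position within its trace and counting episode labels per bin. Below is a minimal sketch; the episode label strings and the toy trace are illustrative, not taken from the released annotations.

```python
from collections import Counter

def positional_profile(trace, n_bins=3):
    """Count episode labels per normalized-position bin.

    `trace` is an ordered list of per-sentence episode labels.
    Returns {bin_index: Counter(label -> count)}.
    """
    profile = {b: Counter() for b in range(n_bins)}
    n = len(trace)
    for i, label in enumerate(trace):
        # Map sentence index to a bin in [0, n_bins)
        b = min(int(i / n * n_bins), n_bins - 1)
        profile[b][label] += 1
    return profile

# Toy trace mirroring the reported pattern: analysis early,
# sustained implementation in the middle, verification late.
trace = ["Analysis", "Explore", "Implement", "Implement", "Implement", "Verify"]
profile = positional_profile(trace)
```

Aggregating such profiles over many traces yields the early-abstract, mid-concrete, late-evaluative shape described above.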
Comparing reasoning and non-reasoning models shows that the difference lies not merely in how many tokens they generate, but in how reasoning is structured. Non-reasoning models allocate most of their response trace to execution, with episode transitions largely following a feed-forward pattern toward implementation. In contrast, reasoning models distribute effort across analysis, exploration, execution, and verification, and exhibit frequent iterative Explore-Monitor/Verify loops.
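The feed-forward versus looping contrast can be made concrete by counting adjacent episode transitions in each annotated trace. A minimal sketch, with hypothetical label sequences standing in for real annotations:

```python
from collections import Counter

def transition_counts(trace):
    """Count adjacent (from_episode, to_episode) transitions in one trace."""
    return Counter(zip(trace, trace[1:]))

# Hypothetical traces: a non-reasoning model marching straight into
# implementation, and a reasoning model that loops back from
# verification into renewed exploration.
non_reasoning = ["Analysis", "Implement", "Implement", "Implement"]
reasoning = ["Analysis", "Explore", "Implement", "Verify", "Explore", "Verify"]

nr = transition_counts(non_reasoning)
r = transition_counts(reasoning)
```

Summing such counters over a corpus gives the transition matrices from which feed-forward and loop structure can be read off.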
Through our correctness-oriented case study, we find that exploration reflects uncertainty and serves as a critical branching point: correct solutions more often route exploration into monitoring or re-analysis, whereas incorrect solutions tend to continue execution or terminate prematurely after exploration.
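One way to test this branching claim is to condition on each Explore episode and tally which episode immediately follows it, separately for correct and incorrect solutions. A minimal sketch with illustrative label sequences (the "Monitor" label and the toy traces are assumptions for demonstration):

```python
from collections import Counter

def next_after(trace, episode="Explore"):
    """Tally the episode immediately following each occurrence of `episode`."""
    return Counter(b for a, b in zip(trace, trace[1:]) if a == episode)

# Toy traces: the correct one routes exploration into monitoring or
# re-analysis; the incorrect one pushes straight back into execution.
correct = ["Explore", "Monitor", "Explore", "Analysis", "Implement"]
incorrect = ["Explore", "Implement", "Explore", "Implement"]

routing_correct = next_after(correct)
routing_incorrect = next_after(incorrect)
```

Comparing these conditional distributions across outcome groups is what surfaces exploration as a branching point.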
Through our efficiency-oriented case study, we find that different efficient reasoning methods selectively suppress evaluation-oriented episodes and feedback loops, leading to varying degrees of divergence from the reasoning patterns of the base model. Episode-level analysis thus reveals which episodes can be removed to gain efficiency, beyond token-level pruning.
We open-sourced the annotated data at https://github.com/MingLiiii/ThinkARM:
- All responses generated by the models mentioned in our paper, together with their corresponding annotations: a corpus of 410,991 sentences across 150 responses, generated by 15 models solving 100 math problems.
- The human-annotated gold episode labels, covering 7,067 sentences.
- The ThinkARM code, which can be used directly to annotate episodes.
- Some of the analysis code used in our paper.