TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow
Abstract
The high cost of agentic workflows in formal mathematics hinders large-scale data synthesis, exacerbating the scarcity of open-source corpora. To address this, we introduce TheoremForge, a cost-effective formal data synthesis pipeline that decomposes the formalization process into five sub-tasks: statement formalization, proof generation, premise selection, proof correction, and proof sketching. By implementing a Decoupled Extraction Strategy, the workflow recovers valid training signals from globally failed trajectories, effectively reclaiming otherwise wasted computation. Experiments on a 2,000-problem benchmark demonstrate that TheoremForge achieves a Verified Rate of 12.6%, surpassing the 8.6% baseline, at an average cost of only $0.481 per successful trajectory using Gemini-3-Flash. Crucially, our strategy increases data yield for proof generation by 1.6× compared to standard filtering. These results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models. Our code is available at https://github.com/timechess/TheoremForge.
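To make the Decoupled Extraction Strategy concrete, the sketch below shows one way such extraction could work: each sub-task result in a trajectory carries its own local verification flag, and locally verified outputs are kept as (prompt, output) training pairs even when the end-to-end formalization failed. This is a minimal illustration only; all type and field names are hypothetical, since the abstract does not specify TheoremForge's actual data structures.

```python
from dataclasses import dataclass

# Hypothetical record types for illustration; the real TheoremForge
# pipeline may represent trajectories differently.
@dataclass
class SubTaskRecord:
    task: str          # e.g. "statement_formalization", "proof_generation"
    prompt: str        # input given to the model for this sub-task
    output: str        # model output for this sub-task
    verified: bool     # did this output pass its local check (e.g. the Lean compiler)?

@dataclass
class Trajectory:
    records: list[SubTaskRecord]
    globally_verified: bool  # did the full formalization succeed end-to-end?

def extract_training_pairs(
    trajectories: list[Trajectory],
) -> dict[str, list[tuple[str, str]]]:
    """Decoupled extraction: keep every locally verified sub-task output,
    even when the trajectory as a whole failed."""
    pairs: dict[str, list[tuple[str, str]]] = {}
    for traj in trajectories:
        for rec in traj.records:
            # Standard filtering would additionally require
            # traj.globally_verified, discarding all records from failed
            # runs; decoupling keeps the locally verified signals.
            if rec.verified:
                pairs.setdefault(rec.task, []).append((rec.prompt, rec.output))
    return pairs
```

Under standard trajectory-level filtering, the check above would also require `traj.globally_verified`, so every record from a failed run would be discarded; relaxing that requirement to per-sub-task verification is, plausibly, what produces the reported 1.6× increase in proof-generation training data.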