Teaching Thinking Models to Reason with Tools

A Full-Pipeline Recipe for Tool-Integrated Reasoning


Model Description

TRICE-30B is a tool-integrated reasoning model built on Qwen3-30B-A3B-Thinking-2507 and capable of Textual Reasoning Interleaved with Code Execution. Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when strong thinking models make almost no tool calls. To resolve this instability, we move beyond scattered techniques and propose a systematic, full-pipeline recipe spanning data preparation, SFT, the transition from SFT to RL, and RL itself. The goal is to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability. The resulting TRICE-30B achieves state-of-the-art TIR performance, surpassing both existing TIR methods and frontier open-source reasoning models at the same or even larger parameter scales.

Key Highlights

  • State-of-the-art ~30B TIR performance. With tools, TRICE-30B reaches 99.2% on AIME 2025, 92.5% on HMMT 2025, 82.5% on BeyondAIME, and an 81.9% average across five competition-level math benchmarks, a +14.8% average improvement over the base Qwen3-30B-A3B-Thinking-2507.
  • No-tool reasoning is preserved. The TIR capability is injected without degrading the intrinsic reasoning mode: TRICE-30B retains or even improves its text-only reasoning ability on most benchmarks.
  • Strong on the hardest problems. On APEX 2025, a collection of national and international Olympiad problems where most open-source models score near zero, TRICE-30B reaches 16.7%.
  • Surpasses much larger text-only models. TRICE-30B with tools outperforms Qwen3-235B-A22B-Thinking and DeepSeek-V3.2-Thinking in text-only mode on HMMT 2025, BeyondAIME, and IMOAnswerBench.
  • Cross-domain transfer. Although trained only on math data, the learned interleaved reasoning pattern transfers to different domains, with gains of up to +11.7% on FrontierScience, GPQA-Diamond, and LiveCodeBench.

Performance

Unless otherwise noted, all models are evaluated under our unified protocol on five competition-level benchmarks: AIME 2025, HMMT 2025, BeyondAIME, IMOAnswerBench, and APEX 2025. Every question is repeated 8 times to ensure reproducibility. We use a consistent configuration of 80K maximum rollout length and up to 128 tool calls in a stateful sandbox.
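The protocol above can be sketched as a small scoring helper. This is a minimal illustration, not the authors' evaluation code: it assumes each benchmark score is the per-question accuracy averaged over the 8 runs and then averaged over questions, which the model card does not state explicitly. The function name `benchmark_accuracy` and the result layout are hypothetical.

```python
from statistics import mean

# Values taken from the protocol described above.
MAX_ROLLOUT_TOKENS = 80_000
MAX_TOOL_CALLS = 128
NUM_REPEATS = 8


def benchmark_accuracy(results):
    """Return accuracy in percent for one benchmark.

    `results` maps a question id to a list of per-run correctness flags.
    Assumption: runs are averaged per question, then questions are averaged.
    """
    per_question = []
    for qid, flags in results.items():
        if len(flags) != NUM_REPEATS:
            raise ValueError(f"{qid}: expected {NUM_REPEATS} runs, got {len(flags)}")
        per_question.append(mean(1.0 if flag else 0.0 for flag in flags))
    return 100.0 * mean(per_question)


# Toy data: one question solved in all 8 runs, one solved in 4 of 8.
toy = {"q1": [True] * 8, "q2": [True] * 4 + [False] * 4}
print(benchmark_accuracy(toy))  # 75.0
```

Averaging over repeated runs reduces the variance that a single sampled rollout would introduce, which matters on small benchmarks where one question moves the score by several points.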

[Figure: Performance]

Generalization

[Figure: Generalization]

Usage

TRICE-30B supports both text-only and tool-integrated inference. For multi-turn tool-integrated reasoning, we recommend deploying via SGLang and pairing it with a stateful Python sandbox for code execution.
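As a rough illustration of what multi-turn tool-integrated inference involves, the sketch below pairs a toy stateful sandbox with a generate-execute loop. Everything here is an assumption for illustration: the `[PY]...[/PY]` and `[OUT]...[/OUT]` delimiters, the `StatefulSandbox` class, and the `run_episode` helper are hypothetical and do not reflect the model's actual tool-call format or chat template. The `generate` callable is a stub; in practice it would wrap a request to the deployed SGLang server.

```python
import contextlib
import io
import re
import traceback

MAX_TOOL_CALLS = 128  # tool-call budget from the evaluation protocol

# Hypothetical delimiter for code the model wants executed.
CODE_BLOCK = re.compile(r"\[PY\](.*?)\[/PY\]", re.DOTALL)


class StatefulSandbox:
    """Toy in-process sandbox whose variables persist across calls.

    It provides no isolation; a real deployment should execute code in a
    separate, restricted process or container.
    """

    def __init__(self):
        self.namespace = {}

    def run(self, code):
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            # Return the error text so the model can react to it.
            buffer.write(traceback.format_exc(limit=1))
        return buffer.getvalue()


def run_episode(generate, question, sandbox, max_tool_calls=MAX_TOOL_CALLS):
    """Alternate between model turns and code execution until no code is emitted."""
    transcript = question
    for _ in range(max_tool_calls):
        reply = generate(transcript)
        transcript += reply
        match = CODE_BLOCK.search(reply)
        if match is None:
            return transcript  # no tool call: treat the reply as final
        output = sandbox.run(match.group(1))
        transcript += f"\n[OUT]\n{output}[/OUT]\n"
    return transcript


def fake_model(transcript):
    """Stub standing in for the served model: two tool calls, then an answer."""
    calls_so_far = transcript.count("[OUT]")
    if calls_so_far == 0:
        return "[PY]x = 6 * 7[/PY]"
    if calls_so_far == 1:
        return "[PY]print(x)[/PY]"
    return "The answer is 42."


result = run_episode(fake_model, "What is 6 * 7?\n", StatefulSandbox())
print(result)
```

The second tool call reads `x`, which was defined in the first. That only works because the sandbox keeps its namespace between calls, which is the property a stateful sandbox provides.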

Citation

If you find this model or our recipe useful, please cite:

@misc{cheng2026teachingthinkingmodelsreason,
      title={Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning}, 
      author={Qianjia Cheng and Yuchen Zhang and Zhilin Wang and Yuxin Zuo and Shunkai Zhang and Yuchen Fan and Yu Qiao and Bowen Zhou and Ning Ding and Yu Cheng and Yun Luo and Ganqu Cui},
      year={2026},
      eprint={2605.06326},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.06326}, 
}

Acknowledgements

TRICE-30B is built on top of Qwen3-30B-A3B-Thinking-2507. Training is conducted with the Slime framework. We thank the open-source community for the models, tools, benchmarks, and infrastructure that made this work possible.

Model size: 31B params (Safetensors). Tensor type: BF16.