Teaching Thinking Models to Reason with Tools

A Full-Pipeline Recipe for Tool-Integrated Reasoning


Model Description

TRICE-4B is a tool-integrated reasoning model built on Qwen3-4B-Thinking-2507, capable of Textual Reasoning Interleaved with Code Execution. Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when a strong thinking model makes almost no actual tool calls. To resolve this instability, we move beyond scattered techniques and propose a systematic, full-pipeline recipe spanning data preparation, SFT, the transition from SFT to RL, and RL itself. The goal is to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability. The resulting TRICE-4B achieves state-of-the-art TIR performance, surpassing both existing TIR methods and frontier open-source reasoning models at the same or even larger parameter scales.

Key Highlights

  • State-of-the-art TIR performance among models under 10B parameters. With tools, TRICE-4B reaches 96.7% on AIME 2025, 86.7% on HMMT 2025, 71.3% on BeyondAIME, and a 72.2% average across five competition-level math benchmarks, an average improvement of +14.0% over the base Qwen3-4B-Thinking-2507.
  • No-tool reasoning is preserved. The TIR capability is injected without degrading the model's intrinsic reasoning mode: TRICE-4B retains or even improves its text-only reasoning ability on most benchmarks.
  • Cross-domain transfer. Although trained only on math data, the learned interleaved reasoning pattern transfers to different domains, with gains of up to +14.5% on FrontierScience, GPQA-Diamond, and LiveCodeBench.

Performance

We use a consistent configuration of 80K maximum rollout length and up to 128 tool calls in a stateful sandbox.
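The two limits above can be pictured as a loop that alternates model output and tool output until the model stops or a budget runs out. The following is a minimal sketch under stated assumptions, not the evaluation harness used for these results: "80K" is read as 80,000 tokens, and `RolloutBudget`, `run_rollout`, `step`, `run_tool`, and `count_tokens` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class RolloutBudget:
    # Limits from the evaluation setup; reading "80K" as 80,000 tokens is an assumption.
    max_rollout_tokens: int = 80_000
    max_tool_calls: int = 128


def run_rollout(
    step: Callable[[str], Tuple[str, Optional[str]]],
    run_tool: Callable[[str], str],
    count_tokens: Callable[[str], int],
    prompt: str,
    budget: RolloutBudget = RolloutBudget(),
) -> Tuple[str, int]:
    """Alternate model steps and tool calls until the model stops or a budget is hit.

    `step` (hypothetical) returns (generated_text, code_or_None); None means the
    model requested no tool call. Returns the transcript and the tool-call count.
    """
    transcript, tool_calls = prompt, 0
    while count_tokens(transcript) < budget.max_rollout_tokens:
        text, code = step(transcript)
        transcript += text
        # Stop when the model is done or the tool-call budget is exhausted.
        if code is None or tool_calls >= budget.max_tool_calls:
            break
        transcript += run_tool(code)
        tool_calls += 1
    return transcript, tool_calls
```

Passing the model step, the tool runner, and the token counter as callables keeps the budget logic independent of any particular serving stack or tokenizer.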

[Figure: Performance]

Generalization

[Figure: Generalization]

Usage

TRICE-4B supports both text-only and tool-integrated inference. For multi-turn tool-integrated reasoning, we recommend deploying via SGLang and pairing it with a stateful Python sandbox for code execution.
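To illustrate what "stateful" means for the sandbox, here is a minimal sketch in which variables defined by one code execution remain visible to the next. `StatefulSandbox` is a hypothetical helper written for this example, not the sandbox used to train or evaluate TRICE-4B. It runs code in-process with `exec`, which provides no isolation, so a real deployment should execute model-written code in a separate, restricted process or container. The model-serving side (SGLang) is deliberately left out.

```python
import contextlib
import io
import traceback


class StatefulSandbox:
    """Toy stateful executor: names defined in one call persist into later calls.

    NOTE: in-process exec() is NOT isolated. This only illustrates statefulness.
    """

    def __init__(self) -> None:
        # One namespace shared by every run() call on this instance.
        self._namespace = {"__name__": "__sandbox__"}

    def run(self, code: str) -> str:
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self._namespace)
        except Exception:
            # Report the final traceback line so the caller can see the error.
            buffer.write(traceback.format_exc().strip().splitlines()[-1])
        return buffer.getvalue()


sandbox = StatefulSandbox()
sandbox.run("import math\nx = math.factorial(5)")
print(sandbox.run("print(x + 1)"))  # x from the first call is still defined
```

Returning errors as text rather than raising them lets the failure be fed back into the conversation as ordinary tool output.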

Citation

If you find this model or our recipe useful, please cite:

@misc{cheng2026teachingthinkingmodelsreason,
      title={Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning}, 
      author={Qianjia Cheng and Yuchen Zhang and Zhilin Wang and Yuxin Zuo and Shunkai Zhang and Yuchen Fan and Yu Qiao and Bowen Zhou and Ning Ding and Yu Cheng and Yun Luo and Ganqu Cui},
      year={2026},
      eprint={2605.06326},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.06326}, 
}

Acknowledgements

TRICE-4B is built on top of Qwen3-4B-Thinking-2507. Training is conducted with the Slime framework. We thank the open-source community for the models, tools, benchmarks, and infrastructure that made this work possible.

Model size: 4B params. Tensor type: BF16 (Safetensors).