Instructions for using CajZella/TRICE-4B with libraries and local apps.

Libraries

How to use CajZella/TRICE-4B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="CajZella/TRICE-4B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so the snippet also produces output when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("CajZella/TRICE-4B")
model = AutoModelForCausalLM.from_pretrained("CajZella/TRICE-4B", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
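
Because TRICE-4B is built on Qwen3-4B-Thinking-2507, the decoded text will likely contain a reasoning trace before the final answer. The following is a minimal sketch for separating the two. It assumes the trace ends with a `</think>` marker, as in the base model's output format; this card does not confirm that for TRICE-4B. `split_thinking` is a hypothetical helper, not part of Transformers.

```python
def split_thinking(text: str, end_marker: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (reasoning, answer).

    Assumption: the reasoning trace ends at `end_marker`. If the marker is
    absent (for example, generation was cut off), the whole text is treated
    as reasoning and the answer is empty.
    """
    head, sep, tail = text.rpartition(end_marker)
    if not sep:
        return text.strip(), ""
    # The opening tag may or may not be present in the generated text.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


decoded = "<think>\nThe user asks who I am.\n</think>\n\nI am an assistant."
reasoning, answer = split_thinking(decoded)
print(answer)
```

With a small budget such as `max_new_tokens=40`, a thinking model may not reach the end marker, in which case the helper returns an empty answer; raising the budget is the likely remedy.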

Local Apps

vLLM

How to use CajZella/TRICE-4B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "CajZella/TRICE-4B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CajZella/TRICE-4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
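
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl command above; `build_chat_request` and `chat` are hypothetical helper names, and the response is assumed to follow the OpenAI chat-completions schema that the server advertises.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",
                       model: str = "CajZella/TRICE-4B") -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant message text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible servers place the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(chat("What is the capital of France?"))
```

For a server listening on another port, such as the SGLang commands in this card that use port 30000, pass `base_url="http://localhost:30000"`.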

Use Docker

docker model run hf.co/CajZella/TRICE-4B

SGLang

How to use CajZella/TRICE-4B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "CajZella/TRICE-4B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CajZella/TRICE-4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "CajZella/TRICE-4B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CajZella/TRICE-4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use CajZella/TRICE-4B with Docker Model Runner:
```
docker model run hf.co/CajZella/TRICE-4B
```

Teaching Thinking Models to Reason with Tools

A Full-Pipeline Recipe for Tool-Integrated Reasoning

Model Description

TRICE-4B is a tool-integrated reasoning model built on Qwen3-4B-Thinking-2507, capable of Textual Reasoning Interleaved with Code Execution. Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when strong thinking models make almost no actual tool calls. To resolve this instability, we move beyond scattered techniques and propose a systematic, full-pipeline recipe spanning data preparation, supervised fine-tuning (SFT), the transition from SFT to reinforcement learning (RL), and RL itself. The goal is to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability. The resulting TRICE-4B achieves state-of-the-art TIR performance, surpassing both existing TIR methods and frontier open-source reasoning models at the same or even larger parameter scales.

Key Highlights

State-of-the-art <10B TIR performance. With tools, TRICE-4B reaches 96.7% on AIME 2025, 86.7% on HMMT 2025, 71.3% on BeyondAIME, and a 72.2% average across five competition-level math benchmarks, a +14.0% average improvement over the base Qwen3-4B-Thinking-2507.
No-tool reasoning is preserved. The TIR capability is injected without degrading the intrinsic reasoning mode: TRICE-4B retains or even improves its text-only reasoning ability on most benchmarks.
Cross-domain transfer. Although trained only on math data, the learned interleaved reasoning pattern transfers to other domains, with gains of up to +14.5% on FrontierScience, GPQA-Diamond, and LiveCodeBench.

Performance

We use a consistent configuration of 80K maximum rollout length and up to 128 tool calls in a stateful sandbox.

Generalization

Usage

TRICE-4B supports both text-only and tool-integrated inference. For multi-turn tool-integrated reasoning, we recommend deploying via SGLang and pairing it with a stateful Python sandbox for code execution.
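
The interaction between the served model and a stateful sandbox can be sketched as a simple loop. This is an illustration under stated assumptions, not the authors' implementation: the `<code>...</code>` delimiter, the `StatefulSandbox` class, the `tool_loop` function, and the choice to feed execution output back as a user turn are all hypothetical, since this card does not specify the tool-call format. The default `max_tool_calls=128` mirrors the evaluation configuration stated under Performance. `generate` is any callable that maps a message list to the model's reply.

```python
import contextlib
import io
import re
from typing import Callable

# Hypothetical delimiter for code the model wants executed; the real
# tool-call format of TRICE-4B is not specified in this card.
CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


class StatefulSandbox:
    """Toy stateful executor: variables persist across calls.

    Runs exec() in-process, so it provides no isolation. Illustration only.
    """

    def __init__(self) -> None:
        self.namespace: dict = {}

    def run(self, code: str) -> str:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(code, self.namespace)
            except Exception as exc:
                # Report the error as output so the model can react to it.
                print(f"{type(exc).__name__}: {exc}")
        return buffer.getvalue()


def tool_loop(generate: Callable[[list], str], question: str,
              max_tool_calls: int = 128) -> str:
    """Alternate model turns and code execution until no code is requested."""
    sandbox = StatefulSandbox()
    messages = [{"role": "user", "content": question}]
    reply = ""
    for _ in range(max_tool_calls):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        match = CODE_PATTERN.search(reply)
        if match is None:
            return reply  # no tool call requested: treat as the final answer
        output = sandbox.run(match.group(1))
        # Assumption: execution results are fed back as a user turn.
        messages.append({"role": "user", "content": output})
    return reply  # tool-call budget exhausted: return the last model reply
```

In practice `generate` could wrap a request to an OpenAI-compatible chat endpoint such as the SGLang server, and the in-process `exec` should be replaced with an isolated sandbox before running untrusted model-written code.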

Citation

If you find this model or our recipe useful, please cite:

@misc{cheng2026teachingthinkingmodelsreason,
      title={Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning}, 
      author={Qianjia Cheng and Yuchen Zhang and Zhilin Wang and Yuxin Zuo and Shunkai Zhang and Yuchen Fan and Yu Qiao and Bowen Zhou and Ning Ding and Yu Cheng and Yun Luo and Ganqu Cui},
      year={2026},
      eprint={2605.06326},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.06326}, 
}

Acknowledgements

TRICE-4B is built on top of Qwen3-4B-Thinking-2507. Training is conducted with the Slime framework. We thank the open-source community for the models, tools, benchmarks, and infrastructure that made this work possible.