## Overview
ToolMind-web-3B is a specialized lightweight agent built on top of the Nanbeige4-3B-Thinking-2511 foundation model. After extensive supervised fine-tuning (SFT) and reinforcement learning (RL) focused on search behaviors, the model attains leading performance among small-scale models on multiple long-horizon benchmarks, including Xbench-Deepsearch, HLE, and GAIA, and can reliably execute up to hundreds of consecutive tool invocations.
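The long-horizon behavior described above — many consecutive tool calls before a final answer — can be sketched as a simple agent loop. This is a minimal illustration with mock stand-ins for the model and tools; the function names (`call_model`, `run_tool`, `agent_loop`) and message format are hypothetical, not the actual ToolMind-web-3B interface.

```python
# Minimal sketch of a long-horizon search-agent loop.
# `call_model` and `run_tool` are mock stand-ins for demonstration only.

def call_model(history):
    """Stand-in for the model: returns either a tool call or a final answer.
    Here we script two tool calls followed by an answer."""
    n_tool_calls = sum(1 for turn in history if turn["role"] == "tool")
    if n_tool_calls == 0:
        return {"type": "tool_call", "tool": "search", "args": {"query": "example"}}
    if n_tool_calls == 1:
        return {"type": "tool_call", "tool": "open_page", "args": {"url": "https://example.org"}}
    return {"type": "answer", "text": "final answer"}

def run_tool(name, args):
    """Stand-in tool executor; a real agent would hit a search/browse backend."""
    return f"result of {name}({args})"

def agent_loop(question, max_turns=200):
    """Iterate model -> tool -> observation until the model emits an answer.
    A generous max_turns budget allows hundreds of consecutive tool calls."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        step = call_model(history)
        if step["type"] == "answer":
            return step["text"], history
        observation = run_tool(step["tool"], step["args"])
        history.append({"role": "tool", "content": observation})
    return None, history  # turn budget exhausted without an answer
```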
## Highlights
- **Strong Performance at Compact Scale**
ToolMind-web-3B delivers high-quality long-horizon reasoning and tool-augmented search while maintaining a lightweight 3B-parameter footprint. Despite its compact size, it achieves competitive results across multiple benchmarks, including Xbench-Deepsearch, GAIA, and HLE. The model is evaluated under the MiroThinker workflow, ensuring standardized and reproducible assessment.
- **An Open-Source Complex QA Dataset Synthesized from Wikipedia Entity–Relation Knowledge Graphs**
We provide a rich, structured QA dataset derived from Wikipedia entity–relation knowledge graphs, designed to support supervised fine-tuning and reinforcement learning of search-augmented agents.
- **Turn-Level-Judge-Guided SFT and Reinforcement Learning**
In the supervised fine-tuning (SFT) stage, a turn-level judge selects which interaction turns are used for training. In the reinforcement learning (RL) stage, a turn-level reward provides feedback that refines the model's multi-turn search and tool-invocation behavior.
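The turn-level judging idea above can be sketched as follows. The judge here is a trivial heuristic stand-in, and all function names (`judge_turn`, `select_sft_turns`, `turn_level_rewards`) and trajectory fields are illustrative assumptions; the actual judge model and reward shaping have not yet been released.

```python
# Sketch of turn-level judging: filter turns for SFT, score turns for RL.
# The scoring rule is a placeholder heuristic, not the released judge.

def judge_turn(turn):
    """Return a score in [0, 1] for one interaction turn (illustrative heuristic)."""
    if turn.get("tool_error"):
        return 0.0
    return 1.0 if turn.get("useful_evidence") else 0.3

def select_sft_turns(trajectory, threshold=0.5):
    """SFT stage: keep only the turns the judge approves for training."""
    return [t for t in trajectory if judge_turn(t) >= threshold]

def turn_level_rewards(trajectory):
    """RL stage: per-turn feedback instead of a single trajectory-level reward."""
    return [judge_turn(t) for t in trajectory]

# Toy trajectory: one productive search, one failed tool call, one productive search.
trajectory = [
    {"action": "search", "useful_evidence": True},
    {"action": "open_page", "tool_error": True},
    {"action": "search", "useful_evidence": True},
]
```

Per-turn rewards give the policy denser credit assignment than a single end-of-trajectory score, which matters when trajectories span hundreds of tool calls.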
Note: Training data, the full technical report, and evaluation details will be released soon.
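The knowledge-graph QA synthesis described in the highlights can be sketched with toy triples. The triples, templates, and `synthesize_two_hop` function below are illustrative assumptions, not the recipe used for the released dataset.

```python
# Sketch of multi-hop QA synthesis from an entity-relation knowledge graph.
# Toy (head, relation, tail) triples stand in for facts extracted from Wikipedia.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
]

def synthesize_two_hop(triples):
    """Chain two triples that share a bridge entity into one multi-hop question."""
    questions = []
    for h1, r1, t1 in triples:
        for h2, r2, t2 in triples:
            if t1 == h2:  # tail of fact 1 is the head of fact 2: a bridge entity
                q = (f"Consider the entity that '{h1}' has relation '{r1}' to: "
                     f"what does that entity have relation '{r2}' to?")
                questions.append({"question": q, "answer": t2, "bridge": t1})
    return questions
```

Because the bridge entity is never named in the question, answering it requires at least two search hops, which is what makes such synthetic data useful for training search-augmented agents.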
## Benchmark Results
| Model | GAIA | BrowseComp | BrowseComp-zh | HLE | Seal-0 | Xbench-Deepsearch | Xbench-Deepsearch-10 | DSQA |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-V3.2 | 0.635 | 0.676 | 0.650 | 0.408 | 0.385 | 0.710 | / | / |
| MiniMax-M2 | 0.757 | 0.440 | 0.485 | 0.318 | / | 0.720 | / | / |
| GLM-4.6 | 0.719 | 0.451 | 0.495 | 0.304 | / | 0.700 | / | / |
| MiroThinker 8B | 0.664 | 0.311 | 0.402 | 0.215 | 0.404 | 0.606 | / | / |
| AgentCPM-Explore 4B | 0.639 | 0.250 | 0.290 | 0.191 | 0.400 | 0.700 | / | / |
| **Ours** | | | | | | | | |
| ToolMind-web-3B (w/ synthetic QA only) | 0.583 | 0.144 | 0.301 | 0.224 | 0.360 | 0.760 | 0.300 | 0.308 |
| ToolMind-web-3B | 0.670 | 0.174 | 0.308 | 0.248 | 0.477 | 0.751 | 0.370 | 0.458 |

Entries marked "/" are not reported.
## Limitations
Although we place great emphasis on model safety during training and strive to ensure that outputs align with ethical and legal requirements, the model's small size and probabilistic nature mean it cannot completely avoid generating unexpected outputs. These outputs may include harmful content such as bias or discrimination. Please do not propagate such content. We assume no responsibility for consequences resulting from the dissemination of inappropriate information.
## Citation
If you find our model useful or use it in your projects, please cite this project.
## Contact
If you have any questions, please raise an issue or contact us at nanbeige@126.com.