---
language:
- en
- zh
license: apache-2.0
tags:
- search-agent
- rag
- deep-research
- 3b
- nlp
pipeline_tag: text-generation
---

# Overview

ToolMind-Web-3B is a lightweight, search-specialized agent built on the [**Nanbeige4-3B-Thinking-2511**](https://huggingface.co/Nanbeige/Nanbeige4-3B-Thinking-2511) foundation model. After extensive **SFT (supervised fine-tuning)** and **RL (reinforcement learning)** focused on search behaviors, the model attains leading performance among small-scale models on long-horizon benchmarks such as **Xbench-Deepsearch, HLE, and GAIA**, and reliably executes up to hundreds of consecutive tool invocations.
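The long-horizon behavior described above reduces to a standard agent loop: the model alternates between emitting tool calls and reading tool results until it commits to a final answer, within a turn budget. Below is a minimal illustrative sketch of such a loop; the message layout, tool registry, and call format are assumptions for illustration, not the model's actual interface.

```python
# Minimal illustrative agent loop: the model alternates between emitting
# tool calls and reading tool results until it produces a final answer.
# The action/message schema here is an assumption, not the real interface.

def run_agent(model_step, tools, question, max_turns=200):
    """Drive a tool-using agent for up to `max_turns` consecutive tool calls."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        # model_step returns {"tool": name, "args": {...}} or {"answer": text}
        action = model_step(history)
        if "answer" in action:
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "name": action["tool"], "content": result})
    return None  # turn budget exhausted without a final answer


# Toy driver: a fake search tool and a scripted stand-in for the model.
def fake_search(query):
    return f"results for {query!r}"

def scripted_model(history):
    if len(history) < 3:  # issue two searches, then answer
        return {"tool": "search", "args": {"query": f"hop {len(history)}"}}
    return {"answer": f"done after {len(history) - 1} tool calls"}

print(run_agent(scripted_model, {"search": fake_search}, "Who founded X?"))
```

In a real deployment the `model_step` call would be an inference pass over the conversation so far; the loop structure is what makes hundreds of consecutive invocations possible.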

<div align="center">
<img src="nbg_search_performance.png">
</div>

# Highlights

1. **Strong Performance at Compact Scale**

   ToolMind-Web-3B delivers high-quality long-horizon reasoning and tool-augmented search while keeping a lightweight 3B-parameter footprint. Despite its compact size, it achieves competitive results across benchmarks such as **Xbench-Deepsearch, GAIA, and HLE**. The model is evaluated under the **[MiroThinker workflow](https://github.com/MiroMindAI/MiroThinker)**, ensuring standardized and reproducible assessment.

2. **An Open-Source Complex QA Dataset Synthesized from Wikipedia Entity-Relation Knowledge Graphs**

   We provide a rich, structured QA dataset derived from Wikipedia knowledge graphs, designed to support supervised fine-tuning and reinforcement learning of search-augmented agents.
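One common way to synthesize such complex questions is to chain entity-relation triples into multi-hop paths whose final entity becomes the answer. The sketch below illustrates the idea with toy triples and a toy question template; it is not the released dataset's actual schema or pipeline.

```python
# Illustrative multi-hop QA synthesis from entity-relation triples.
# The triples, relations, and question template are toy examples.

TRIPLES = {
    ("Douglas Adams", "author of"): "The Hitchhiker's Guide to the Galaxy",
    ("The Hitchhiker's Guide to the Galaxy", "published by"): "Pan Books",
    ("Pan Books", "headquartered in"): "London",
}

def synthesize_qa(start, relations):
    """Walk a relation chain from `start`; the final entity is the answer."""
    entity = start
    for rel in relations:
        entity = TRIPLES[(entity, rel)]  # follow one hop in the graph
    question = (
        f"Starting from {start}, follow the relations "
        f"{' -> '.join(relations)}. Which entity do you reach?"
    )
    return question, entity

q, a = synthesize_qa(
    "Douglas Adams", ["author of", "published by", "headquartered in"]
)
print(q)
print(a)  # London
```

Longer chains and natural-language paraphrases of the relation path yield harder questions that require several search hops to resolve, which is what makes this kind of data useful for training search agents.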

3. **Turn-Level-Judge-Guided SFT and Reinforcement Learning**

   In the supervised fine-tuning (SFT) stage, a **turn-level judge** identifies which interaction turns should be used for training. In the reinforcement learning (RL) stage, a **turn-level reward** provides feedback that refines the model's multi-turn search and tool-invocation behavior.
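The SFT-side filtering can be pictured as a per-token loss mask that keeps only judge-accepted turns in the training objective. The sketch below shows this idea under an assumed data layout (a trajectory as a list of turns, each with its token ids); the real training code and judge interface may differ.

```python
# Illustrative turn-level loss masking for SFT: tokens from turns the
# judge rejects are masked out of the loss. Data layout is an assumption.

def build_loss_mask(turns, judge):
    """turns: list of (turn_id, token_ids); judge(turn_id) -> bool (keep?).

    Returns a flat 0/1 mask aligned with the concatenated token stream;
    masked (0) positions contribute nothing to the training loss.
    """
    mask = []
    for turn_id, token_ids in turns:
        keep = 1 if judge(turn_id) else 0
        mask.extend([keep] * len(token_ids))
    return mask


# Toy trajectory: three turns, the judge rejects the middle one.
turns = [("t1", [11, 12]), ("t2", [13, 14, 15]), ("t3", [16])]
accepted = {"t1", "t3"}  # judge verdicts
mask = build_loss_mask(turns, lambda t: t in accepted)
print(mask)  # [1, 1, 0, 0, 0, 1]
```

The RL-side analogue assigns the turn-level reward at the same granularity, crediting or penalizing individual tool-use turns rather than only the trajectory's final answer.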

<!-- > **Note:** *Report and evaluation details will be released soon.* -->

# Benchmark Results

| Model | GAIA | BrowseComp | BrowseComp-zh | HLE | Seal-0 | Xbench-Deepsearch | Xbench-Deepsearch-10 | DSQA |
|-------|------|------------|---------------|-----|--------|-------------------|----------------------|------|
| DeepSeek-V3.2 | 0.635 | 0.676 | 0.650 | 0.408 | 0.385 | 0.710 | / | / |
| MiniMax-M2 | 0.757 | 0.440 | 0.485 | 0.318 | / | 0.720 | / | / |
| GLM-4.6 | 0.719 | 0.451 | 0.495 | 0.304 | / | 0.700 | / | / |
| MiroThinker 8B | 0.664 | 0.311 | 0.402 | 0.215 | 0.404 | 0.606 | / | / |
| AgentCPM-Explore 4B | 0.639 | 0.250 | 0.290 | 0.191 | 0.400 | 0.700 | / | / |
| **Ours** | | | | | | | | |
| **ToolMind-Web-3B (w/ Synthetic QA only)** | 0.583 | 0.144 | 0.301 | 0.224 | 0.360 | 0.760 | 0.300 | 0.308 |
| **ToolMind-Web-3B** | 0.670 | 0.174 | 0.308 | 0.248 | 0.477 | 0.751 | 0.370 | 0.458 |

# <span id="Limitations">Limitations</span>

While we place great emphasis on safety during training and strive to ensure the model's outputs meet ethical and legal requirements, its compact size and probabilistic nature mean it cannot completely avoid generating unexpected outputs, which may include harmful content such as bias or discrimination. Please do not propagate such content. We assume no responsibility for consequences resulting from the dissemination of inappropriate information.
<br>

# <span id="Citation">Citation</span>
If you find our model useful or use it in your projects, please cite this project.
<br>

# <span id="Contact">Contact</span>
If you have any questions, please open an issue or contact us at nanbeige@126.com.
<br>