---
language:
- en
- zh
license: apache-2.0
tags:
- search-agent
- rag
- deep-research
- 3b
- nlp
pipeline_tag: text-generation
---

# Overview

ToolMind-Web-3B is a lightweight, search-specialized agent built on the [**Nanbeige4-3B-Thinking-2511**](https://huggingface.co/Nanbeige/Nanbeige4-3B-Thinking-2511) foundation model. After extensive **SFT (supervised fine-tuning)** and **RL (reinforcement learning)** focused on search behaviors, the model attains leading performance among small-scale models on long-horizon benchmarks such as **Xbench-Deepsearch, HLE, and GAIA**, and reliably executes up to hundreds of consecutive tool invocations.
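The long-horizon behavior described above reduces to a standard agent loop: the model alternates between emitting tool calls and reading tool results until it commits to a final answer, within a turn budget. Below is a minimal illustrative sketch of such a loop; the message layout, tool registry, and call format are assumptions for illustration, not the model's actual interface.

```python
# Minimal illustrative agent loop: the model alternates between emitting
# tool calls and reading tool results until it produces a final answer.
# The action/message schema here is an assumption, not the real interface.

def run_agent(model_step, tools, question, max_turns=200):
    """Drive a tool-using agent for up to `max_turns` consecutive tool calls."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        # model_step returns {"tool": name, "args": {...}} or {"answer": text}
        action = model_step(history)
        if "answer" in action:
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "name": action["tool"], "content": result})
    return None  # turn budget exhausted without a final answer


# Toy driver: a fake search tool and a scripted stand-in for the model.
def fake_search(query):
    return f"results for {query!r}"

def scripted_model(history):
    if len(history) < 3:  # issue two searches, then answer
        return {"tool": "search", "args": {"query": f"hop {len(history)}"}}
    return {"answer": f"done after {len(history) - 1} tool calls"}

print(run_agent(scripted_model, {"search": fake_search}, "Who founded X?"))
```

In a real deployment the `model_step` call would be an inference pass over the conversation so far; the loop structure is what makes hundreds of consecutive invocations possible.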

<div align="center">
<img src="nbg_search_performance.png">
</div>

# Highlights

1. **Strong Performance at Compact Scale**

   ToolMind-Web-3B delivers high-quality long-horizon reasoning and tool-augmented search while keeping a lightweight 3B-parameter footprint. Despite its compact size, it achieves competitive results across benchmarks such as **Xbench-Deepsearch, GAIA, and HLE**. The model is evaluated under the **[MiroThinker workflow](https://github.com/MiroMindAI/MiroThinker)**, ensuring standardized and reproducible assessment.

2. **An Open-Source Complex QA Dataset Synthesized from Wikipedia Entity-Relation Knowledge Graphs**

   We provide a rich, structured QA dataset derived from Wikipedia knowledge graphs, designed to support supervised fine-tuning and reinforcement learning of search-augmented agents.
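One common way to synthesize such complex questions is to chain entity-relation triples into multi-hop paths whose final entity becomes the answer. The sketch below illustrates the idea with toy triples and a toy question template; it is not the released dataset's actual schema or pipeline.

```python
# Illustrative multi-hop QA synthesis from entity-relation triples.
# The triples, relations, and question template are toy examples.

TRIPLES = {
    ("Douglas Adams", "author of"): "The Hitchhiker's Guide to the Galaxy",
    ("The Hitchhiker's Guide to the Galaxy", "published by"): "Pan Books",
    ("Pan Books", "headquartered in"): "London",
}

def synthesize_qa(start, relations):
    """Walk a relation chain from `start`; the final entity is the answer."""
    entity = start
    for rel in relations:
        entity = TRIPLES[(entity, rel)]  # follow one hop in the graph
    question = (
        f"Starting from {start}, follow the relations "
        f"{' -> '.join(relations)}. Which entity do you reach?"
    )
    return question, entity

q, a = synthesize_qa(
    "Douglas Adams", ["author of", "published by", "headquartered in"]
)
print(q)
print(a)  # London
```

Longer chains and natural-language paraphrases of the relation path yield harder questions that require several search hops to resolve, which is what makes this kind of data useful for training search agents.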

3. **Turn-Level-Judge-Guided SFT and Reinforcement Learning**

   In the supervised fine-tuning (SFT) stage, a **turn-level judge** identifies which interaction turns should be used for training. In the reinforcement learning (RL) stage, a **turn-level reward** provides feedback that refines the model's multi-turn search and tool-invocation behavior.
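The SFT-side filtering can be pictured as a per-token loss mask that keeps only judge-accepted turns in the training objective. The sketch below shows this idea under an assumed data layout (a trajectory as a list of turns, each with its token ids); the real training code and judge interface may differ.

```python
# Illustrative turn-level loss masking for SFT: tokens from turns the
# judge rejects are masked out of the loss. Data layout is an assumption.

def build_loss_mask(turns, judge):
    """turns: list of (turn_id, token_ids); judge(turn_id) -> bool (keep?).

    Returns a flat 0/1 mask aligned with the concatenated token stream;
    masked (0) positions contribute nothing to the training loss.
    """
    mask = []
    for turn_id, token_ids in turns:
        keep = 1 if judge(turn_id) else 0
        mask.extend([keep] * len(token_ids))
    return mask


# Toy trajectory: three turns, the judge rejects the middle one.
turns = [("t1", [11, 12]), ("t2", [13, 14, 15]), ("t3", [16])]
accepted = {"t1", "t3"}  # judge verdicts
mask = build_loss_mask(turns, lambda t: t in accepted)
print(mask)  # [1, 1, 0, 0, 0, 1]
```

The RL-side analogue assigns the turn-level reward at the same granularity, crediting or penalizing individual tool-use turns rather than only the trajectory's final answer.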

<!-- > **Note:** *Report and evaluation details will be released soon.* -->

# Benchmark Results

| Model | GAIA | BrowseComp | BrowseComp-zh | HLE | Seal-0 | Xbench-Deepsearch | Xbench-Deepsearch-10 | DSQA |
|-------|------|------------|---------------|-----|--------|-------------------|----------------------|------|
| DeepSeek-V3.2 | 0.635 | 0.676 | 0.650 | 0.408 | 0.385 | 0.710 | / | / |
| MiniMax-M2 | 0.757 | 0.440 | 0.485 | 0.318 | / | 0.720 | / | / |
| GLM-4.6 | 0.719 | 0.451 | 0.495 | 0.304 | / | 0.700 | / | / |
| MiroThinker 8B | 0.664 | 0.311 | 0.402 | 0.215 | 0.404 | 0.606 | / | / |
| AgentCPM-Explore 4B | 0.639 | 0.250 | 0.290 | 0.191 | 0.400 | 0.700 | / | / |
| **Ours** | | | | | | | | |
| **ToolMind-Web-3B (w/ Synthetic QA only)** | 0.583 | 0.144 | 0.301 | 0.224 | 0.360 | 0.760 | 0.300 | 0.308 |
| **ToolMind-Web-3B** | 0.670 | 0.174 | 0.308 | 0.248 | 0.477 | 0.751 | 0.370 | 0.458 |

# <span id="Limitations">Limitations</span>

While we place great emphasis on safety during training and strive to ensure the model's outputs meet ethical and legal requirements, its compact size and probabilistic nature mean it cannot completely avoid generating unexpected outputs, which may include harmful content such as bias or discrimination. Please do not propagate such content. We assume no responsibility for consequences resulting from the dissemination of inappropriate information.
<br>

# <span id="Citation">Citation</span>
If you find our model useful or use it in your projects, please cite this project.
<br>

# <span id="Contact">Contact</span>
If you have any questions, please open an issue or contact us at nanbeige@126.com.
<br>