Create README.md

5275f2b verified 6 days ago

4.94 kB

	---
	library_name: transformers
	license: apache-2.0
	base_model:
	- Qwen/Qwen3-VL-8B-Instruct
	- Accio-Lab/Metis-8B-ColdStart
	tags:
	- multimodal
	- vision-language
	- reinforcement-learning
	- tool-use
	- agentic
	- qwen3_vl
	- HDPO
	datasets:
	- Accio-Lab/Metis-RL
	language:
	- en
	pipeline_tag: image-text-to-text
	---

	# Metis-8B-RL

	Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

	Metis-8B-RL is the final RL-trained checkpoint of the Metis framework, trained with Hierarchical Decoupled Policy Optimization (HDPO) on top of [Metis-8B-ColdStart](https://huggingface.co/Accio-Lab/Metis-8B-ColdStart). It is a strategic multimodal reasoning agent that selectively invokes code execution, text search, and image search tools during multi-turn reasoning.

	[[Paper (arXiv)]](https://arxiv.org/abs/2604.08545) \| [[GitHub]](https://github.com/Accio-Lab/Metis) \| [[ColdStart Model]](https://huggingface.co/Accio-Lab/Metis-8B-ColdStart) \| [[RL Data]](https://huggingface.co/datasets/Accio-Lab/Metis-RL) \| [[ColdStart Data]](https://huggingface.co/datasets/Accio-Lab/Metis-ColdStart)

	## Highlights

	- 98% → 2% Tool Calls — Reduces blind tool invocation by orders of magnitude.
	- SOTA Performance — Best accuracy across 13 benchmarks among open-source 8B agentic models.
	- Meta-Cognitive Wisdom — Learns when to use tools, not just how.

	## Model Details

	\| Attribute \| Value \|
	\|---\|---\|
	\| Base model \| [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) \|
	\| SFT checkpoint \| [Metis-8B-ColdStart](https://huggingface.co/Accio-Lab/Metis-8B-ColdStart) \|
	\| RL algorithm \| HDPO (Hierarchical Decoupled Policy Optimization) \|
	\| Training data \| [Metis-RL](https://huggingface.co/datasets/Accio-Lab/Metis-RL) (~5K prompts) \|
	\| License \| Apache-2.0 \|

	### HDPO Training Hyperparameters

	\| Hyperparameter \| Value \|
	\|---\|---\|
	\| Batch size \| 128 \|
	\| Rollouts per prompt (G) \| 16 \|
	\| Learning rate \| 1e-6 \|
	\| KL coefficient \| 0 \|
	\| Loss weights \| w_acc = 1.0, w_tool = 0.15 \|
	\| Max response length \| 16,384 tokens \|

	## Method: Hierarchical Decoupled Policy Optimization (HDPO)

	Current agentic multimodal models suffer from blind tool invocation — they reflexively call external tools even when queries are directly resolvable from the visual context. Existing RL methods attempt to fix this by coupling accuracy and tool-efficiency into a single scalar reward, but this creates an irreconcilable optimization dilemma.

	HDPO resolves this through three key components:

	1. Dual Reward Design — An accuracy reward (r_acc) and a tool-efficiency reward (r_tool) that is conditioned on correctness.
	2. Decoupled Advantage Estimation — Accuracy advantages are computed over all rollouts; tool efficiency advantages are computed exclusively over correct rollouts (conditional GRPO).
	3. Hierarchical Policy Update — Two independent clipped surrogate losses combined as `L_HDPO = w_acc · L_GRPO(A_acc) + w_tool · L_GRPO(A_tool)`.

	This naturally induces an implicit curriculum: first learn to be correct, then learn to be efficient.

	## Evaluation Results

	### Perception and Document Understanding

	\| Model \| V\*Bench \| HR4K \| HR8K \| TreeBench \| MME-RW \| SEED2+ \| CharXiv(DQ) \| CharXiv(RQ) \|
	\|---\|---\|---\|---\|---\|---\|---\|---\|---\|
	\| Qwen3-VL-8B-Instruct \| 86.4 \| 78.9 \| 74.6 \| 40.7 \| 61.9 \| 71.0 \| 83.0 \| 46.3 \|
	\| DeepEyesV2 \| 81.8 \| 77.9 \| 73.8 \| 42.5 \| 64.9 \| 70.5 \| 78.6 \| 48.9 \|
	\| SenseNova-MARS-8B \| 92.2 \| 83.1 \| 78.4 \| - \| 67.9 \| - \| - \| - \|
	\| Skywork-R1V4-30B-A3B \| 88.0 \| 82.8 \| 79.8 \| - \| 71.4 \| - \| - \| - \|
	\| Metis (Ours) \| 91.1 \| 83.5 \| 82.0 \| 45.2 \| 70.3 \| 72.5 \| 83.4 \| 54.1 \|

	### Mathematical and Logical Reasoning

	\| Model \| MathVista \| MathVerse \| WeMath \| DynaMath \| LogicVista \| Avg. \|
	\|---\|---\|---\|---\|---\|---\|---\|
	\| Qwen3-VL-8B-Instruct \| 76.3 \| 61.3 \| 38.8 \| 65.5 \| 54.9 \| 59.4 \|
	\| DeepEyesV2 \| 71.9 \| 52.7 \| 38.1 \| 57.2 \| 48.7 \| 53.7 \|
	\| Metis (Ours) \| 78.0 \| 65.9 \| 65.2 \| 69.2 \| 56.2 \| 66.9 \|

	## Usage

	Please refer to the [GitHub repository](https://github.com/Accio-Lab/Metis) for full installation and inference instructions.

	### Installation

	```bash
	git clone https://github.com/Accio-Lab/Metis.git
	cd Metis
	pip install -e verl
	pip install -e ".[vllm,search_tool,python_code_dep]"
	```

	## Citation

	```bibtex
	@article{yan2026metis,
	title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
	author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
	journal={arXiv preprint arXiv:2604.08545},
	year={2026}
	}
	```

	## Acknowledgments

	Metis is built upon [verl](https://github.com/volcengine/verl), [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool), and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL).