mwxely

Add Website + Daily Paper badges

e399087 3 days ago

5.65 kB

	---
	base_model:
	- Qwen/Qwen3-VL-8B-Instruct
	datasets:
	- ParaVT/ParaVT-Parquet
	- ParaVT/ParaVT-Source
	license: apache-2.0
	library_name: transformers
	pipeline_tag: video-text-to-text
	language:
	- en
	tags:
	- video
	- long-video
	- reasoning
	- tool-calling
	- agentic-rl
	- grpo
	- multimodal
	---

	# ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

	<div align="center">

	[![Paper](https://img.shields.io/badge/Paper-000000?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2605.20342)
	[![Website](https://img.shields.io/badge/Website-000000?style=for-the-badge&logo=google-chrome&logoColor=white)](https://evolvinglmms-lab.github.io/ParaVT/)
	[![Code](https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/EvolvingLMMs-Lab/ParaVT)
	[![Data](https://img.shields.io/badge/Data-0040A1?style=for-the-badge&logo=huggingface&logoColor=ffffff)](https://huggingface.co/datasets/ParaVT/ParaVT-Parquet)
	[![Source](https://img.shields.io/badge/Source-0040A1?style=for-the-badge&logo=huggingface&logoColor=ffffff)](https://huggingface.co/datasets/ParaVT/ParaVT-Source)
	[![Daily Paper](https://img.shields.io/badge/🚀_Daily_Paper-FF9D00?style=for-the-badge)](https://huggingface.co/papers/2605.20342)

	</div>

	## Overview

	Training large multimodal models (LMMs) via reinforcement learning to natively invoke video-processing tools (such as temporal cropping) has become a promising route to long-video understanding. Existing native-RL methods, however, dispatch tool calls sequentially (one per turn): a single wrong crop propagates errors without peer correction, multi-turn calls corrupt context, and inference cost scales linearly with the number of turns.

	ParaVT is the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling: it dispatches multiple time-window crops in a single turn for cleaner context and better fault tolerance. Applying standard RL to ParaVT surfaces an obstacle we term the Tool Prior Paradox, where the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose a skip-tool reward shortcut under temperature sampling. We address this with PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO): a targeted format reward applied only at the structural-token positions most prone to collapse, and a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it.

	## Model Card

	This repository hosts the final post-RL checkpoint (`ParaVT-8B`), obtained by running PARA-GRPO on top of the cold-start SFT checkpoint [`mwxely/ParaVT-8B-SFT`](https://huggingface.co/mwxely/ParaVT-8B-SFT). The base architecture is `Qwen3VLForConditionalGeneration`, identical to [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct); only the language-model weights are updated.

	\| Field \| Value \|
	\|---\|---\|
	\| Architecture \| `Qwen3VLForConditionalGeneration` \|
	\| Parameters \| 8 B \|
	\| Base model \| `Qwen/Qwen3-VL-8B-Instruct` \|
	\| Training stages \| Cold-start SFT (500 steps) → PARA-GRPO RL (54 steps) \|
	\| Training data \| [`ParaVT/ParaVT-Parquet`](https://huggingface.co/datasets/ParaVT/ParaVT-Parquet) (`sft` + `rl` configs) \|
	\| Source videos \| [`ParaVT/ParaVT-Source`](https://huggingface.co/datasets/ParaVT/ParaVT-Source) \|
	\| Native tool \| Temporal cropping (start time, end time, optional sub-frame count) \|

	## Usage

	`ParaVT-8B` is a drop-in `transformers` / `vllm` model for video-text-to-text. The full evaluation driver, prompt templates, and reproduction scripts live in the [ParaVT GitHub repository](https://github.com/EvolvingLMMs-Lab/ParaVT); please refer to it for the exact environment that produced the reported numbers.

	```bash
	# Reproduce the headline numbers (after installing the eval venv)
	git clone https://github.com/EvolvingLMMs-Lab/ParaVT.git && cd ParaVT
	cp .secrets.env.example .secrets.env && $EDITOR .secrets.env
	bash scripts/setup_env.sh eval
	PARAVT_EVAL_MODEL=ParaVT/ParaVT-8B \
	bash paravt/eval/scripts/reproduce_paravt_8b.sh
	```

	For inference outside the eval driver, treat the model exactly like `Qwen/Qwen3-VL-8B-Instruct`: vLLM `--model ParaVT/ParaVT-8B`, the same tokenizer, the same chat template. The agentic system prompt and the tool schema used during PARA-GRPO are documented in [`paravt/eval/configs/withtool.yaml`](https://github.com/EvolvingLMMs-Lab/ParaVT/blob/main/paravt/eval/configs/withtool.yaml) and [`paravt/eval/utils.py`](https://github.com/EvolvingLMMs-Lab/ParaVT/blob/main/paravt/eval/utils.py).

	## Citation

	If you find ParaVT useful for your research and applications, please cite:

	```bibtex
	@misc{yang2026paravt,
	title={{ParaVT}: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning},
	author={Zuhao Yang and Kaichen Zhang and Sudong Wang and Keming Wu and Zhongyu Yang and Bo Li and Xiaojuan Qi and Shijian Lu and Xingxuan Li and Lidong Bing},
	year={2026},
	eprint={2605.20342},
	archivePrefix={arXiv},
	primaryClass={cs.CV}
	}
	```

	## Acknowledgements

	ParaVT builds on the [LongVT](https://github.com/EvolvingLMMs-Lab/LongVT) (CVPR 2026) framework for native video tool calling, the [`lmms-engine`](https://github.com/EvolvingLMMs-Lab/lmms-engine) cold-start SFT infrastructure, the [`AReaL`](https://github.com/inclusionAI/AReaL) RL training stack, and the [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval) evaluation harness. We thank the maintainers of all of the above.