Video-Text-to-Text
Transformers
Safetensors
English
qwen3_vl
image-text-to-text
video
long-video
reasoning
tool-calling
agentic-rl
grpo
multimodal
Instructions to use ParaVT/ParaVT-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ParaVT/ParaVT-8B with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("ParaVT/ParaVT-8B") model = AutoModelForImageTextToText.from_pretrained("ParaVT/ParaVT-8B") - Notebooks
- Google Colab
- Kaggle
| base_model: | |
| - Qwen/Qwen3-VL-8B-Instruct | |
| datasets: | |
| - ParaVT/ParaVT-Parquet | |
| - ParaVT/ParaVT-Source | |
| license: apache-2.0 | |
| library_name: transformers | |
| pipeline_tag: video-text-to-text | |
| language: | |
| - en | |
| tags: | |
| - video | |
| - long-video | |
| - reasoning | |
| - tool-calling | |
| - agentic-rl | |
| - grpo | |
| - multimodal | |
| # ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning | |
| <div align="center"> | |
| [](https://arxiv.org/abs/2605.20342) | |
| [](https://evolvinglmms-lab.github.io/ParaVT/) | |
| [](https://github.com/EvolvingLMMs-Lab/ParaVT) | |
| [](https://huggingface.co/datasets/ParaVT/ParaVT-Parquet) | |
| [](https://huggingface.co/datasets/ParaVT/ParaVT-Source) | |
| [](https://huggingface.co/papers/2605.20342) | |
| </div> | |
| ## Overview | |
| Training large multimodal models (LMMs) via reinforcement learning to natively invoke video-processing tools (such as temporal cropping) has become a promising route to long-video understanding. Existing native-RL methods, however, dispatch tool calls sequentially (one per turn): a single wrong crop propagates errors without peer correction, multi-turn calls corrupt context, and inference cost scales linearly with the number of turns. | |
| **ParaVT** is the first multi-agent end-to-end RL-trained framework for **Para**llel **V**ideo **T**ool calling: it dispatches multiple time-window crops in a single turn for cleaner context and better fault tolerance. Applying standard RL to ParaVT surfaces an obstacle we term the *Tool Prior Paradox*, where the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose a skip-tool reward shortcut under temperature sampling. We address this with **PARA-GRPO** (Parseability-Anchored and Ratio-gAted GRPO): a targeted format reward applied only at the structural-token positions most prone to collapse, and a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. | |
| ## Model Card | |
| This repository hosts the final post-RL checkpoint (`ParaVT-8B`), obtained by running PARA-GRPO on top of the cold-start SFT checkpoint [`mwxely/ParaVT-8B-SFT`](https://huggingface.co/mwxely/ParaVT-8B-SFT). The base architecture is `Qwen3VLForConditionalGeneration`, identical to [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct); only the language-model weights are updated. | |
| | Field | Value | | |
| |---|---| | |
| | Architecture | `Qwen3VLForConditionalGeneration` | | |
| | Parameters | 8 B | | |
| | Base model | `Qwen/Qwen3-VL-8B-Instruct` | | |
| | Training stages | Cold-start SFT (500 steps) → PARA-GRPO RL (54 steps) | | |
| | Training data | [`ParaVT/ParaVT-Parquet`](https://huggingface.co/datasets/ParaVT/ParaVT-Parquet) (`sft` + `rl` configs) | | |
| | Source videos | [`ParaVT/ParaVT-Source`](https://huggingface.co/datasets/ParaVT/ParaVT-Source) | | |
| | Native tool | Temporal cropping (start time, end time, optional sub-frame count) | | |
| ## Usage | |
| `ParaVT-8B` is a drop-in `transformers` / `vllm` model for video-text-to-text. The full evaluation driver, prompt templates, and reproduction scripts live in the [ParaVT GitHub repository](https://github.com/EvolvingLMMs-Lab/ParaVT); please refer to it for the exact environment that produced the reported numbers. | |
| ```bash | |
| # Reproduce the headline numbers (after installing the eval venv) | |
| git clone https://github.com/EvolvingLMMs-Lab/ParaVT.git && cd ParaVT | |
| cp .secrets.env.example .secrets.env && $EDITOR .secrets.env | |
| bash scripts/setup_env.sh eval | |
| PARAVT_EVAL_MODEL=ParaVT/ParaVT-8B \ | |
| bash paravt/eval/scripts/reproduce_paravt_8b.sh | |
| ``` | |
| For inference outside the eval driver, treat the model exactly like `Qwen/Qwen3-VL-8B-Instruct`: vLLM `--model ParaVT/ParaVT-8B`, the same tokenizer, the same chat template. The agentic system prompt and the tool schema used during PARA-GRPO are documented in [`paravt/eval/configs/withtool.yaml`](https://github.com/EvolvingLMMs-Lab/ParaVT/blob/main/paravt/eval/configs/withtool.yaml) and [`paravt/eval/utils.py`](https://github.com/EvolvingLMMs-Lab/ParaVT/blob/main/paravt/eval/utils.py). | |
| ## Citation | |
| If you find ParaVT useful for your research and applications, please cite: | |
| ```bibtex | |
| @misc{yang2026paravt, | |
| title={{ParaVT}: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning}, | |
| author={Zuhao Yang and Kaichen Zhang and Sudong Wang and Keming Wu and Zhongyu Yang and Bo Li and Xiaojuan Qi and Shijian Lu and Xingxuan Li and Lidong Bing}, | |
| year={2026}, | |
| eprint={2605.20342}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CV} | |
| } | |
| ``` | |
| ## Acknowledgements | |
| ParaVT builds on the [LongVT](https://github.com/EvolvingLMMs-Lab/LongVT) (CVPR 2026) framework for native video tool calling, the [`lmms-engine`](https://github.com/EvolvingLMMs-Lab/lmms-engine) cold-start SFT infrastructure, the [`AReaL`](https://github.com/inclusionAI/AReaL) RL training stack, and the [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval) evaluation harness. We thank the maintainers of all of the above. | |