Agents-X
/

PyVision-Image-7B-RL

Image-Text-to-Text

reinforcement-learning

text-generation-inference

Model card Files Files and versions

PyVision-Image-7B-RL / README.md

stzhao's picture

Improve model card and add metadata (#1)

84a6975 1 day ago

|

history blame contribute delete

1.73 kB

	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: image-text-to-text
	base_model: Qwen/Qwen2.5-VL-7B-Instruct
	tags:
	- multimodal
	- agent
	- reinforcement-learning
	- qwen
	---

	# PyVision-Image-7B-RL

	[PyVision-RL: Forging Open Agentic Vision Models via RL](https://arxiv.org/abs/2602.20739)

	This is PyVision-Image-7B-RL, a multimodal agentic vision model post-trained from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) using the PyVision-RL reinforcement learning framework.

	- Project Page: [https://agent-x.space/pyvision-rl/](https://agent-x.space/pyvision-rl/)
	- Repository: [https://github.com/agents-x-project/PyVision-RL](https://github.com/agents-x-project/PyVision-RL)
	- Paper: [https://arxiv.org/abs/2602.20739](https://arxiv.org/abs/2602.20739)

	## Description

	Reinforcement learning for agentic multimodal models often suffers from "interaction collapse," where models learn to reduce tool usage and multi-turn reasoning. PyVision-RL is a framework designed to stabilize training and sustain interaction using an oversampling-filtering-ranking rollout strategy combined with an accumulative tool reward.

	PyVision-Image-7B-RL is specifically optimized for image understanding tasks and sustained multi-turn tool interaction, demonstrating strong performance and efficiency for scalable multimodal agents.

	## Citation

	If you find this work useful, please cite the following paper:

	```bibtex
	@article{pyvisionrl2026,
	title={PyVision-RL: Forging Open Agentic Vision Models via RL},
	author={Zhao, Shitian and Lin, Shaoheng and Li, Ming and Zhang, Haoquan and Peng, Wenshuo and Zhang, Kaipeng and Wei, Chen},
	journal={arXiv:2602.20739},
	year={2026}
	}
	```