---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
license: apache-2.0
library_name: transformers
pipeline_tag: video-text-to-text
---
# PyVision-Video-7B-SFT

[PyVision-RL: Forging Open Agentic Vision Models via RL](https://arxiv.org/abs/2602.20739)

This is **PyVision-Video-7B-SFT**, post-trained from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct).

- **Project Page:** [https://agent-x.space/pyvision-rl/](https://agent-x.space/pyvision-rl/)
- **Repository:** [https://github.com/agents-x-project/PyVision-RL](https://github.com/agents-x-project/PyVision-RL)
- **Paper:** [arXiv:2602.20739](https://arxiv.org/abs/2602.20739)
## Model Description

PyVision-Video is part of the PyVision-RL framework, which stabilizes reinforcement learning (RL) training for open-weight multimodal models so that they can sustain agentic interaction.

For video reasoning, PyVision-Video employs an **on-demand context construction** strategy: it selectively samples task-relevant frames during the reasoning process, significantly reducing visual token usage while maintaining strong performance on complex video understanding tasks. This model is the Supervised Fine-Tuning (SFT) checkpoint produced before RL training.
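Below is a minimal loading and inference sketch, assuming this checkpoint follows the standard Qwen2.5-VL inference path in `transformers` with the `qwen-vl-utils` helper package installed. The repository id and video path used here are placeholders, not confirmed by this card, and the full agentic on-demand frame-sampling loop lives in the PyVision-RL repository rather than in plain `generate()`.

```python
# Minimal inference sketch; assumes a recent transformers release with Qwen2.5-VL
# support and the qwen-vl-utils helper package. The model id and video path are
# placeholders.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "agents-x-project/PyVision-Video-7B-SFT"  # placeholder repository id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4"},  # placeholder path
            {"type": "text", "text": "What happens in this video?"},
        ],
    }
]

# Build the chat prompt and extract the sampled video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```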
## Citation

If you find this work useful, please cite the following paper:

```bibtex
@article{pyvisionrl2026,
  title={PyVision-RL: Forging Open Agentic Vision Models via RL},
  author={Zhao, Shitian and Lin, Shaoheng and Li, Ming and Zhang, Haoquan and Peng, Wenshuo and Zhang, Kaipeng and Wei, Chen},
  journal={arXiv preprint arXiv:2602.20739},
  year={2026}
}
```