Improve model card: add metadata, paper info and links
#1
by nielsr HF Staff - opened
README.md CHANGED
@@ -1,9 +1,28 @@
+---
+pipeline_tag: video-text-to-text
+library_name: transformers
+base_model: Qwen/Qwen2.5-VL-7B-Instruct
+---
 
-#
-
+# VisionCoach-7B
+
+[**VisionCoach**](https://visioncoach.github.io/) is an input-adaptive reinforcement learning (RL) framework designed to improve spatio-temporal grounding in video reasoning via visual-perception prompting as training-time guidance. The model internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference.
+
+- **Paper:** [VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting](https://huggingface.co/papers/2603.14659)
+- **Project Page:** [https://visioncoach.github.io/](https://visioncoach.github.io/)
+- **Repository:** [https://github.com/daeunni/VisionCoach](https://github.com/daeunni/VisionCoach)
+
+## Model Description
+VisionCoach addresses the challenge of reliable spatio-temporal grounding in video reasoning. It consists of two main components:
+1. **Visual Prompt Selector:** Predicts appropriate prompt types conditioned on the video and question.
+2. **Spatio-Temporal Reasoner:** Optimized with RL under visual prompt guidance and object-aware grounding rewards.
+
+## Base Model
+- [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
 
 ## Training Data
-We use same dataset from [Open-o3-video](https://huggingface.co/datasets/marinero4972/Open-o3-Video/tree/main)
+We use the same dataset from [Open-o3-video](https://huggingface.co/datasets/marinero4972/Open-o3-Video/tree/main).
+
 | Stage | Dataset |
 |-------|---------|
 | SFT | STGR-SFT.json |
@@ -16,4 +35,23 @@ from transformers import AutoModelForCausalLM, AutoProcessor
 
 model = AutoModelForCausalLM.from_pretrained("danaleee/VisionCoach-7B")
 processor = AutoProcessor.from_pretrained("danaleee/VisionCoach-7B")
-```
+```
+
+## Citation
+
+If you find this work helpful, please consider citing:
+
+```bibtex
+@misc{lee2026visioncoachreinforcinggroundedvideo,
+      title={VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting},
+      author={Daeun Lee and Shoubin Yu and Yue Zhang and Mohit Bansal},
+      year={2026},
+      eprint={2603.14659},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2603.14659},
+}
+```
+
+## Acknowledgements
+We sincerely thank the following projects for their contributions to this work: [Open-o3-Video](https://github.com/marinero4972/Open-o3-Video), [Video-R1](https://github.com/tulerfeng/Video-R1), [R1-V](https://github.com/StarsfieldAI/R1-V), and [ObjectMLLM](https://github.com/brown-palm/ObjectMLLM).
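
A note on the usage snippet: the second hunk only shows the tail of the example (the two `from_pretrained` calls), and the card loads the checkpoint with `AutoModelForCausalLM`. Below is a fuller, non-authoritative sketch of how the model could be run on a raw video, assuming the checkpoint follows the Qwen2.5-VL architecture declared in `base_model`. The `Qwen2_5_VLForConditionalGeneration` class, the `qwen_vl_utils` helper, the sample video path, and the question text are assumptions, not part of the card.

```python
# Hedged sketch: run the checkpoint on a raw video with the standard Qwen2.5-VL
# chat-template flow. Video path and question are placeholders.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # assumed helper from the Qwen2.5-VL ecosystem

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "danaleee/VisionCoach-7B", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("danaleee/VisionCoach-7B")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4"},  # placeholder path
            {"type": "text", "text": "When and where does the person pick up the cup?"},  # placeholder question
        ],
    }
]

# Build the prompt, extract video frames, and tokenize everything together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate the grounded answer; per the card, no visual prompting is applied at inference.
output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```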
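
The Model Description mentions RL with object-aware grounding rewards but does not define them. The snippet below is a purely illustrative sketch of what such a reward could look like (answer correctness mixed with spatial and temporal IoU); the weighting and exact formulation are assumptions, not the reward used to train VisionCoach.

```python
# Illustrative only: a generic object-aware grounding reward, NOT the paper's definition.
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

def spatial_iou(a: Box, b: Box) -> float:
    """Intersection-over-union between a predicted and a reference box."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r.x2 - r.x1) * max(0.0, r.y2 - r.y1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def temporal_iou(pred: tuple, ref: tuple) -> float:
    """IoU between predicted and reference (start, end) time spans in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = max(pred[1], ref[1]) - min(pred[0], ref[0])
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_box: Box, ref_box: Box, pred_span: tuple, ref_span: tuple,
                     answer_correct: bool, w_ans=1.0, w_spatial=0.5, w_temporal=0.5) -> float:
    """Hypothetical scalar reward mixing answer accuracy with spatio-temporal grounding."""
    return (w_ans * float(answer_correct)
            + w_spatial * spatial_iou(pred_box, ref_box)
            + w_temporal * temporal_iou(pred_span, ref_span))
```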