Improve model card: Add pipeline tag, library name, links, and detailed content

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +129 -3
README.md CHANGED
@@ -1,3 +1,129 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ pipeline_tag: video-text-to-text
4
+ library_name: transformers
5
+ ---
6
+
7
+ # Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
8
+
9
+ by
10
+ [@marinero4972](https://huggingface.co/marinero4972),
11
+ [Xiangtai Li](https://lxtgh.github.io/),
12
+ [@HaochenWang](https://huggingface.co/HaochenWang),
13
+ [Yue Tan](https://tangent0308.github.io/),
14
+ [Tao Zhang](https://zhang-tao-whu.github.io/),
15
+ [@ldkong](https://huggingface.co/ldkong),
16
+ [Yunhai Tong](https://scholar.google.com/citations?user=T4gqdPkAAAAJ),
17
+ [Anran Wang](https://sites.google.com/view/anranwang/home),
18
+ [Zhiyang Teng](https://scholar.google.com/citations?user=9wOJrf8AAAAJ&hl=zh-CN),
19
+ [Yujing Wang](https://scholar.google.com/citations?user=YgL4rywAAAAJ&hl=zh-CN&oi=ao)
20
+ and
21
+ [Zhuochen Wang](https://scholar.google.com/citations?hl=en&user=RDvwXDsAAAAJ)
22
+
23
+
24
+ [[📖 Paper](https://huggingface.co/papers/2510.20579)] | [[🌟 Project Page](https://marinero4972.github.io/projects/Open-o3-Video/)] | [[💻 Code](https://github.com/marinero4972/Open-o3-Video)] | [[🎥 Introduction](https://youtu.be/gymaTVRy0JY)] | [[🤗 Model](https://huggingface.co/marinero4972/Open-o3-Video/tree/main)] | [[💾 Data](https://huggingface.co/datasets/marinero4972/Open-o3-Video/tree/main)]
25
+
26
+
27
+ **TL;DR**: Open-o3 Video integrates explicit spatio-temporal evidence (key timestamps and bounding boxes) into video reasoning through the curated STGR dataset and a two-stage SFT-RL training strategy, achieving state-of-the-art results on V-STAR and delivering verifiable, reliable reasoning for video understanding.
28
+
29
+ ![](https://github.com/marinero4972/Open-o3-Video/raw/main/assets/teaser.png)
30
+
31
+ **Abstract**: Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce **Open-o3 Video**, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, **STGR-CoT-30k for SFT and STGR-RL-36k for RL**, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the **V-STAR** benchmark, Open-o3 Video achieves **state-of-the-art performance**, raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
32
+
33
+ **Open-o3 Video Model**:
34
+
35
+ Stage 1: Cold-start initialization on STGR-CoT-30k equips the model with basic grounded reasoning.
36
+
37
+ Stage 2: Reinforcement learning with Group Sequence Policy Optimization (GSPO) stabilizes long-horizon optimization. We propose **adaptive temporal proximity** and **temporal gating** in the thinking reward design.
38
+
39
+ ![](https://github.com/marinero4972/Open-o3-Video/raw/main/assets/model.png)
40
+
41
+ # Quick Start
42
+
43
+ ## Environment setup:
44
+
45
+ ```bash
46
+ git clone https://github.com/marinero4972/Open-o3-Video
47
+ cd Open-o3-Video
48
+
49
+ conda create -n open-o3-video python=3.11
50
+ conda activate open-o3-video
51
+ bash setup.sh
52
+ ```
53
+
54
+ ## Data Preparation:
55
+
56
+ To provide unified spatio-temporal supervision for grounded video reasoning, we build two datasets: STGR-CoT-30k for supervised fine-tuning and STGR-RL-36k for reinforcement learning.
57
+
58
+ JSON data download link: [STGR](https://huggingface.co/datasets/marinero4972/Open-o3-Video/tree/main)
59
+
60
+ The overall data structure should be:
61
+ ```sh
62
+ DATA_ROOT
63
+ ├── json_data
64
+ │   └── STGR-RL.json
65
+ │   └── STGR-SFT.json
66
+ └── videos
67
+     └── gqa
68
+     └── stgr
69
+     └── plm
70
+     └── temporal_grounding
71
+     └── timerft
72
+     └── treevgr
73
+     └── tvg_r1
74
+     └── videoespresso
75
+     └── videor1
76
+ ```
77
+
78
+ Set `DATA_ROOT` in [`src/r1-v/configs/data_root.py`](src/r1-v/configs/data_root.py) to your local data path.
79
+
80
+ ## Training:
81
+
82
+ ```bash
83
+ # cold start initialization
84
+ bash ./src/scripts/run_sft_video.sh
85
+
86
+ # reinforcement learning with GSPO
87
+ bash ./src/scripts/run_grpo_video.sh
88
+ ```
89
+
90
+ ## Evaluation:
91
+
92
+ Evaluate on benchmarks:
93
+
94
+ ```bash
95
+ cd eval
96
+ bash ./scripts/eval_all.sh
97
+ ```
98
+
99
+ Inference on examples:
100
+
101
+ ```bash
102
+ cd eval
103
+ python ./inference_example.py
104
+ ```
105
+
106
+ # License
107
+
108
+ This project is licensed under the [Apache-2.0 License](https://github.com/marinero4972/Open-o3-Video/blob/main/LICENSE).
109
+
110
+
111
+ # Citation
112
+
113
+ If you use our work or the implementation in this repo, or find them helpful, please consider citing:
114
+
115
+ ```bibtex
116
+ @article{meng2025open-o3,
117
+ title={Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence},
118
+ author={Jiahao Meng and Xiangtai Li and Haochen Wang and Yue Tan and Tao Zhang and Lingdong Kong and Yunhai Tong and Anran Wang and Zhiyang Teng and Yujing Wang and Zhuochen Wang},
119
+ journal={arXiv preprint arXiv:2510.20579},
120
+ year={2025}
121
+ }
122
+ ```
123
+
124
+ # Acknowledgements
125
+
126
+ We sincerely thank the following projects for their contributions to this work:
127
+
128
+ - [Video-R1](https://github.com/tulerfeng/Video-R1)
129
+ - [R1-V](https://github.com/Deep-Agent/R1-V)
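
The data layout listed under "Data Preparation" can be sanity-checked before training. The following is a minimal sketch, not part of the repository: `check_layout`, `JSON_FILES`, and `VIDEO_DIRS` are hypothetical names, and the code assumes every video folder sits directly under `videos/`, which the listing does not state explicitly. Adjust the paths if your copy nests some folders differently.

```python
from pathlib import Path

# File and folder names copied from the layout listing in the README.
JSON_FILES = ("STGR-RL.json", "STGR-SFT.json")
VIDEO_DIRS = (
    "gqa", "stgr", "plm", "temporal_grounding", "timerft",
    "treevgr", "tvg_r1", "videoespresso", "videor1",
)


def check_layout(data_root):
    """Return the expected paths (relative to data_root) that do not exist.

    Assumption: each video folder is a direct child of `videos/`.
    """
    root = Path(data_root)
    expected = [Path("json_data") / name for name in JSON_FILES]
    expected += [Path("videos") / name for name in VIDEO_DIRS]
    # Report missing entries as POSIX-style strings for readable output.
    return [p.as_posix() for p in expected if not (root / p).exists()]
```

Calling `check_layout("/path/to/DATA_ROOT")` returns an empty list when all listed entries are present; otherwise it returns the missing relative paths, which can be printed before launching the training scripts.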