yhx12/VideoSSR

Tags: Safetensors, qwen3_vl

Improve model card: Add pipeline tag, library name, license, abstract, and project details

by nielsr HF Staff - opened Nov 13, 2025

base: refs/heads/main ← from: refs/pr/1

README.md +44 -1

@@ -1,5 +1,48 @@
 **VideoSSR-8B** is a multimodal large language model (MLLM) fine-tuned from `Qwen-VL-8B-Instruct` for enhanced video understanding. It is trained using a novel **Video Self-Supervised Reinforcement Learning (VideoSSR)** framework, which generates its own high-quality training data directly from videos, eliminating the need for manual annotation.
 - **Base Model:** `Qwen-VL-8B-Instruct`
 - **Paper:** [VideoSSR: Video Self-Supervised Reinforcement Learning](https://arxiv.org/abs/2511.06281)
-- **Code:** [https://github.com/lcqysl/VideoSSR](https://github.com/lcqysl/VideoSSR)

+---
+pipeline_tag: video-text-to-text
+library_name: transformers
+license: apache-2.0
+---
+# VideoSSR: Video Self-Supervised Reinforcement Learning
+[![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B)](https://arxiv.org/abs/2511.06281)
+[![Hugging Face Models](https://img.shields.io/badge/Hugging%20Face-Models-yellow?logo=huggingface)](https://huggingface.co/yhx12/VideoSSR)
+[![Hugging Face Dataset](https://img.shields.io/badge/Hugging%20Face-Dataset-yellow?logo=huggingface)](https://huggingface.co/datasets/yhx12/VideoSSR-30k)
+[![Hugging Face Benchmark](https://img.shields.io/badge/Hugging%20Face-Benchmark-yellow?logo=huggingface)](https://huggingface.co/datasets/yhx12/VIUBench)
 **VideoSSR-8B** is a multimodal large language model (MLLM) fine-tuned from `Qwen-VL-8B-Instruct` for enhanced video understanding. It is trained using a novel **Video Self-Supervised Reinforcement Learning (VideoSSR)** framework, which generates its own high-quality training data directly from videos, eliminating the need for manual annotation.
 - **Base Model:** `Qwen-VL-8B-Instruct`
 - **Paper:** [VideoSSR: Video Self-Supervised Reinforcement Learning](https://arxiv.org/abs/2511.06281)
+- **Code/Project Page:** [https://github.com/lcqysl/VideoSSR](https://github.com/lcqysl/VideoSSR)
+## Paper Abstract
+Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive. This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data? To investigate this, we introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. We construct the Video Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty, revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that VideoSSR consistently enhances model performance, yielding an average improvement of over 5%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs. The code is available at https://github.com/lcqysl/VideoSSR.
+**Authors:** Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng
+## Related Hugging Face Resources
+*   **Dataset:** [VideoSSR-30K](https://huggingface.co/datasets/yhx12/VideoSSR-30k)
+*   **Benchmark:** [VIUBench](https://huggingface.co/datasets/yhx12/VIUBench)
+## Model Details
+VideoSSR is a novel framework designed to enhance the video understanding capabilities of Multimodal Large Language Models (MLLMs). Instead of relying on prohibitively expensive manually annotated data or biased model-annotated data, VideoSSR harnesses the rich, intrinsic information within videos to generate high-quality, verifiable training data. We introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. Building upon these tasks, we construct the VideoSSR-30K dataset and train models with Reinforcement Learning with Verifiable Rewards (RLVR), establishing a potent foundational framework for developing more advanced video understanding in MLLMs.
+### Pretext Tasks
+![](https://raw.githubusercontent.com/lcqysl/VideoSSR/main/assets/pretext_tasks.png)
+### VIUBench
+To rigorously test the capabilities of modern MLLMs on fundamental video understanding, we introduce the **V**ideo **I**ntrinsic **U**nderstanding **Bench**mark (**VIUBench**). This benchmark is systematically constructed from our three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. It specifically evaluates a model's ability to reason about intrinsic video properties—such as temporal coherence and fine-grained details—independent of external annotations. Our results show that VIUBench poses a significant challenge even for the most advanced models, highlighting a critical area for improvement and validating the effectiveness of our approach.
+![](https://raw.githubusercontent.com/lcqysl/VideoSSR/main/assets/VIUBench.png)
+### Performance Highlights
+![](https://raw.githubusercontent.com/lcqysl/VideoSSR/main/assets/performance1.png)
+![](https://raw.githubusercontent.com/lcqysl/VideoSSR/main/assets/performance2.png)
+## Acknowledgement
+This work builds on **[verl](https://github.com/volcengine/verl)**. We also thank the authors of **[Visual Jigsaw](https://github.com/penghao-wu/visual_jigsaw)** for the inspiration.
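
The README in this diff names Temporal Jigsaw as one of the three self-supervised pretext tasks and describes the training data as verifiable, but it does not spell out how a sample is built. The sketch below is one plausible reading, not the authors' implementation: it assumes a jigsaw sample is made by cutting a video's frame indices into contiguous segments, shuffling them, and using the permutation as the answer, with an exact-match reward. The function names `make_jigsaw_sample` and `jigsaw_reward`, the segment count, and the 0/1 reward are all hypothetical choices made for illustration.

```python
import random
from typing import List, Tuple


def make_jigsaw_sample(
    num_frames: int, num_segments: int, seed: int
) -> Tuple[List[range], List[int]]:
    """Build one hypothetical temporal-jigsaw sample from frame indices.

    Returns the shuffled segments and the answer: for each shuffled
    position, the 0-based index that segment had in the original video.
    """
    if not 2 <= num_segments <= num_frames:
        raise ValueError("need 2 <= num_segments <= num_frames")

    # Contiguous, non-empty segments covering every frame exactly once.
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    segments = [range(bounds[i], bounds[i + 1]) for i in range(num_segments)]

    # Shuffle until the order differs from the original, so the task is
    # never trivially "already sorted".
    rng = random.Random(seed)
    identity = list(range(num_segments))
    order = identity.copy()
    while order == identity:
        rng.shuffle(order)

    shuffled = [segments[i] for i in order]
    return shuffled, order


def jigsaw_reward(predicted: List[int], answer: List[int]) -> float:
    """Assumed verifiable reward: 1.0 for an exact match, else 0.0."""
    return 1.0 if predicted == answer else 0.0


shuffled, answer = make_jigsaw_sample(num_frames=16, num_segments=4, seed=0)
```

Because the answer is derived from the shuffle itself, no human label is needed and the reward can be checked mechanically, which is the property the README attributes to its self-generated data.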