Video-Text-to-Text
Transformers
Safetensors
English
qwen2_5_vl
image-text-to-text
video-understanding
reasoning
multimodal
reinforcement-learning
question-answering
text-generation-inference
Instructions to use Falconss1/VideoThinker-R1-3B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Falconss1/VideoThinker-R1-3B with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("Falconss1/VideoThinker-R1-3B") model = AutoModelForImageTextToText.from_pretrained("Falconss1/VideoThinker-R1-3B") - Notebooks
- Google Colab
- Kaggle
Improve model card and link to paper (#1)
Browse files- Improve model card and link to paper (14e95849f0d63c55636215399467fc62daa6d05c)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md
CHANGED
|
@@ -8,30 +8,42 @@ datasets:
|
|
| 8 |
- MVBench
|
| 9 |
- TempCompass
|
| 10 |
- Video-MME
|
| 11 |
-
language:
|
|
|
|
|
|
|
| 12 |
license: mit
|
|
|
|
| 13 |
tags:
|
| 14 |
- video-understanding
|
| 15 |
- reasoning
|
| 16 |
- multimodal
|
| 17 |
- reinforcement-learning
|
| 18 |
- question-answering
|
| 19 |
-
library_name: transformers
|
| 20 |
-
pipeline_tag: video-text-to-text
|
| 21 |
---
|
| 22 |
|
| 23 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
| 24 |
|
| 25 |
-
The
|
|
|
|
|
|
|
| 26 |
|
| 27 |
-
|
|
|
|
|
|
|
|
|
|
| 28 |
|
| 29 |
-
|
|
|
|
|
|
|
| 30 |
|
| 31 |
-
|
|
|
|
| 32 |
|
| 33 |
-
|
| 34 |
-
```BibTeX
|
| 35 |
@inproceedings{wu2026videothinker,
|
| 36 |
title={Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs},
|
| 37 |
author={Wu, Jingze and Zhang, Quan and Suo, Hongfei and Cai, Zeqiang and Chen, Hongbo},
|
|
|
|
| 8 |
- MVBench
|
| 9 |
- TempCompass
|
| 10 |
- Video-MME
|
| 11 |
+
language:
|
| 12 |
+
- en
|
| 13 |
+
library_name: transformers
|
| 14 |
license: mit
|
| 15 |
+
pipeline_tag: video-text-to-text
|
| 16 |
tags:
|
| 17 |
- video-understanding
|
| 18 |
- reasoning
|
| 19 |
- multimodal
|
| 20 |
- reinforcement-learning
|
| 21 |
- question-answering
|
|
|
|
|
|
|
| 22 |
---
|
| 23 |
|
| 24 |
+
# VideoThinker-R1-3B
|
| 25 |
+
|
| 26 |
+
[**Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs**](https://huggingface.co/papers/2605.01324)
|
| 27 |
+
|
| 28 |
+
VideoThinker is a causal-inspired framework that enables lightweight multimodal language models (3B parameters) to achieve robust video reasoning. It addresses the phenomenon of "perceptual bias," where reinforcement learning can compel lightweight models to adopt perceptual shortcuts from data rather than developing genuine reasoning abilities.
|
| 29 |
|
| 30 |
+
The framework employs a two-stage debiasing process:
|
| 31 |
+
1. **Bias Aware Training**: Forges a dedicated "bias model" to embody shortcut behaviors.
|
| 32 |
+
2. **Causal Debiasing Policy Optimization (CDPO)**: Fine-tunes the primary model using a repulsive objective to push it away from the bias model's flawed logic.
|
| 33 |
|
| 34 |
+
## Performance
|
| 35 |
+
VideoThinker-R1 establishes a new state-of-the-art in video reasoning efficiency. Using only 1K training samples and no Supervised Fine-Tuning (SFT), it:
|
| 36 |
+
- Surpasses VideoRFT-3B by 7% on VideoMME.
|
| 37 |
+
- Outperforms larger models (e.g., Video-UTR-7B) on reasoning-heavy benchmarks like MVBench and TempCompass.
|
| 38 |
|
| 39 |
+
## Resources
|
| 40 |
+
- **Code**: [GitHub - falonss703/VideoThinker](https://github.com/falonss703/VideoThinker)
|
| 41 |
+
- **Paper**: [Hugging Face Papers](https://huggingface.co/papers/2605.01324)
|
| 42 |
|
| 43 |
+
## Citation
|
| 44 |
+
If you find this project useful in your research, please consider citing:
|
| 45 |
|
| 46 |
+
```bibtex
|
|
|
|
| 47 |
@inproceedings{wu2026videothinker,
|
| 48 |
title={Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs},
|
| 49 |
author={Wu, Jingze and Zhang, Quan and Suo, Hongfei and Cai, Zeqiang and Chen, Hongbo},
|