Improve model card and link to paper

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +22 -10
README.md CHANGED
@@ -8,30 +8,42 @@ datasets:
  - MVBench
  - TempCompass
  - Video-MME
- language: en
+ language:
+ - en
+ library_name: transformers
  license: mit
+ pipeline_tag: video-text-to-text
  tags:
  - video-understanding
  - reasoning
  - multimodal
  - reinforcement-learning
  - question-answering
- library_name: transformers
- pipeline_tag: video-text-to-text
  ---
 
- # Paper abstract
+ # VideoThinker-R1-3B
+
+ [**Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs**](https://huggingface.co/papers/2605.01324)
+
+ VideoThinker is a causal-inspired framework that enables lightweight multimodal language models (3B parameters) to achieve robust video reasoning. It addresses the phenomenon of "perceptual bias," where reinforcement learning can compel lightweight models to adopt perceptual shortcuts from data rather than developing genuine reasoning abilities.
 
- The abstract of the paper is the following:
+ The framework employs a two-stage debiasing process:
+ 1. **Bias Aware Training**: Forges a dedicated "bias model" to embody shortcut behaviors.
+ 2. **Causal Debiasing Policy Optimization (CDPO)**: Fine-tunes the primary model using a repulsive objective to push it away from the bias model's flawed logic.
 
- Although reinforcement learning (RL) has significantly advanced reasoning capabilities in large multimodal language models (MLLMs), its efficacy remains limited for lightweight models essential for edge deployments. To address this issue, we leverage causal analysis and experiment to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases, rather than developing genuine reasoning abilities. Motivated by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. First, the Bias Aware Training stage forges a dedicated "bias model" to embody these shortcut behaviors. Then, the Causal Debiasing Policy Optimization (CDPO) algorithm fine-tunes the primary model, employing an innovative repulsive objective to actively push it away from the bias model's flawed logic while simultaneously pulling it toward correct, generalizable solutions. Our model, VideoThinker-R1, establishes a new state-of-the-art in video reasoning efficiency. For same-scale comparison, requiring no Supervised Fine-Tuning (SFT) and using only 1 of the training data for RL, it surpasses VideoRFT-3B with a 3.2% average gain on widely-used benchmarks and a 7% lead on VideoMME. For cross-scale comparison, it outperforms the larger Video-UTR-7B model on multiple benchmarks, including a 2.1% gain on MVBench and a 3.8% gain on TempCompass.
+ ## Performance
+ VideoThinker-R1 establishes a new state-of-the-art in video reasoning efficiency. Using only 1K training samples and no Supervised Fine-Tuning (SFT), it:
+ - Surpasses VideoRFT-3B by 7% on VideoMME.
+ - Outperforms larger models (e.g., Video-UTR-7B) on reasoning-heavy benchmarks like MVBench and TempCompass.
 
- This repository contains the model as presented in "Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs".
+ ## Resources
+ - **Code**: [GitHub - falonss703/VideoThinker](https://github.com/falonss703/VideoThinker)
+ - **Paper**: [Hugging Face Papers](https://huggingface.co/papers/2605.01324)
 
- For training and evaluation, please refer to the Code: https://github.com/falonss703/VideoThinker
+ ## Citation
+ If you find this project useful in your research, please consider citing:
 
- If you find this project useful in your research, please consider cite:
- ```BibTeX
+ ```bibtex
  @inproceedings{wu2026videothinker,
  title={Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs},
  author={Wu, Jingze and Zhang, Quan and Suo, Hongfei and Cai, Zeqiang and Chen, Hongbo},
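
The diff above describes CDPO only in prose: an attractive pull toward correct solutions combined with a repulsive push away from the bias model. The paper's actual loss is not given in this card, so the following is purely an illustrative sketch under assumed names and forms — `cdpo_loss`, the cross-entropy attraction term, the KL-based repulsion term, and the `lam` weight are all assumptions, not the authors' implementation:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cdpo_loss(policy_logits, bias_logits, correct_idx, lam=0.1):
    """Hypothetical CDPO-style objective (not the paper's formula):
    - attractive term: cross-entropy pulling the policy toward the
      ground-truth answer index;
    - repulsive term: rewards KL divergence from the bias model, so
      minimizing the loss pushes the policy off the shortcut distribution.
    """
    p = softmax(policy_logits)
    q = softmax(bias_logits)
    attract = -math.log(p[correct_idx] + 1e-12)
    # KL(p || q): how far the policy has moved from the bias model
    repel = sum(pi * math.log((pi + 1e-12) / (qi + 1e-12)) for pi, qi in zip(p, q))
    return attract - lam * repel

# A policy that mimics the bias model's shortcut (wrong answer) scores worse
# than one that is confident on the correct option and diverges from the bias.
l_shortcut = cdpo_loss([0.0, 2.0, 0.0], bias_logits=[0.0, 2.0, 0.0], correct_idx=0)
l_debiased = cdpo_loss([3.0, 0.0, 0.0], bias_logits=[0.0, 2.0, 0.0], correct_idx=0)
```

In this toy form, putting probability mass on the correct option lowers the attractive term, and diverging from the bias model's distribution lowers the loss further through the KL term; `lam` would trade off the two pressures.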