---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---

**<center><span style="font-size:2em;">TinyLLaVA-Video-R1</span></center>**

[![arXiv](https://img.shields.io/badge/Arxiv-2504.09641-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2504.09641) [![Github](https://img.shields.io/badge/Github-Github-blue.svg)](https://github.com/ZhangXJ199/TinyLLaVA-Video-R1)

We introduce TinyLLaVA-Video-R1, a small-scale video reasoning model built on the traceably trained model [TinyLLaVA-Video](https://github.com/ZhangXJ199/TinyLLaVA-Video). After reinforcement learning on general Video-QA datasets, the model shows significantly improved reasoning and thinking abilities and also exhibits emergent "aha moments".

### Results

| Model (HF Path) | Video-MME | MVBench | MLVU | MMVU |
| :---: | --- | --- | --- | --- |
| [Zhang199/TinyLLaVA-Video-R1](https://huggingface.co/Zhang199/TinyLLaVA-Video-R1) | 46.6 | 49.5 | 52.4 | 46.9 |
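
For a quick side-by-side reading of the four scores, the snippet below is a minimal sketch that copies the values from the table and computes their unweighted mean. The mean is only an illustrative summary and is not a metric reported by the authors; the benchmarks differ in size and difficulty, so it should be read with that in mind. The names `scores`, `mean_score`, and `best_benchmark` are hypothetical.

```python
from statistics import mean

# Scores copied from the results table above (one value per benchmark).
scores = {
    "Video-MME": 46.6,
    "MVBench": 49.5,
    "MLVU": 52.4,
    "MMVU": 46.9,
}

# Unweighted mean across benchmarks: an illustrative summary only,
# not a number reported in the model card.
mean_score = mean(scores.values())

# Benchmark on which the model scores highest.
best_benchmark = max(scores, key=scores.get)

print(f"mean over {len(scores)} benchmarks: {mean_score:.2f}")
print(f"highest score: {best_benchmark} ({scores[best_benchmark]})")
```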