Vchitect/ShotVL-3B

Alexislhb committed on Jun 30, 2025

Commit 1517266 · verified · 1 Parent(s): 3e703ac

Update README.md

Files changed (1)

README.md +1 -2

@@ -6,13 +6,12 @@ tags:
 - vision-language
 - cinematography
 - shotbench
-- arxiv:2506.21356
 ---
 ## Model description
 This model is a fine-tuned version of [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), trained by supervised fine-tuning and GRPO on the largest and high-quality dataset for cinematic language understanding to date. It currently achieves state-of-the-art performance on [ShotBench](https://vchitect.github.io/ShotBench-project/),  a comprehensive benchmark for evaluating cinematography understanding in vision-language models.
+Please visit our [paper](https://arxiv.org/abs/2506.21356) for more details.
 ###  Demo Code
 **Image**