
Enhance model card: Add metadata, GitHub link, and tags

#1
Opened by nielsr (HF Staff)
Files changed (1): README.md (+14, -3)
README.md CHANGED

@@ -1,12 +1,21 @@
 ---
 license: cc-by-sa-4.0
+pipeline_tag: video-text-to-text
+library_name: transformers
+tags:
+- llava
+- qwen
+- multilingual
+- video-understanding
 ---
+
 # ViMUL: A Culturally-diverse Multilingual Multimodal Video Model
 
 [![🤗 Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue)](https://huggingface.co/MBZUAI/ViMUL)
 [![📄 Paper](https://img.shields.io/badge/📄-Paper-red)](https://huggingface.co/papers/2506.07032)
 [![🌐 Project Page](https://img.shields.io/badge/🌐-Project%20Page-green)](https://mbzuai-oryx.github.io/ViMUL/)
 [![📊 Benchmark](https://img.shields.io/badge/📊-ViMUL--Bench-orange)](https://huggingface.co/datasets/MBZUAI/ViMUL-Bench)
+[![GitHub](https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github)](https://github.com/mbzuai-oryx/ViMUL)
 
 ## Overview
 ViMUL is a multilingual video Large Multimodal Model (LMM) designed to provide better tradeoffs between high and low-resource languages for video understanding. The model is trained on a machine-translated multilingual video training set comprising 1.2 million samples and demonstrates improved performance across culturally diverse video content in multiple languages.
@@ -75,7 +84,8 @@ def infer(
     video = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].half().to(device)
     video = [video]
 
-    qs = DEFAULT_IMAGE_TOKEN + "\n" + prompt
+    qs = DEFAULT_IMAGE_TOKEN + "
+" + prompt
     conv = conv_templates[conv_mode].copy() if conv_mode else conv_templates["default"].copy()
     conv.append_message(conv.roles[0], qs)
     conv.append_message(conv.roles[1], None)
@@ -115,7 +125,8 @@ if __name__ == "__main__":
     prompt = "Describe what happens in the video."
     conv_mode = "qwen_1_5"
     output = infer(model_path, video_path, prompt, conv_mode=conv_mode)
-    print("\n")
+    print("
+")
     print("="*40)
     print("Output:", output)
     print("="*40)
@@ -132,4 +143,4 @@ if __name__ == "__main__":
   primaryClass={cs.CL},
   url={https://arxiv.org/abs/2506.07032},
 }
-```
+```
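For anyone reviewing the inference snippet touched by this change, the prompt-construction step in `infer()` can be sketched in isolation. `DEFAULT_IMAGE_TOKEN` and the Qwen-style chat markup below are illustrative stand-ins for the values defined in the llava codebase, not its exact API:

```python
# Minimal sketch of the prompt construction used in the README's infer()
# function. DEFAULT_IMAGE_TOKEN and the chat-template shape are assumptions
# for illustration; the real llava conv_templates may differ.
DEFAULT_IMAGE_TOKEN = "<image>"


def build_prompt(user_prompt: str) -> str:
    # The image placeholder is separated from the user text by the
    # two-character escape "\n" inside the source string, not by a
    # literal line break in the .py file.
    qs = DEFAULT_IMAGE_TOKEN + "\n" + user_prompt
    # Qwen-1.5-style chat framing (assumed shape, for illustration only).
    return (
        "<|im_start|>user\n" + qs + "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_prompt("Describe what happens in the video."))
```

Note that if an editor converts the `"\n"` escape into a real line break, the string literal becomes unterminated and the README's Python snippet stops being copy-paste runnable.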