---
library_name: transformers
tags:
- multimodal
- multilingual
- vlm
- translation
language:
- en
- de
- nl
- es
- fr
- pt
- uk
- hi
- zh
- ru
- cs
- ko
- ja
- it
- pl
- ro
- nb
- nn
base_model:
- utter-project/TowerVision-9B
pipeline_tag: video-text-to-text
license: cc-by-nc-sa-4.0
---
# Model Card for TowerVideo
TowerVision is a family of open-source multilingual vision-language models optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. **TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally-aware tasks**, demonstrating exceptional performance across **20 languages and dialects**.
This model card covers the TowerVision family, including the 2B and 9B parameter versions, each available in an instruction-tuned (it) variant and a pretrained (pt) variant that has not undergone instruction tuning.
- **Point of Contact**: X (add some email here)
- **License**: CC BY-NC-SA 4.0
- **Model Family**: TowerVision (2B, 9B variants)
- **Context length**: 8192 tokens
- **Languages**: 20 languages and dialects across European, Asian, and other language families
🌟 Try TowerVision: [Project Page](https://guilhermeviveiros.github.io/TowerVision.io/) | [Code Repository](https://github.com/GuilhermeViveiros/LLaVA-NeXT)
## Available Models
| Model | Parameters | HF Link |
|-------|------------|---------|
| TowerVideo-2B | 2B | [🤗 utter-project/TowerVideo-2B](https://huggingface.co/utter-project/TowerVideo-2B) |
| TowerVideo-9B | 9B | [🤗 utter-project/TowerVideo-9B](https://huggingface.co/utter-project/TowerVideo-9B) |
## How to Use TowerVision
### Quick Start with Transformers
```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Load the model in bfloat16 (recommended precision)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "utter-project/TowerVideo-2B",  # or "utter-project/TowerVideo-9B"
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("utter-project/TowerVideo-2B")

# Use your local video
video_path = "your_video_path.mp4"

# Conversation using the chat template
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Describe this video."},  # example prompt; replace with your own
        ],
    },
]

# Tokenize the conversation and preprocess the sampled video frames
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.bfloat16)

# Generate and decode the response
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
## Model Details
**Input**: The model accepts text, image, and video inputs.
**Output**: The model generates text in multiple languages.
**Model Architecture**: TowerVideo pairs a multilingual language model based on [Tower-Plus](https://huggingface.co/utter-project/TowerVision-2B) (2B and 9B parameters) with the [SigLIP2-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder through a multimodal adapter for vision-language understanding.
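As a quick way to verify this setup, the checkpoint's configuration exposes the vision and language sub-configurations; a minimal sketch, assuming the standard LlavaOnevision-style config layout from `transformers`:
```python
from transformers import AutoConfig

# Inspect the two halves of the architecture (assumes the standard
# LlavaOnevision-style config with vision_config / text_config fields)
config = AutoConfig.from_pretrained("utter-project/TowerVideo-2B")
print(config.vision_config.model_type)  # vision encoder backbone
print(config.text_config.model_type)    # multilingual language backbone
```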
**Recommended Precision**: We recommend using `bfloat16` precision for optimal performance and memory efficiency when running TowerVision models.
**Languages Covered**: The model has been trained on **20 languages and dialects**:
- **European languages**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian (Bokmål & Nynorsk)
- **Asian languages**: Chinese (Simplified & Traditional), Japanese, Korean, Hindi
- **Other languages**: Russian, Ukrainian
**Key Strengths**:
- **🏆 Exceptional performance on culturally-aware benchmarks** with deep understanding of cultural contexts and visual nuances
- **📊 Strong cross-lingual transfer capabilities** across diverse vision-language tasks
## Training Data
TowerVideo models are trained on the video/text subset of **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:
| Dataset | Samples | HF Link | Status |
|---------|---------|---------|--------|
| VisionBlocks | 6.31M | [🤗 utter-project/VisionBlocks](https://huggingface.co/datasets/utter-project/VisionBlocks) | Coming Soon |
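Once the dataset is published on the Hub, it should be loadable through the standard `datasets` API; a minimal sketch, assuming a `train` split (the release is still marked as coming soon):
```python
from datasets import load_dataset

# Load the VisionBlocks mixture from the Hub
# (split name is an assumption; the dataset is not yet released)
vision_blocks = load_dataset("utter-project/VisionBlocks", split="train")
print(vision_blocks[0])
```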
### Dataset Statistics
- **Total samples**: 6.31M
- **Created by our team**: 1.21M samples (~19%)
- **Human-collected/external**: 5.10M samples (~81%)
### Dataset Composition Overview
**VisionBlocks** contains samples across multiple categories with both English-only (63.1%) and multilingual (36.9%) data:
- **Chart/Plot Reasoning**: DVQA, ChartQA, PlotQA, TabMWP (~405K samples)
- **General VQA**: VQAv2, RLAIF-4V (~488K samples)
- **Document VQA**: DocVQA, TextVQA, ST-VQA, PixMo-Docs (~46K samples)
- **Reasoning/Knowledge**: A-OKVQA, OKVQA, AI2D, ScienceQA (~29K samples)
- **Multilingual/Cultural**: Pangea-Cultural, Pangea-Multi, PixMo-Cap-Translated, CulturalGround datasets (~1.6M samples)
- **Specialized VQA**: IconQA, InfographicVQA, Stratos (~34K samples)
- **Counting/Math**: TallyQA, PixMo-Count (~107K samples)
- **Vision/Text**: VBlocks-PixMo collections, EuroBlocks-SFT (~2.2M samples)
- **Video/Text**: LLaVA-Video collections (~1.4M samples)
**Collection Types**: Human-annotated, synthetically generated, and professionally translated data ensuring high quality and cultural diversity across 20+ languages.
## Evaluation
All evaluations were conducted using [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
### Multi-Purpose Multimodal Benchmarks
TowerVision demonstrates strong performance across diverse multimodal evaluation benchmarks.
### Multimodal Multilingual Translation Tasks
TowerVision excels particularly in multimodal multilingual translation benchmarks, demonstrating state-of-the-art cross-lingual visual communication capabilities.
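As an illustration of this use case, the same chat API shown in the quick start can be used to request a translation of a video's content; a minimal sketch, where the prompt wording is an assumption rather than an official template:
```python
# Build a translation-style request; reuse processor.apply_chat_template(...)
# and model.generate(...) from the quick start to run it.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "your_video_path.mp4"},
            # Example translation instruction (assumed wording)
            {"type": "text", "text": "Translate the speech and on-screen text in this video into Portuguese."},
        ],
    },
]
```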
### Supported Languages Performance
✅ **Fully Supported**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian, Chinese, Japanese, Korean, Hindi, Russian, Ukrainian
📊 **Benchmark Coverage**: Our models are evaluated across diverse multilingual vision-language tasks, demonstrating strong cross-lingual transfer capabilities and exceptional performance in culturally-aware benchmarks.
## Citation
If you find TowerVideo useful in your research, please consider citing the following paper:
```bibtex
@article{towervision2025,
title={Understanding and Improving Multilinguality in Vision-Language Models},
author={[Authors to be added]},
journal={[Journal to be added]},
year={2025},
note={Paper in preparation}
}
```
## Model Card Contact
For errors or additional questions about details in this model card, contact the research team.
## Acknowledgments
TowerVideo builds upon the excellent work of:
- **[LLaVA-NeXT](https://github.com/GuilhermeViveiros/LLaVA-NeXT)** for the foundational vision-language architecture
- **[TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)** for the multilingual vision-language base model
- **[SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384)** for robust vision encoding
- The broader multilingual NLP and multimodal communities