Video-Text-to-Text
Transformers
Safetensors
llava_onevision
image-text-to-text
multimodal
multilingual
vlm
translation
Instructions to use utter-project/TowerVideo-2B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use utter-project/TowerVideo-2B with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("utter-project/TowerVideo-2B") model = AutoModelForImageTextToText.from_pretrained("utter-project/TowerVideo-2B") - Notebooks
- Google Colab
- Kaggle
Guilherme Viveiros commited on
Update README.md
Browse files
README.md
CHANGED
|
@@ -33,7 +33,7 @@ license: cc-by-nc-sa-4.0
|
|
| 33 |
# Model Card for TowerVideo
|
| 34 |
|
| 35 |
<p align="left">
|
| 36 |
-
<img src="Tower.png" alt="TowerVision Logo" width="
|
| 37 |
</p>
|
| 38 |
|
| 39 |
TowerVision is a family of open-source multilingual vision-language models with strong capabilities optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. **TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally-aware tasks**, demonstrating exceptional performance across **20 languages and dialects**.
|
|
|
|
| 33 |
# Model Card for TowerVideo
|
| 34 |
|
| 35 |
<p align="left">
|
| 36 |
+
<img src="Tower.png" alt="TowerVision Logo" width="200">
|
| 37 |
</p>
|
| 38 |
|
| 39 |
TowerVision is a family of open-source multilingual vision-language models with strong capabilities optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. **TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally-aware tasks**, demonstrating exceptional performance across **20 languages and dialects**.
|