Video-Text-to-Text
Transformers
Safetensors
llava_onevision
image-text-to-text
multimodal
multilingual
vlm
translation
Instructions to use utter-project/TowerVideo-2B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use utter-project/TowerVideo-2B with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("utter-project/TowerVideo-2B") model = AutoModelForImageTextToText.from_pretrained("utter-project/TowerVideo-2B") - Notebooks
- Google Colab
- Kaggle
Guilherme Viveiros commited on
Update README.md
Browse files
README.md
CHANGED
|
@@ -191,12 +191,14 @@ TowerVision excels particularly in multimodal multilingual translation benchmark
|
|
| 191 |
If you find TowerVideo useful in your research, please consider citing the following paper:
|
| 192 |
|
| 193 |
```bibtex
|
| 194 |
-
@
|
| 195 |
-
|
| 196 |
-
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
|
|
|
|
|
|
| 200 |
}
|
| 201 |
```
|
| 202 |
|
|
|
|
| 191 |
If you find TowerVideo useful in your research, please consider citing the following paper:
|
| 192 |
|
| 193 |
```bibtex
|
| 194 |
+
@misc{viveiros2025towervisionunderstandingimprovingmultilinguality,
|
| 195 |
+
title={TowerVision: Understanding and Improving Multilinguality in Vision-Language Models},
|
| 196 |
+
author={André G. Viveiros and Patrick Fernandes and Saul Santos and Sonal Sannigrahi and Emmanouil Zaranis and Nuno M. Guerreiro and Amin Farajian and Pierre Colombo and Graham Neubig and André F. T. Martins},
|
| 197 |
+
year={2025},
|
| 198 |
+
eprint={2510.21849},
|
| 199 |
+
archivePrefix={arXiv},
|
| 200 |
+
primaryClass={cs.LG},
|
| 201 |
+
url={https://arxiv.org/abs/2510.21849},
|
| 202 |
}
|
| 203 |
```
|
| 204 |
|