Update README.md
README.md CHANGED
@@ -3,8 +3,6 @@ library_name: transformers
 tags:
 - multimodal
 - multilingual
-- llm
-- vision
 - vlm
 - translation
 language:
@@ -27,14 +25,15 @@ language:
 - nb
 - nn
 base_model:
--
-pipeline_tag:
+- utter-project/TowerVideo-2B
+pipeline_tag: video-text-to-text
+license: cc-by-nc-sa-4.0
 ---

-# Model Card for
+# Model Card for TowerVideo

-<p align="
-<img src="Tower.png" alt="TowerVision Logo" width="
+<p align="left">
+<img src="Tower.png" alt="TowerVision Logo" width="300">
 </p>

 TowerVision is a family of open-source multilingual vision-language models with strong capabilities optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. **TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally-aware tasks**, demonstrating exceptional performance across **20 languages and dialects**.
@@ -55,12 +54,8 @@ This model card covers the TowerVision family, including the 2B and 9B parameter

 | Model | Parameters | HF Link |
 |-------|------------|---------|
-
-
-| TowerVision-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)
-| TowerVision-9B-pt | 9B | [🤗 utter-project/TowerVision-9B-pt](https://huggingface.co/utter-project/TowerVision-9B-pt)
-| TowerVideo-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVision-2B)
-| TowerVideo-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)
+| TowerVideo-2B | 2B | [🤗 utter-project/TowerVideo-2B](https://huggingface.co/utter-project/TowerVideo-2B)
+| TowerVideo-9B | 9B | [🤗 utter-project/TowerVideo-9B](https://huggingface.co/utter-project/TowerVideo-9B)

 ## How to Use TowerVision

@@ -75,12 +70,12 @@ from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

 # Load the model in half-precision
 model = LlavaOnevisionForConditionalGeneration.from_pretrained(
-    "utter-project/TowerVideo-
+    "utter-project/TowerVideo-2B",
     device_map="auto"
 )

 processor = AutoProcessor.from_pretrained(
-    "utter-project/TowerVideo-
+    "utter-project/TowerVideo-2B"
 )

 # Use your local video
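The snippet in this hunk breaks off at `# Use your local video`. As a minimal sketch of how that section plausibly continues, assuming the standard LLaVA-OneVision video interface in `transformers` and PyAV for frame decoding; the `read_video_frames` helper, the 8-frame sampling, the `video.mp4` path, and the prompt text are illustrative assumptions, not taken from the card:

```python
import av
import numpy as np

# Decode a handful of evenly spaced RGB frames from a local video (PyAV).
def read_video_frames(path, num_frames=8):
    container = av.open(path)
    total = container.streams.video[0].frames
    keep = set(np.linspace(0, total - 1, num_frames).astype(int).tolist())
    frames = [
        frame.to_ndarray(format="rgb24")
        for i, frame in enumerate(container.decode(video=0))
        if i in keep
    ]
    return np.stack(frames)

clip = read_video_frames("video.mp4")  # assumed local file

# LLaVA-OneVision-style chat prompt with a video placeholder.
conversation = [
    {"role": "user", "content": [
        {"type": "video"},
        {"type": "text", "text": "Describe this video."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Preprocess, move inputs to the model's device/dtype, generate, and decode.
inputs = processor(videos=list(clip), text=prompt, return_tensors="pt").to(
    model.device, model.dtype
)
output = model.generate(**inputs, max_new_tokens=128)
decoded = processor.batch_decode(output, skip_special_tokens=True)
print(decoded)
```

This reuses the `model` and `processor` objects loaded in the hunk above and ends at the same `print(decoded)` the next hunk shows as context.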
@@ -129,7 +124,7 @@ print(decoded)

 **Output**: Model generates text in multiple languages.

-**Model Architecture**:
+**Model Architecture**: TowerVideo pairs a multilingual language model based on [Tower-Plus](https://huggingface.co/utter-project/TowerVision-2B) (2B and 9B parameters) with the [SigLIP2-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder through a multimodal adapter for vision-language understanding.

 **Recommended Precision**: We recommend using `bfloat16` precision for optimal performance and memory efficiency when running TowerVision models.

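One note on that recommendation: the load call in the earlier hunk is commented "half-precision" but, as rendered in this diff, passes no dtype, so the weights would default to `float32`. A minimal sketch of an explicit `bfloat16` load, assuming only the documented `torch_dtype` argument of `from_pretrained`:

```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Load TowerVideo in the precision this card recommends (bfloat16).
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "utter-project/TowerVideo-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("utter-project/TowerVideo-2B")
```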
@@ -144,7 +139,7 @@ print(decoded)

 ## Training Data

-TowerVision models are trained on **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:
+TowerVision models are trained on a video/text subset of **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:

 | Dataset | Samples | HF Link | |
 |---------|---------|---------|-------|
@@ -211,16 +206,10 @@ If you find TowerVideo useful in your research, please consider citing the follo

 For errors or additional questions about details in this model card, contact the research team.

-## Terms of Use
-
-We hope that the release of this model will make community-based research efforts more accessible by releasing the weights of highly performant multilingual vision-language models to researchers all over the world.
-
-This model is governed by the Apache 2.0 License.
-
 ## Acknowledgments

 TowerVision builds upon the excellent work of:
 - **[LLaVA-NeXT](https://github.com/GuilhermeViveiros/LLaVA-NeXT)** for the foundational vision-language architecture
-- **[
+- **[TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)** for its multilingual vision-language capabilities
 - **[SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384)** for robust vision encoding
 - The broader multilingual NLP and multimodal communities