Update README.md
README.md CHANGED
@@ -3,8 +3,6 @@ library_name: transformers
 tags:
 - multimodal
 - multilingual
-- llm
-- vision
 - vlm
 - translation
 language:
@@ -27,14 +25,15 @@ language:
 - nb
 - nn
 base_model:
--
-pipeline_tag:
+- utter-project/TowerVideo-2B
+pipeline_tag: video-text-to-text
+license: cc-by-nc-sa-4.0
 ---

-# Model Card for
+# Model Card for TowerVideo

-<p align="
-<img src="Tower.png" alt="TowerVision Logo" width="
+<p align="left">
+<img src="Tower.png" alt="TowerVision Logo" width="300">
 </p>

 TowerVision is a family of open-source multilingual vision-language models with strong capabilities optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. **TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally-aware tasks**, demonstrating exceptional performance across **20 languages and dialects**.
@@ -55,12 +54,8 @@ This model card covers the TowerVision family, including the 2B and 9B parameter

 | Model | Parameters | HF Link |
 |-------|------------|---------|
-
-
-| TowerVision-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)
-| TowerVision-9B-pt | 9B | [🤗 utter-project/TowerVision-9B-pt](https://huggingface.co/utter-project/TowerVision-9B-pt)
-| TowerVideo-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVision-2B)
-| TowerVideo-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)
+| TowerVideo-2B | 2B | [🤗 utter-project/TowerVideo-2B](https://huggingface.co/utter-project/TowerVideo-2B)
+| TowerVideo-9B | 9B | [🤗 utter-project/TowerVideo-9B](https://huggingface.co/utter-project/TowerVideo-9B)

 ## How to Use TowerVision

@@ -75,12 +70,12 @@ from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

 # Load the model in half-precision
 model = LlavaOnevisionForConditionalGeneration.from_pretrained(
-    "utter-project/TowerVideo-
+    "utter-project/TowerVideo-2B",
     device_map="auto"
 )

 processor = AutoProcessor.from_pretrained(
-    "utter-project/TowerVideo-
+    "utter-project/TowerVideo-2B"
 )

 # Use your local video
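The snippet in this hunk breaks off at `# Use your local video`. As a minimal sketch of how that section plausibly continues, assuming the standard LLaVA-OneVision video interface in `transformers` and PyAV for frame decoding; the `read_video_frames` helper, the 8-frame sampling, the `video.mp4` path, and the prompt text are illustrative assumptions, not taken from the card:

```python
import av
import numpy as np

# Decode a handful of evenly spaced RGB frames from a local video (PyAV).
def read_video_frames(path, num_frames=8):
    container = av.open(path)
    total = container.streams.video[0].frames
    keep = set(np.linspace(0, total - 1, num_frames).astype(int).tolist())
    frames = [
        frame.to_ndarray(format="rgb24")
        for i, frame in enumerate(container.decode(video=0))
        if i in keep
    ]
    return np.stack(frames)

clip = read_video_frames("video.mp4")  # assumed local file

# LLaVA-OneVision-style chat prompt with a video placeholder.
conversation = [
    {"role": "user", "content": [
        {"type": "video"},
        {"type": "text", "text": "Describe this video."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Preprocess, move inputs to the model's device/dtype, generate, and decode.
inputs = processor(videos=list(clip), text=prompt, return_tensors="pt").to(
    model.device, model.dtype
)
output = model.generate(**inputs, max_new_tokens=128)
decoded = processor.batch_decode(output, skip_special_tokens=True)
print(decoded)
```

This reuses the `model` and `processor` objects loaded in the hunk above and ends at the same `print(decoded)` the next hunk shows as context.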
@@ -129,7 +124,7 @@ print(decoded)

 **Output**: Model generates text in multiple languages.

-**Model Architecture**:
+**Model Architecture**: TowerVideo pairs a multilingual language model based on [Tower-Plus](https://huggingface.co/utter-project/TowerVision-2B) (2B and 9B parameters) with the [SigLIP2-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder through a multimodal adapter for vision-language understanding.

 **Recommended Precision**: We recommend using `bfloat16` precision for optimal performance and memory efficiency when running TowerVision models.

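One note on that recommendation: the load call in the earlier hunk is commented "half-precision" but, as rendered in this diff, passes no dtype, so the weights would default to `float32`. A minimal sketch of an explicit `bfloat16` load, assuming only the documented `torch_dtype` argument of `from_pretrained`:

```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Load TowerVideo in the precision this card recommends (bfloat16).
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "utter-project/TowerVideo-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("utter-project/TowerVideo-2B")
```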
@@ -144,7 +139,7 @@ print(decoded)

 ## Training Data

-TowerVision models are trained on **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:
+TowerVision models are trained on a video/text subset of **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:

 | Dataset | Samples | HF Link | |
 |---------|---------|---------|-------|
@@ -211,16 +206,10 @@ If you find TowerVideo useful in your research, please consider citing the follo

 For errors or additional questions about details in this model card, contact the research team.

-## Terms of Use
-
-We hope that the release of this model will make community-based research efforts more accessible by releasing the weights of highly performant multilingual vision-language models to researchers all over the world.
-
-This model is governed by the Apache 2.0 License.
-
 ## Acknowledgments

 TowerVision builds upon the excellent work of:
 - **[LLaVA-NeXT](https://github.com/GuilhermeViveiros/LLaVA-NeXT)** for the foundational vision-language architecture
-- **[
+- **[TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)** for its multilingual vision-language capabilities
 - **[SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384)** for robust vision encoding
 - The broader multilingual NLP and multimodal communities