Update README.md
README.md
CHANGED
@@ -3,8 +3,6 @@ library_name: transformers
 tags:
 - multimodal
 - multilingual
-- llm
-- vision
 - vlm
 - translation
 language:
@@ -27,11 +25,12 @@ language:
 - nb
 - nn
 base_model:
--
-pipeline_tag:
+- utter-project/TowerVideo-2B
+pipeline_tag: video-text-to-text
+license: cc-by-nc-sa-4.0
 ---
 
-# Model Card for
+# Model Card for TowerVideo
 
 <p align="center">
 <img src="Tower.png" alt="TowerVision Logo" width="200">
@@ -55,12 +54,8 @@ This model card covers the TowerVision family, including the 2B and 9B parameter
 
 | Model | Parameters | HF Link |
 |-------|------------|---------|
-
-
-| TowerVision-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)
-| TowerVision-9B-pt | 9B | [🤗 utter-project/TowerVision-9B-pt](https://huggingface.co/utter-project/TowerVision-9B-pt)
-| TowerVideo-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVision-2B)
-| TowerVideo-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)
+| TowerVideo-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVideo-2B)
+| TowerVideo-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVideo-9B)
 
 ## How to Use TowerVision
 
@@ -129,7 +124,7 @@ print(decoded)
 
 **Output**: Model generates text in multiple languages.
 
-**Model Architecture**:
+**Model Architecture**: TowerVideo uses a multilingual image-language model based on [Tower-Plus](https://huggingface.co/utter-project/TowerVision-2B) (2B and 9B parameters), paired with [SigLIP2-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder through a multimodal adapter for vision-language understanding.
 
 **Recommended Precision**: We recommend using `bfloat16` precision for optimal performance and memory efficiency when running TowerVision models.
 
@@ -144,7 +139,7 @@ print(decoded)
 
 ## Training Data
 
-TowerVision models are trained on **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:
+TowerVision models are trained on a video/text subset of **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:
 
 | Dataset | Samples | HF Link | |
 |---------|---------|---------|-------|
@@ -211,16 +206,10 @@ If you find TowerVideo useful in your research, please consider citing the follo
 
 For errors or additional questions about details in this model card, contact the research team.
 
-## Terms of Use
-
-We hope that the release of this model will make community-based research efforts more accessible by releasing the weights of highly performant multilingual vision-language models to researchers all over the world.
-
-This model is governed by the Apache 2.0 License.
-
 ## Acknowledgments
 
 TowerVision builds upon the excellent work of:
 - **[LLaVA-NeXT](https://github.com/GuilhermeViveiros/LLaVA-NeXT)** for the foundational vision-language architecture
-- **[Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-
+- **[Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-2B)** language models for multilingual capabilities
 - **[SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384)** for robust vision encoding
 - The broader multilingual NLP and multimodal communities
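
The README's `bfloat16` recommendation can be made concrete with a rough weight-memory estimate. The sketch below is illustrative only and is not part of this commit. It assumes nominal parameter counts of 2e9 and 9e9 for the 2B and 9B models, and it counts weights only (no activations or KV cache). The helper name `weight_memory_gib` is hypothetical.

```python
# Rough weight-only memory estimate for different precisions.
# Assumption: parameter counts are the nominal "2B" and "9B" figures,
# not exact counts taken from the released checkpoints.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for name, params in [("TowerVideo-2B", 2e9), ("TowerVideo-9B", 9e9)]:
        for dtype in ("float32", "bfloat16"):
            size = weight_memory_gib(params, dtype)
            print(f"{name} {dtype}: {size:.1f} GiB")
```

Under these assumptions, `bfloat16` halves the weight footprint relative to `float32`; actual memory use at inference time will be higher once activations and caches are included.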