GuilhermeNunes committed on
Commit 6207e88 · verified · 1 Parent(s): 447bd8f

Update README.md

Files changed (1)
  1. README.md +13 -24
README.md CHANGED
@@ -3,8 +3,6 @@ library_name: transformers
  tags:
  - multimodal
  - multilingual
- - llm
- - vision
  - vlm
  - translation
  language:
@@ -27,14 +25,15 @@ language:
  - nb
  - nn
  base_model:
- - Unbabel/Tower-Plus-2B
- pipeline_tag: image-text-to-text
+ - utter-project/TowerVideo-2B
+ pipeline_tag: video-text-to-text
+ license: cc-by-nc-sa-4.0
  ---

- # Model Card for TowerVision
+ # Model Card for TowerVideo

- <p align="center">
- <img src="Tower.png" alt="TowerVision Logo" width="200">
+ <p align="left">
+ <img src="Tower.png" alt="TowerVision Logo" width="300">
  </p>

  TowerVision is a family of open-source multilingual vision-language models with strong capabilities optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. **TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally-aware tasks**, demonstrating exceptional performance across **20 languages and dialects**.
@@ -55,12 +54,8 @@ This model card covers the TowerVision family, including the 2B and 9B parameter

  | Model | Parameters | HF Link |
  |-------|------------|---------|
- | TowerVision-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVision-2B)
- | TowerVision-2B-pt | 2B | [🤗 utter-project/TowerVision-2B-pt](https://huggingface.co/utter-project/TowerVision-2B-pt)
- | TowerVision-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)
- | TowerVision-9B-pt | 9B | [🤗 utter-project/TowerVision-9B-pt](https://huggingface.co/utter-project/TowerVision-9B-pt)
- | TowerVideo-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVision-2B)
- | TowerVideo-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)
+ | TowerVideo-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVideo-2B)
+ | TowerVideo-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVideo-9B)

  ## How to Use TowerVision

@@ -75,12 +70,12 @@ from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

  # Load the model in half-precision
  model = LlavaOnevisionForConditionalGeneration.from_pretrained(
- "utter-project/TowerVideo-9B",
+ "utter-project/TowerVideo-2B",
  device_map="auto"
  )

  processor = AutoProcessor.from_pretrained(
- "utter-project/TowerVideo-9B"
+ "utter-project/TowerVideo-2B"
  )

  # Use your local video
@@ -129,7 +124,7 @@ print(decoded)

  **Output**: Model generates text in multiple languages.

- **Model Architecture**: TowerVision uses a multilingual image-language model based on [Tower-Plus](https://huggingface.co/utter-project/TowerVision-2B) (2B and 9B parameters), paired with [SigLIP2-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder through a multimodal adapter for vision-language understanding.
+ **Model Architecture**: TowerVideo uses a multilingual image-language model based on [Tower-Plus](https://huggingface.co/utter-project/TowerVision-2B) (2B and 9B parameters), paired with [SigLIP2-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder through a multimodal adapter for vision-language understanding.

  **Recommended Precision**: We recommend using `bfloat16` precision for optimal performance and memory efficiency when running TowerVision models.

@@ -144,7 +139,7 @@ print(decoded)

  ## Training Data

- TowerVision models are trained on **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:
+ TowerVision models are trained on a video/text subset of **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:

  | Dataset | Samples | HF Link | |
  |---------|---------|---------|-------|
@@ -211,16 +206,10 @@ If you find TowerVideo useful in your research, please consider citing the follo

  For errors or additional questions about details in this model card, contact the research team.

- ## Terms of Use
-
- We hope that the release of this model will make community-based research efforts more accessible by releasing the weights of highly performant multilingual vision-language models to researchers all over the world.
-
- This model is governed by the Apache 2.0 License.
-
  ## Acknowledgments

  TowerVision builds upon the excellent work of:
  - **[LLaVA-NeXT](https://github.com/GuilhermeViveiros/LLaVA-NeXT)** for the foundational vision-language architecture
- - **[Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-9B)** language models for multilingual capabilities
+ - **[TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B)** vision-language model with multilingual capabilities
  - **[SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384)** for robust vision encoding
  - The broader multilingual NLP and multimodal communities
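
Note on the loading snippet touched by this commit: its comment says "Load the model in half-precision", yet the `from_pretrained` call visible in the hunk passes only the repo id and `device_map="auto"`, with no dtype argument. The following is a minimal sketch, not the authors' code, of how the card's `bfloat16` recommendation could be made explicit. `check_dtype_name`, `load_towervideo`, and `SUPPORTED_DTYPES` are hypothetical names introduced here for illustration; the repo id is the one added in the diff, and `torch_dtype` is assumed to be accepted by the installed `transformers` version.

```python
# Hypothetical helpers for illustration; they are not part of the model card.
MODEL_ID = "utter-project/TowerVideo-2B"  # repo id as it appears in the diff
SUPPORTED_DTYPES = ("bfloat16", "float16", "float32")  # assumption: names this sketch allows


def check_dtype_name(name: str) -> str:
    """Return `name` unchanged if this sketch supports it, else raise ValueError."""
    if name not in SUPPORTED_DTYPES:
        raise ValueError(
            f"unsupported dtype {name!r}; expected one of {SUPPORTED_DTYPES}"
        )
    return name


def load_towervideo(model_id: str = MODEL_ID, dtype_name: str = "bfloat16"):
    """Load the model and processor with an explicit dtype (downloads weights)."""
    # Imported lazily so check_dtype_name can be used without torch installed.
    import torch
    from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=getattr(torch, check_dtype_name(dtype_name)),
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
```

Calling `load_towervideo()` with no arguments would request the 2B checkpoint in `bfloat16`; passing `dtype_name="float16"` is one alternative on hardware without `bfloat16` support.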