Update README.md

README.md CHANGED
@@ -29,10 +29,10 @@ capabilities** through a Speech Adapter.
 
 The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model).
 
-The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
+~~The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
 with Gemma-3-4b-it-speech being a specific instance of this family.
 These models maintain the original Gemma-3 capabilities while adding
-multilingual speech recognition and translation abilities
+multilingual speech recognition and translation abilities.~~
 
 ## Evaluation
 
@@ -56,17 +56,18 @@ Model evaluation metrics and results.
 
 ## Model Details
 
-
+[junnei]: https://huggingface.co/junnei
+Developed by: [junnei][junnei]
 
 Model type: Multimodal (Text, Vision, Speech) Language Model
 
 Language(s): Multilingual
 
-License: [Gemma]()
+License: [Gemma](https://ai.google.dev/gemma/terms)
 
-Base model: [google/gemma-3-4b-it]
+Base model: [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
 
-Inspiration: [Phi-4-multimodal-instruct]
+Inspiration: [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct)
 
 ## Training Details
 
@@ -74,7 +75,7 @@ Inspiration: [Phi-4-multimodal-instruct]
 
 - Due to limited computational resources, the model was **only trained for 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU in 12 hours.
 
-- The training data was limited to **English and Korean languages** from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2) within **
+- The training data was limited to **English and Korean languages** from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2), restricted to clips **less than 30 seconds in duration** (see the filter sketch below).
 
 ## Limitations
 
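The 30-second cap above is a data-preparation step. A minimal sketch of how such a duration filter could be applied with the Hugging Face `datasets` library; the `en_ko` config name and `train` split are assumptions for illustration, since the diff does not show the actual training code:

```python
from datasets import load_dataset

# Hypothetical config/split names; the real training setup is not shown in this diff.
ds = load_dataset("junnei/covost2", "en_ko", split="train")

def under_30s(example):
    audio = example["audio"]
    # Duration in seconds = number of samples / sampling rate.
    return len(audio["array"]) / audio["sampling_rate"] < 30.0

# Keep only clips shorter than 30 seconds.
ds = ds.filter(under_30s)
```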
@@ -86,7 +87,7 @@ To improve the model's performance and reliability, the following areas need further
 - For now, the model only works for Vision-Language tasks and **Audio-Language tasks (ASR/AST).**
 
 - Due to the lack of computing resources,
-this model **primarily recognizes audio files
+this model **primarily recognizes audio files less than 30 seconds** in duration.
 As a result, accuracy may drop significantly for longer audio inputs (a chunking workaround is sketched below).
 
 - If possible, we will train the model for Speech-Vision Tasks and more Audio-Language tasks.
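Since recognition quality falls off beyond roughly 30 seconds, longer recordings can be transcribed piecewise. A minimal sketch of naive fixed-length chunking, assuming `soundfile` for decoding; the cuts are plain sample slices, not silence-aware:

```python
import soundfile as sf

MAX_SECONDS = 30.0

def chunk_audio(path):
    """Yield fixed-length slices no longer than MAX_SECONDS each."""
    samples, rate = sf.read(path)
    step = int(MAX_SECONDS * rate)
    for start in range(0, len(samples), step):
        yield samples[start:start + step], rate

# Hypothetical usage: `transcribe` stands in for the model call from the usage examples.
# text = " ".join(transcribe(chunk, rate) for chunk, rate in chunk_audio("long.wav"))
```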
@@ -146,7 +147,7 @@ print(response)
 ```
 
 
-#### Running the model with
+#### Running the model with raw data
 
 ```python
 from io import BytesIO
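The body of this snippet falls outside the hunk, but the `BytesIO` import suggests it decodes audio from in-memory bytes rather than from a file path. A sketch of that common pattern, assuming `requests` and `soundfile`; the URL is a placeholder, and the actual processor/model calls are not reproduced here:

```python
from io import BytesIO

import requests
import soundfile as sf

# Placeholder URL; the sample actually used by the README is not shown here.
url = "https://example.com/sample.wav"

# Decode the downloaded bytes entirely in memory, with no temporary file.
audio, sample_rate = sf.read(BytesIO(requests.get(url).content))
```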
@@ -181,16 +182,13 @@ with torch.inference_mode():
 print(response)
 ```
 
-## Usage and Limitations
-
-These models have certain limitations that users should be aware of.
 
 ### Citation
 
 ```none
 @article{gemma3mm_2025,
 title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
-author={
+author={Seongjun Jang},
 year={2025}
 }
 