Update README.md

README.md CHANGED
@@ -29,10 +29,10 @@ capabilities** through a Speech Adapter.
 
 The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model).
 
-The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
+~~The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
 with Gemma-3-4b-it-speech being a specific instance of this family.
 These models maintain the original Gemma-3 capabilities while adding
-multilingual speech recognition and translation abilities
+multilingual speech recognition and translation abilities.~~
 
 ## Evaluation
 
@@ -56,17 +56,18 @@ Model evaluation metrics and results.
 
 ## Model Details
 
-
+[junnei]: https://huggingface.co/junnei
+Developed by: [junnei][junnei]
 
 Model type: Multimodal (Text, Vision, Speech) Language Model
 
 Language(s): Multilingual
 
-License: [Gemma]()
+License: [Gemma](https://ai.google.dev/gemma/terms)
 
-Base model: [google/gemma-3-4b-it]
+Base model: [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
 
-Inspiration: [Phi-4-multimodal-instruct]
+Inspiration: [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct)
 
 ## Training Details
 
@@ -74,7 +75,7 @@ Inspiration: [Phi-4-multimodal-instruct]
 
 - Due to limited computational resources, the model was **only trained for 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU in 12 hours.
 
-- The training data was limited to **English and Korean languages** from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2) within **
+- The training data was limited to **English and Korean languages** from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2), restricted to clips **less than 30 seconds in duration** (see the filter sketch below).
 
 ## Limitations
 
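The 30-second cap above is a data-preparation step. A minimal sketch of how such a duration filter could be applied with the Hugging Face `datasets` library; the `en_ko` config name and `train` split are assumptions for illustration, since the diff does not show the actual training code:

```python
from datasets import load_dataset

# Hypothetical config/split names; the real training setup is not shown in this diff.
ds = load_dataset("junnei/covost2", "en_ko", split="train")

def under_30s(example):
    audio = example["audio"]
    # Duration in seconds = number of samples / sampling rate.
    return len(audio["array"]) / audio["sampling_rate"] < 30.0

# Keep only clips shorter than 30 seconds.
ds = ds.filter(under_30s)
```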
@@ -86,7 +87,7 @@ To improve the model's performance and reliability, the following areas need further
 - For now, the model only works for Vision-Language tasks and **Audio-Language tasks (ASR/AST).**
 
 - Due to the lack of computing resources,
-this model **primarily recognizes audio files
+this model **primarily recognizes audio files less than 30 seconds** in duration.
 As a result, accuracy may drop significantly for longer audio inputs (a chunking workaround is sketched below).
 
 - If possible, we will train the model for Speech-Vision Tasks and more Audio-Language tasks.
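Since recognition quality falls off beyond roughly 30 seconds, longer recordings can be transcribed piecewise. A minimal sketch of naive fixed-length chunking, assuming `soundfile` for decoding; the cuts are plain sample slices, not silence-aware:

```python
import soundfile as sf

MAX_SECONDS = 30.0

def chunk_audio(path):
    """Yield fixed-length slices no longer than MAX_SECONDS each."""
    samples, rate = sf.read(path)
    step = int(MAX_SECONDS * rate)
    for start in range(0, len(samples), step):
        yield samples[start:start + step], rate

# Hypothetical usage: `transcribe` stands in for the model call from the usage examples.
# text = " ".join(transcribe(chunk, rate) for chunk, rate in chunk_audio("long.wav"))
```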
@@ -146,7 +147,7 @@ print(response)
 ```
 
 
-#### Running the model with
+#### Running the model with raw data
 
 ```python
 from io import BytesIO
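The body of this snippet falls outside the hunk, but the `BytesIO` import suggests it decodes audio from in-memory bytes rather than from a file path. A sketch of that common pattern, assuming `requests` and `soundfile`; the URL is a placeholder, and the actual processor/model calls are not reproduced here:

```python
from io import BytesIO

import requests
import soundfile as sf

# Placeholder URL; the sample actually used by the README is not shown here.
url = "https://example.com/sample.wav"

# Decode the downloaded bytes entirely in memory, with no temporary file.
audio, sample_rate = sf.read(BytesIO(requests.get(url).content))
```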
@@ -181,16 +182,13 @@ with torch.inference_mode():
 print(response)
 ```
 
-## Usage and Limitations
-
-These models have certain limitations that users should be aware of.
 
 ### Citation
 
 ```none
 @article{gemma3mm_2025,
 title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
-author={
+author={Seongjun Jang},
 year={2025}
 }
 