asierhv committed · verified
Commit f68fd95 · 1 Parent(s): 68c2b7a

added description and "how to use" example

Files changed (1): README.md (+147 -41)
README.md CHANGED
@@ -28,47 +28,111 @@ model-index:
  value: 6.939845474613686
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # Whisper Large Galician

- This model is a fine-tuned version of [openai/whisper-large](https://huggingface.co/openai/whisper-large) on the mozilla-foundation/common_voice_13_0 gl dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.3605
- - Wer: 6.9398

  ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

  ## Training and evaluation data

- More information needed

  ## Training procedure

  ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 32
- - eval_batch_size: 16
- - seed: 42
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 64
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 500
- - training_steps: 20000
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Wer |
  |:-------------:|:-----:|:-----:|:---------------:|:------:|
  | 0.0126 | 4.01 | 1000 | 0.2128 | 8.3558 |
  | 0.0032 | 9.01 | 2000 | 0.2262 | 6.9416 |
@@ -91,27 +155,57 @@ The following hyperparameters were used during training:
  | 0.0 | 94.01 | 19000 | 0.3589 | 6.9467 |
  | 0.0 | 99.01 | 20000 | 0.3605 | 6.9398 |

- ### Framework versions

- - Transformers 4.33.0.dev0
- - Pytorch 2.0.1+cu117
- - Datasets 2.14.4
- - Tokenizers 0.13.3

  ## Citation

- If you use these models in your research, please cite:

  ```bibtex
  @misc{dezuazo2025whisperlmimprovingasrmodels,
-   title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
-   author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
-   year={2025},
-   eprint={2503.23542},
-   archivePrefix={arXiv},
-   primaryClass={cs.CL},
-   url={https://arxiv.org/abs/2503.23542},
  }
  ```

@@ -119,9 +213,21 @@ Please, check the related paper preprint in
  [arXiv:2503.23542](https://arxiv.org/abs/2503.23542)
  for more details.

- ## Licensing

  This model is available under the
  [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
  You are free to use, modify, and distribute this model as long as you credit
- the original creators.
README.md (updated):

  value: 6.939845474613686
  ---

  # Whisper Large Galician

+ ## Model summary
+
+ **Whisper Large Galician** is an automatic speech recognition (ASR) model for **Galician (gl)** speech. It is fine-tuned from [openai/whisper-large](https://huggingface.co/openai/whisper-large) on the **Galician portion of Mozilla Common Voice 13.0**, achieving a **Word Error Rate (WER) of 6.94%** on the Common Voice evaluation split.
+
+ This model provides high-accuracy transcription for large-scale Galician ASR applications.
+
+ ---

  ## Model description

+ * **Architecture:** Transformer-based encoder–decoder (Whisper)
+ * **Base model:** openai/whisper-large
+ * **Language:** Galician (gl)
+ * **Task:** Automatic Speech Recognition (ASR)
+ * **Output:** Text transcription in Galician
+ * **Decoding:** Autoregressive sequence-to-sequence decoding
+
+ The large model leverages Whisper’s multilingual pretraining and is fine-tuned on Galician speech data to deliver high-quality transcription suitable for research, media, and accessibility applications.
+
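+ As a minimal sketch of this encoder–decoder flow with the lower-level `transformers` API (the repo ID is a placeholder, as in the usage example below, and the input file is assumed to be 16 kHz mono):
+
+ ```python
+ import soundfile as sf
+ import torch
+ from transformers import WhisperForConditionalGeneration, WhisperProcessor
+
+ model_id = "HiTZ/whisper-large-gl"  # placeholder; replace with the actual repo ID
+ processor = WhisperProcessor.from_pretrained(model_id)
+ model = WhisperForConditionalGeneration.from_pretrained(model_id)
+
+ speech, sr = sf.read("audio.wav")  # assumed to be 16 kHz mono
+ inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
+
+ # The encoder consumes log-Mel features; the decoder then emits
+ # Galician text tokens one step at a time (autoregressive decoding).
+ with torch.no_grad():
+     predicted_ids = model.generate(inputs.input_features)
+ print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
+ ```
+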
+ ---
+
+ ## Intended use
+
+ ### Primary use cases
+
+ * High-accuracy transcription of Galician audio recordings
+ * Offline or batch ASR pipelines
+ * Research and development in Galician ASR
+ * Media, educational, and archival transcription tasks
+
+ ### Intended users
+
+ * Researchers working on Galician or low-resource ASR
+ * Developers building Galician speech applications
+ * Academic or institutional users
+
+ ### Out-of-scope use
+
+ * Real-time or low-latency ASR without optimization
+ * Speech translation tasks
+ * Safety-critical applications without validation
+
+ ---
+
+ ## Limitations and known issues
+
+ * Performance may degrade on:
+   * Noisy or low-quality recordings
+   * Conversational or spontaneous speech
+   * Accents underrepresented in Common Voice
+ * Transcription errors may still occur under challenging acoustic conditions
+ * Dataset biases from Common Voice may be reflected in outputs
+
+ Users are encouraged to evaluate the model on their own data before deployment.
+
+ ---

  ## Training and evaluation data

+ ### Training data
+
+ * **Dataset:** Mozilla Common Voice 13.0 (Galician subset)
+ * **Data type:** Crowd-sourced, read speech
+ * **Preprocessing** (a sketch follows this list):
+   * Audio resampled to 16 kHz
+   * Text normalized using Whisper tokenizer
+   * Filtering of invalid or problematic samples
+
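+ A minimal sketch of these preprocessing steps with the `datasets` library (the split name, the `sentence` column, and the filtering rule are illustrative assumptions, not the exact training recipe):
+
+ ```python
+ from datasets import Audio, load_dataset
+
+ # Galician subset of Common Voice 13.0 (gated: requires accepting
+ # the dataset terms on the Hugging Face Hub).
+ cv = load_dataset("mozilla-foundation/common_voice_13_0", "gl", split="train")
+
+ # Resample every clip to the 16 kHz rate Whisper expects.
+ cv = cv.cast_column("audio", Audio(sampling_rate=16000))
+
+ # Drop samples with empty transcriptions (one plausible filtering rule).
+ cv = cv.filter(lambda ex: len(ex["sentence"].strip()) > 0)
+ ```
+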
+ ### Evaluation data
+
+ * **Dataset:** Mozilla Common Voice 13.0 (Galician evaluation split)
+ * **Metric:** Word Error Rate (WER)
+
+ ---
+
+ ## Evaluation results
+
+ | Metric | Value |
+ | ---------- | ---------- |
+ | WER (eval) | **6.94%** |
+
+ The final checkpoint reaches this WER together with a validation loss of 0.3605 (see the training results below).
+
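+ WER counts word-level substitutions, insertions, and deletions against a reference transcript. The metric can be reproduced on your own data with the `evaluate` library (the example sentences are made up):
+
+ ```python
+ import evaluate
+
+ wer = evaluate.load("wer")
+
+ references = ["boa tarde a todos"]   # ground-truth transcripts
+ predictions = ["boa tarde a todos"]  # model outputs
+
+ # WER = (substitutions + insertions + deletions) / reference word count
+ print(wer.compute(references=references, predictions=predictions))  # 0.0
+ ```
+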
+ ---

  ## Training procedure

  ### Training hyperparameters

+ * Learning rate: 1e-5
+ * Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-8)
+ * LR scheduler: Linear
+ * Warmup steps: 500
+ * Training steps: 20,000
+ * Train batch size: 32
+ * Evaluation batch size: 16
+ * Gradient accumulation steps: 2
+ * Total train batch size: 64 (32 × 2 accumulation steps)
+ * Seed: 42
+
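+ These settings map onto `transformers` `Seq2SeqTrainingArguments` roughly as follows; a sketch under the assumption that the standard `Seq2SeqTrainer` workflow was used (the output path is a placeholder, and Adam's betas and epsilon above are the library defaults, so they need no explicit arguments):
+
+ ```python
+ from transformers import Seq2SeqTrainingArguments
+
+ training_args = Seq2SeqTrainingArguments(
+     output_dir="./whisper-large-gl",   # placeholder path
+     learning_rate=1e-5,
+     lr_scheduler_type="linear",
+     warmup_steps=500,
+     max_steps=20000,
+     per_device_train_batch_size=32,
+     per_device_eval_batch_size=16,
+     gradient_accumulation_steps=2,     # effective batch size 64
+     seed=42,
+ )
+ ```
+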
+ ### Training results (summary)
+
+ | Training Loss | Epoch | Step | Validation Loss | WER |
  |:-------------:|:-----:|:-----:|:---------------:|:------:|
  | 0.0126 | 4.01 | 1000 | 0.2128 | 8.3558 |
  | 0.0032 | 9.01 | 2000 | 0.2262 | 6.9416 |
  | ... | ... | ... | ... | ... |
  | 0.0 | 94.01 | 19000 | 0.3589 | 6.9467 |
  | 0.0 | 99.01 | 20000 | 0.3605 | 6.9398 |

+ ---
+
+ ## Framework versions

+ - Transformers 4.33.0.dev0
+ - PyTorch 2.0.1+cu117
+ - Datasets 2.14.4
+ - Tokenizers 0.13.3

+ ---
+
+ ## How to use
+
+ ```python
+ from transformers import pipeline
+
+ hf_model = "HiTZ/whisper-large-gl"  # replace with actual repo ID
+ device = 0  # GPU index; set to -1 for CPU
+
+ # Build an ASR pipeline around the fine-tuned Galician checkpoint.
+ pipe = pipeline(
+     task="automatic-speech-recognition",
+     model=hf_model,
+     device=device,
+ )
+
+ result = pipe("audio.wav")
+ print(result["text"])
+ ```
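+
+ Continuing from the snippet above, longer recordings can be transcribed in 30-second chunks and the language and task can be pinned explicitly; `chunk_length_s` and `generate_kwargs` are standard pipeline options, though this variant is a suggestion rather than part of the original card:
+
+ ```python
+ # Chunked long-form transcription, forcing Galician transcription output.
+ result = pipe(
+     "audio.wav",
+     chunk_length_s=30,
+     generate_kwargs={"language": "galician", "task": "transcribe"},
+ )
+ print(result["text"])
+ ```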
+
+ ---
+
+ ## Ethical considerations and risks
+
+ * This model transcribes speech and may process personal data.
+ * Users should ensure compliance with applicable data protection laws (e.g., GDPR).
+ * The model should not be used for surveillance or non-consensual audio processing.
+
+ ---

  ## Citation

+ If you use this model in your research, please cite:

  ```bibtex
  @misc{dezuazo2025whisperlmimprovingasrmodels,
+   title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
+   author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
+   year={2025},
+   eprint={2503.23542},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
  }
  ```

  Please check the related paper preprint in
  [arXiv:2503.23542](https://arxiv.org/abs/2503.23542)
  for more details.

+ ---
+
+ ## License

  This model is available under the
  [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
  You are free to use, modify, and distribute this model as long as you credit
+ the original creators.
+
+ ---
+
+ ## Contact and attribution
+
+ * Fine-tuning and evaluation: HiTZ/Aholab - Basque Center for Language Technology
+ * Base model: OpenAI Whisper
+ * Dataset: Mozilla Common Voice
+
+ For questions or issues, please open an issue in the model repository.