w2v-BERT 2.0 (Galician Fine-Tuned, CTC)
This model is a fine-tuned version of facebook/w2v-bert-2.0 for automatic speech recognition (ASR) in Galician (gl), trained using a CTC objective.
The model is optimised for Galician speech and evaluated across multiple domains, including read speech, broadcast-style audio and conversational content.
Training Data
The model was trained on a combined Galician ASR dataset built from several public and curated corpora.
All audio was normalised to 16 kHz, and all transcripts were standardised to a homogeneous text format.
Datasets Included
- Common Voice v23 (Galician)
- OpenSLR Speech Translation GL-EN (Galician side)
- FLEURS GL-EN (Galician side)
- FalAI (20% of validated split)
- Transcrispeech (Galician)
- RG-Podcast (Galician)
These datasets cover clean read speech, semi-spontaneous speech, and more challenging acoustic conditions.
Dataset Preparation
- Audio resampled to 16 kHz
- Removal of empty, corrupt or invalid audio
- Minimum audio duration: 1 second
- Text normalisation:
  - Lowercasing
  - Unicode normalisation
  - Removal of punctuation
- Removal of empty transcripts
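The text-side steps above can be sketched in a few lines of standard-library Python. This is an illustration only, not the project's actual preprocessing script: the function names are hypothetical, the Unicode normalisation form (NFC) is an assumption because the card does not state one, and punctuation is taken to mean any character in a Unicode "P" category.

```python
import unicodedata

MIN_DURATION_S = 1.0  # minimum audio duration stated in the card


def normalise_text(text):
    """Lowercase, apply Unicode normalisation and strip punctuation."""
    # NFC is an assumption; the card only says "Unicode normalisation".
    text = unicodedata.normalize("NFC", text).lower()
    # Drop every character whose Unicode category is a punctuation class (P*).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # Collapse the whitespace runs that punctuation removal can leave behind.
    return " ".join(text.split())


def keep_example(duration_s, transcript):
    """Return True if an utterance survives the duration and transcript filters."""
    return duration_s >= MIN_DURATION_S and normalise_text(transcript) != ""
```

For example, `normalise_text("¿Que tal?")` yields `que tal`, and a clip whose transcript contains only punctuation is dropped by `keep_example`.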
Tokenization and Vocabulary
A character-level CTC vocabulary was constructed specifically for Galician.
- Supported characters: abcdefghijklmnopqrstuvwxyzáéíóúñç
- Word boundaries represented using the | token
- Special tokens: [UNK], [PAD]
The final vocabulary is stored in vocab.json.
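As a rough illustration of how such a vocabulary could be assembled and used, the sketch below builds a token-to-id mapping from the characters listed above and applies greedy CTC decoding to a sequence of ids. The index order and the helper names are assumptions (the published vocab.json may order tokens differently), and the sketch assumes that [PAD] doubles as the CTC blank, which is the usual convention with Wav2Vec2CTCTokenizer.

```python
import json

GALICIAN_CHARS = "abcdefghijklmnopqrstuvwxyzáéíóúñç"


def build_vocab():
    """Map each character, the word delimiter and the special tokens to an id."""
    tokens = list(GALICIAN_CHARS) + ["|", "[UNK]", "[PAD]"]
    return {token: idx for idx, token in enumerate(tokens)}


def save_vocab(vocab, path="vocab.json"):
    """Write the vocabulary as JSON, keeping accented characters readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(vocab, handle, ensure_ascii=False, indent=2)


def greedy_ctc_decode(ids, vocab):
    """Collapse repeated ids, drop blanks, and turn '|' into spaces."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    blank = vocab["[PAD]"]  # assumption: [PAD] is used as the CTC blank
    pieces = []
    previous = None
    for idx in ids:
        if idx != previous and idx != blank:
            pieces.append(id_to_token[idx])
        previous = idx
    return "".join(pieces).replace("|", " ").strip()
```

Because repeated ids are collapsed, a genuine double letter can only be emitted when a blank separates the two occurrences, which is why the blank token is needed in addition to the characters themselves.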
Training Procedure
Fine-tuning was performed using the 🤗 Transformers Trainer with a CTC loss.
- Base model: facebook/w2v-bert-2.0
- Architecture: Wav2Vec2BertForCTC
- Adapters enabled: Yes
Training Configuration
- Effective batch size: 16
- Per-device batch size: 8
- Gradient accumulation steps: 2
- Learning rate: 5e-6
- Training epochs: 5
- Warmup ratio: 0.1
- Precision: FP16
- Gradient checkpointing: Enabled
- Max gradient norm: 1.0
- Evaluation & checkpointing: Every 2000 steps
- Checkpoint limit: 2
Audio features were extracted using SeamlessM4TFeatureExtractor, and text was tokenized with a custom Wav2Vec2CTCTokenizer.
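The hyperparameters listed above map onto keyword arguments of the Transformers `TrainingArguments` class roughly as follows. This is a sketch rather than the actual training script: the dictionary is illustrative, the output directory and evaluation strategy arguments are left out, and a single training device is assumed because 8 × 2 already gives the stated effective batch size of 16.

```python
# Keyword names follow transformers.TrainingArguments; values come from the card.
training_kwargs = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "learning_rate": 5e-6,
    "num_train_epochs": 5,
    "warmup_ratio": 0.1,
    "fp16": True,
    "gradient_checkpointing": True,
    "max_grad_norm": 1.0,
    "eval_steps": 2000,       # evaluate every 2000 steps
    "save_steps": 2000,       # checkpoint every 2000 steps
    "save_total_limit": 2,    # keep at most two checkpoints on disk
}


def effective_batch_size(kwargs, num_devices=1):
    """Per-device batch size x accumulation steps x number of devices."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_kwargs))  # 8 * 2 * 1 devices (assumed)
```

These values would typically be passed as `TrainingArguments(output_dir=..., **training_kwargs)` together with a step-based evaluation and save strategy.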
Evaluation Results
Evaluation was performed on held-out splits for each corpus and on a combined test set.
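Metrics are reported as WER (Word Error Rate) and CER (Character Error Rate), expressed as fractions rather than percentages (for example, 0.0445 corresponds to 4.45%). Lower is better.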
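For readers unfamiliar with the two metrics, the following is a minimal pure-Python sketch of how WER and CER are commonly computed: the Levenshtein distance between reference and hypothesis, pooled over all utterances and divided by the total reference length. It is not the evaluation code used for this model; the function names are hypothetical, and counting spaces as characters in CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(references, hypotheses):
    """Word errors pooled over all utterances / total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    return errors / sum(len(r.split()) for r in references)


def cer(references, hypotheses):
    """Character errors pooled over all utterances / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return errors / sum(len(r) for r in references)
```

With pooling of this kind, longer utterances contribute more words to the total, so a combined score need not equal the average of the per-corpus scores.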
Fine-Tuned Model Results
Per-corpus results
| Corpus | N | WER | CER |
|---|---|---|---|
| FalAI | 4776 | 0.0445 | 0.0099 |
| CommonVoice | 14563 | 0.0628 | 0.0124 |
| OpenSLR | 282 | 0.1340 | 0.0406 |
| FLEURS | 212 | 0.1330 | 0.0447 |
| Transcrispeech | 1710 | 0.1410 | 0.0481 |
| RG-Podcast | 2015 | 0.1692 | 0.0654 |
Combined test set
| Dataset | N | WER | CER |
|---|---|---|---|
| TOTAL | 23558 | 0.1163 | 0.0383 |
Comparison with Whisper
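The table below compares WER against Whisper-based models evaluated on the same datasets (lower is better). w2v-BERT achieves the lower WER on CommonVoice, FLEURS and Transcrispeech, Whisper on FalAI and OpenSLR, and no Whisper result is available for RG-Podcast.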
| Corpus | w2v-BERT WER | Whisper WER |
|---|---|---|
| FalAI | 0.0445 | 0.0097 |
| CommonVoice | 0.0628 | 0.0688 |
| OpenSLR | 0.1340 | 0.0808 |
| FLEURS | 0.1330 | 0.1980 |
| Transcrispeech | 0.1410 | 0.2097 |
| RG-Podcast | 0.1692 | — |
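To make the size of the gaps in the table easier to read, the relative WER difference can be computed directly from the reported figures. The snippet below is only arithmetic on the numbers above; the helper name is hypothetical, and RG-Podcast is omitted because no Whisper result is given for it.

```python
# (w2v-BERT WER, Whisper WER) pairs copied from the comparison table.
wer_table = {
    "FalAI": (0.0445, 0.0097),
    "CommonVoice": (0.0628, 0.0688),
    "OpenSLR": (0.1340, 0.0808),
    "FLEURS": (0.1330, 0.1980),
    "Transcrispeech": (0.1410, 0.2097),
}


def relative_change(w2v_wer, whisper_wer):
    """Relative WER difference of w2v-BERT with respect to Whisper.

    Negative values mean w2v-BERT makes fewer word errors.
    """
    return (w2v_wer - whisper_wer) / whisper_wer


for corpus, (w2v_wer, whisper_wer) in wer_table.items():
    print(f"{corpus:15s} {relative_change(w2v_wer, whisper_wer):+.1%}")
```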
Intended Use and Limitations
This model is intended for Galician ASR research and transcription pipelines, particularly in CTC-based or streaming-friendly setups.
Performance may degrade on highly spontaneous speech or extremely noisy audio.
The model is monolingual (Galician-only) and not intended for multilingual ASR or speech translation.
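A transcription pipeline built on this model might look like the sketch below. It assumes that the repository proxectonos/w2v-bert-2.0-gl ships a processor configuration that `AutoProcessor` can load, and that `torch`, `transformers` and `librosa` are installed; the function name and the use of `librosa` for loading and resampling are illustrative choices, not part of the model card.

```python
MODEL_ID = "proxectonos/w2v-bert-2.0-gl"
TARGET_SAMPLE_RATE = 16_000  # the model was trained on 16 kHz audio


def transcribe(audio_path):
    """Greedy CTC transcription of a single audio file (illustrative sketch)."""
    # Heavy dependencies are imported lazily so the module can be imported without them.
    import librosa
    import torch
    from transformers import AutoProcessor, Wav2Vec2BertForCTC

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Wav2Vec2BertForCTC.from_pretrained(MODEL_ID)
    model.eval()

    # Load as mono and resample to 16 kHz to match the training data.
    speech, _ = librosa.load(audio_path, sr=TARGET_SAMPLE_RATE, mono=True)
    inputs = processor(speech, sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy decoding: most likely token per frame, then CTC collapsing in the tokenizer.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe("exemplo.wav")`, where the file name is a placeholder, would return a lowercase, unpunctuated Galician transcript, in line with the text normalisation applied during training.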
Contact information
For further information, send an email to proxecto.nos@usc.gal
Licensing information
Acknowledgements
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA. (This publication of the project Desarrollo de Modelos ALIA is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU.)
Thanks also to Balidea for the technical development of this model.
Citation
@misc{proxectenos2026w2v-bert-2.0-gl,
author = {{Proxecto Nós}},
title = {{w2v-BERT 2.0} (Galician Fine-Tuned, CTC)},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/proxectonos/w2v-bert-2.0-gl/}},
}