w2v-BERT 2.0 (Galician Fine-Tuned, CTC)

This model is a fine-tuned version of facebook/w2v-bert-2.0 for automatic speech recognition (ASR) in Galician (gl), trained with a Connectionist Temporal Classification (CTC) objective.

The model is optimised for Galician speech and evaluated across multiple domains, including read speech, broadcast-style audio and conversational content.

Training Data

The model was trained on a combined Galician ASR dataset built from several public and curated corpora.
All audio was normalised to 16 kHz, and all transcripts were standardised to a homogeneous text format.

Datasets Included

  • Common Voice v23 (Galician)
  • OpenSLR Speech Translation GL-EN (Galician side)
  • FLEURS GL-EN (Galician side)
  • FalAI (20% of validated split)
  • Transcrispeech (Galician)
  • RG-Podcast (Galician)

These datasets cover clean read speech, semi-spontaneous speech, and more challenging acoustic conditions.

Dataset Preparation

  • Audio resampled to 16 kHz
  • Removal of empty, corrupt or invalid audio
  • Minimum audio duration: 1 second
  • Text normalisation:
    • Lowercasing
    • Unicode normalisation
    • Removal of punctuation
    • Removal of empty transcripts
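
The preparation steps above can be sketched in a few lines of standard-library Python. This is only an illustration, not the project's actual pipeline: the function names are hypothetical, NFC is assumed as the Unicode normalisation form, and punctuation is assumed to be deleted outright rather than replaced by spaces.

```python
import unicodedata

MIN_DURATION_S = 1.0  # minimum audio duration stated in the list above


def normalise_transcript(text: str) -> str:
    """Lowercase, Unicode-normalise (NFC assumed) and strip punctuation."""
    text = unicodedata.normalize("NFC", text).lower()
    # Drop every character whose Unicode category is punctuation (P*).
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # Collapse any runs of whitespace left behind.
    return " ".join(text.split())


def keep_example(num_samples: int, sample_rate: int, transcript: str) -> bool:
    """Return True if a clip is long enough and its transcript is non-empty."""
    duration_s = num_samples / sample_rate
    return duration_s >= MIN_DURATION_S and bool(normalise_transcript(transcript))
```

A clip with 16000 samples at 16 kHz lasts exactly one second and is kept, whereas a transcript that contains only punctuation is treated as empty and the clip is dropped.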

Tokenization and Vocabulary

A character-level CTC vocabulary was constructed specifically for Galician.

  • Supported characters:
    abcdefghijklmnopqrstuvwxyzáéíóúñç
  • Word boundaries represented using the | token
  • Special tokens:
    • [UNK]
    • [PAD]

The final vocabulary is stored in vocab.json.
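
As a rough sketch of how such a vocabulary file could be produced, the snippet below maps the characters listed above, the | word delimiter and the two special tokens to integer ids. The id ordering and the helper names are assumptions made for illustration; the published vocab.json may order its entries differently.

```python
import json

# Character inventory copied from the list above.
CHARS = "abcdefghijklmnopqrstuvwxyzáéíóúñç"


def build_vocab(chars: str = CHARS) -> dict:
    """Map each character, the word delimiter and special tokens to an id."""
    vocab = {ch: idx for idx, ch in enumerate(sorted(set(chars)))}
    for token in ("|", "[UNK]", "[PAD]"):
        vocab[token] = len(vocab)
    return vocab


def save_vocab(path: str = "vocab.json") -> None:
    """Write the vocabulary as UTF-8 JSON, keeping accented letters readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_vocab(), handle, ensure_ascii=False, indent=2)
```

With the 33 characters above plus three extra tokens, this yields 36 entries.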

Training Procedure

Fine-tuning was performed using the 🤗 Transformers Trainer with a CTC loss.

  • Base model: facebook/w2v-bert-2.0
  • Architecture: Wav2Vec2BertForCTC
  • Adapters enabled: Yes

Training Configuration

  • Effective batch size: 16
  • Per-device batch size: 8
  • Gradient accumulation steps: 2
  • Learning rate: 5e-6
  • Training epochs: 5
  • Warmup ratio: 0.1
  • Precision: FP16
  • Gradient checkpointing: Enabled
  • Max gradient norm: 1.0
  • Evaluation & checkpointing: Every 2000 steps
  • Checkpoint limit: 2
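
For readers who want to reproduce a similar setup, the configuration above can be written as keyword arguments. The key names below mirror parameters of transformers.TrainingArguments as they exist at the time of writing; the single-device assumption and the helper function are illustrative only and are not taken from the original training script.

```python
# Hyperparameters from the list above, keyed by TrainingArguments names.
training_kwargs = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "learning_rate": 5e-6,
    "num_train_epochs": 5,
    "warmup_ratio": 0.1,
    "fp16": True,
    "gradient_checkpointing": True,
    "max_grad_norm": 1.0,
    "eval_steps": 2000,       # evaluate every 2000 steps
    "save_steps": 2000,       # checkpoint every 2000 steps
    "save_total_limit": 2,    # keep at most two checkpoints on disk
}


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Per-device batch size x accumulation steps x number of devices."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )
```

With one device (an assumption), 8 x 2 gives the effective batch size of 16 listed above. Step-based evaluation and saving also require the corresponding strategy options to be set to "steps".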

Audio features were extracted using SeamlessM4TFeatureExtractor, and text was tokenized with a custom Wav2Vec2CTCTokenizer.
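
To make the role of the character-level tokenizer concrete, here is a minimal sketch of greedy CTC decoding: consecutive repeated ids are collapsed, blank frames are dropped, and the | token becomes a space. It assumes [PAD] doubles as the CTC blank, which is the usual convention for Wav2Vec2CTCTokenizer; the toy vocabulary and function name are hypothetical.

```python
# Toy id-to-token table for illustration only (not the real vocab.json).
TOY_ID_TO_TOKEN = {0: "[PAD]", 1: "|", 2: "o", 3: "l", 4: "a"}


def ctc_greedy_decode(frame_ids, id_to_token, blank="[PAD]", word_delimiter="|"):
    """Collapse repeated frame ids, remove blanks, and restore spaces."""
    tokens = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when the id changes from the previous frame.
        if idx != previous:
            token = id_to_token[idx]
            if token != blank:
                tokens.append(token)
        previous = idx
    return "".join(tokens).replace(word_delimiter, " ").strip()
```

A blank frame between two identical ids keeps both characters, which is how CTC represents doubled letters.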

Evaluation Results

Evaluation was performed on held-out splits for each corpus and on a combined test set.
Metrics are reported as WER (Word Error Rate) and CER (Character Error Rate), expressed as fractions rather than percentages (for example, 0.0445 corresponds to 4.45%).
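
As a reference for how these two metrics are typically computed, here is a minimal pure-Python sketch based on Levenshtein distance. It pools errors over the whole corpus (total edits divided by total reference length). Whether the reported figures use pooling or per-utterance averaging is not stated here, so treat that choice as an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(
                min(
                    previous[j] + 1,             # deletion
                    current[j - 1] + 1,          # insertion
                    previous[j - 1] + (r != h),  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def wer(references, hypotheses):
    """Corpus-level word error rate: word edits / reference words."""
    errors = sum(
        edit_distance(r.split(), h.split())
        for r, h in zip(references, hypotheses)
    )
    return errors / sum(len(r.split()) for r in references)


def cer(references, hypotheses):
    """Corpus-level character error rate: char edits / reference chars."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return errors / sum(len(r) for r in references)
```

For the reference "ola que tal" and the hypothesis "ola que mal", one of three words and one of eleven characters differ.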

Fine-Tuned Model Results

Per-corpus results

Corpus           N       WER      CER
FalAI            4776    0.0445   0.0099
CommonVoice      14563   0.0628   0.0124
OpenSLR          282     0.1340   0.0406
FLEURS           212     0.1330   0.0447
Transcrispeech   1710    0.1410   0.0481
RG-Podcast       2015    0.1692   0.0654

Combined test set

Dataset          N       WER      CER
TOTAL            23558   0.1163   0.0383

Comparison with Whisper

WER comparison against Whisper-based models evaluated on the same datasets:

Corpus           w2v-BERT WER   Whisper WER
FalAI            0.0445         0.0097
CommonVoice      0.0628         0.0688
OpenSLR          0.1340         0.0808
FLEURS           0.1330         0.1980
Transcrispeech   0.1410         0.2097
RG-Podcast       0.1692

Intended Use and Limitations

This model is intended for Galician ASR research and transcription pipelines, particularly in CTC-based or streaming-friendly setups.

Performance may degrade on highly spontaneous speech or extremely noisy audio.
The model is monolingual (Galician-only) and not intended for multilingual ASR or speech translation.
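
For the transcription use case described above, a minimal inference sketch with 🤗 Transformers might look as follows. It assumes the repository ships a processor configuration loadable through AutoProcessor and that the input is a mono waveform already resampled to 16 kHz; the function name is hypothetical and the heavy imports are deferred until the function is called.

```python
MODEL_ID = "proxectonos/w2v-bert-2.0-gl"
SAMPLE_RATE = 16_000  # the model expects 16 kHz audio


def transcribe(waveform, model_id: str = MODEL_ID) -> str:
    """Transcribe a mono 16 kHz waveform (1-D float array) greedily."""
    # Imported lazily so that defining this function needs no dependencies.
    import torch
    from transformers import AutoProcessor, Wav2Vec2BertForCTC

    processor = AutoProcessor.from_pretrained(model_id)
    model = Wav2Vec2BertForCTC.from_pretrained(model_id)
    model.eval()

    # Extract input features from the raw waveform.
    inputs = processor(
        audio=waveform, sampling_rate=SAMPLE_RATE, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy CTC decoding: best token per frame, then collapse and detokenize.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

The waveform can come from any audio loader that returns a float array, provided it is resampled to 16 kHz first. Loading the model once and reusing it across calls would be preferable in a real pipeline.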

Contact information

For further information, send an email to proxecto.nos@usc.gal

Licensing information

Apache License, Version 2.0

Acknowledgements

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA. (This publication of the Desarrollo de Modelos ALIA project is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the Plan de Recuperación, Transformación y Resiliencia, funded by the European Union, NextGenerationEU.)

Thanks also to Balidea for the technical development of this model.

Citation

@misc{proxectenos2026w2v-bert-2.0-gl,
  author       = {{Proxecto Nós}},
  title        = {{w2v-BERT 2.0} (Galician Fine-Tuned, CTC)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/proxectonos/w2v-bert-2.0-gl/}},
}