blas-m3-lm - Irish (Gaeilge) speech-to-text

Open-source Irish ASR: a wav2vec2 CTC acoustic model with a bundled KenLM 5-gram language model, loaded through Wav2Vec2ProcessorWithLM. On BlasBench, the open Irish ASR benchmark, it ranks #1 on Common Voice among all systems measured, open and commercial, including Microsoft Azure.

Try it on any device, no install: the demo Space runs this exact model - open the URL on your phone or laptop.

Results (BlasBench, irish normaliser)

| Test set             | WER (%) | CER (%) | n   |
|----------------------|---------|---------|-----|
| Common Voice 25.0 ga-IE | 19.95 | 8.42    | 874 |
| FLEURS ga-IE         | 48.05   | 24.20   | 842 |

Numbers are read from the harness's results.json. The bundled KenLM lowers WER by 5.6 points relative to the acoustic model alone. For reference, ABAIR/Fotheidil self-report a WER of 19.6 on Common Voice with their full pipeline (different normaliser) and 23.7 without an LM; this model beats their no-LM number under a shared, controlled benchmark.
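For readers less familiar with the two metrics in the table, the sketch below shows how WER and CER are conventionally computed: Levenshtein edit distance over words (or characters), divided by the reference length. It is a minimal illustration, not the BlasBench harness: the function names are hypothetical, the Irish normaliser is not reproduced, and summing errors over the whole test set before dividing is an assumption about the aggregation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent; unit is "word" (WER) or "char" (CER)."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return 100.0 * errors / total


# Illustrative sentences, already normalised: one of five reference words is dropped.
refs = ["tá an aimsir go maith"]
hyps = ["tá an aimsir maith"]
print(error_rate(refs, hyps, unit="word"))  # 20.0
print(error_rate(refs, hyps, unit="char"))
```

Because text normalisation changes which tokens count as errors, figures produced this way are only comparable when the same normaliser is applied to both references and hypotheses.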

Usage

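import librosa
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

# AutoProcessor resolves to Wav2Vec2ProcessorWithLM (KenLM bundled)
proc = AutoProcessor.from_pretrained("jyoutir/blas-m3-lm")
model = Wav2Vec2ForCTC.from_pretrained("jyoutir/blas-m3-lm").eval()

# Load the audio as 16 kHz mono; librosa resamples if needed
wav, _ = librosa.load("audio.wav", sr=16000)
inputs = proc(wav, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    logits = model(inputs.input_values).logits

# LM beam-search decode; .text holds one transcript per batch item
print(proc.batch_decode(logits.numpy()).text[0])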

LM decoding needs pyctcdecode and kenlm (pip install pyctcdecode kenlm; on Windows: conda install -c conda-forge kenlm; Python ≤ 3.12). The KenLM model ships inside the repo's language_model/ folder, so AutoProcessor wires it up automatically.
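Since LM decoding depends on two optional packages, it can help to check for them up front and to know what the no-LM path does. The sketch below is illustrative only: both helper names are hypothetical, and the greedy decoder is the textbook CTC rule (collapse repeated frames, then drop blanks), shown on a toy vocabulary rather than this model's real token table. The blank id of 0 is an assumption.

```python
import importlib.util


def missing_lm_dependencies(modules=("pyctcdecode", "kenlm")):
    """Return the module names that cannot be found in the current environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0):
    """Textbook greedy CTC decoding: merge consecutive repeats, then remove blanks."""
    tokens = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank_id:
            tokens.append(id_to_token[idx])
        prev = idx
    return "".join(tokens)


missing = missing_lm_dependencies()
if missing:
    print("LM decoding unavailable, missing:", ", ".join(missing))

# Toy example: per-frame argmax ids over a hypothetical 3-symbol vocabulary.
toy_vocab = {1: "t", 2: "á"}
print(greedy_ctc_decode([0, 1, 1, 0, 2, 2, 0], toy_vocab))  # tá
```

Greedy decoding uses only the acoustic model, so it would correspond to the no-LM setting rather than to the headline numbers above.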

Model

  • Acoustic: wav2vec2-XLS-R (300M) fine-tuned for Irish (~1,491h gold + broadcast silver).
  • LM: 5-gram KenLM built from human Irish text only (DOEGEN gold, Oireachtas ga, gold transcripts; ~20.5k lines). Common Voice test sentences were removed to guard against leakage. Decoding uses pyctcdecode with beam_width=100, α=0.5, β=1.5.
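To give a feel for what α and β do during decoding, here is a minimal sketch of shallow-fusion scoring in its usual form: the acoustic log-probability of a hypothesis, plus α times the LM log-probability of its words, plus a bonus of β per word. The function name and all numbers are hypothetical, and pyctcdecode's actual scoring includes further details (log-base conversion, unknown-word handling, partial-word scoring) that are left out here.

```python
def fused_score(acoustic_logp, word_lm_logps, alpha=0.5, beta=1.5):
    """Shallow-fusion score of one beam hypothesis.

    acoustic_logp -- log-probability of the hypothesis under the CTC model
    word_lm_logps -- per-word LM log-probabilities (same log base, assumed)
    alpha         -- LM weight
    beta          -- per-word bonus that offsets the LM's bias toward short outputs
    """
    return acoustic_logp + alpha * sum(word_lm_logps) + beta * len(word_lm_logps)


# Two made-up two-word hypotheses: A is slightly better acoustically,
# B is much more plausible to the language model.
score_a = fused_score(-4.0, [-6.0, -7.0])
score_b = fused_score(-4.5, [-2.0, -2.5])
print(score_a, score_b)  # -7.5 -3.75
print("winner:", "B" if score_b > score_a else "A")
```

With α=0 the ranking falls back to the acoustic score alone; larger α lets the LM override acoustically close alternatives, which is the mechanism behind the LM gain reported in the results.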

Roadmap

A fully on-device / in-browser build - distilling the LM into the acoustic weights so the same quality runs offline on a phone or laptop with zero dependencies - is in progress.

License & citation

Apache-2.0. Part of the Blas Voice / BlasBench Irish ASR project.

@misc{blasbench2026, title={BlasBench: an open Irish ASR benchmark},
  author={Jyoutir Raj and John Conway}, year={2026},
  howpublished={\url{https://github.com/jyoutir/blasbench}}}