blas-m3-lm - Irish (Gaeilge) speech-to-text

Open-source Irish ASR: a wav2vec2 CTC acoustic model with a bundled KenLM 5-gram language model (Wav2Vec2ProcessorWithLM). On BlasBench, the open Irish ASR benchmark, it ranks #1 on Common Voice among all systems measured, ahead of Microsoft Azure and every other open or commercial model evaluated.

Try it on any device, no install: the demo Space runs this exact model - open the URL on your phone or laptop.

Results (BlasBench, irish normaliser)

Test set	WER (%)	CER (%)	n
Common Voice 25.0 `ga-IE`	19.95	8.42	874
FLEURS `ga-IE`	48.05	24.20	842

Numbers are read from the harness's results.json. The bundled KenLM lowers WER by 5.6 points relative to the acoustic model alone. For reference, ABAIR/Fotheidil self-report 19.6 WER on Common Voice with their full pipeline (different normaliser) and 23.7 without an LM; this model beats their no-LM number under a shared, controlled benchmark.
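
For readers unfamiliar with the metrics, here is a minimal sketch of how WER and CER are conventionally computed: total edit distance divided by total reference length, in percent. It is not the BlasBench harness. The function names are illustrative, the Irish normaliser is not reproduced, and corpus-level aggregation and counting spaces as characters are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = cur
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent (assumed aggregation)."""
    # Words are whitespace-separated; characters include spaces (assumption).
    split = str.split if unit == "word" else list
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return 100.0 * edits / total


refs = ["dia duit a chara"]
hyps = ["dia dhuit a chara"]
print(error_rate(refs, hyps, unit="word"))  # one wrong word out of four
print(error_rate(refs, hyps, unit="char"))  # one inserted character out of sixteen
```

Because scores depend on text normalisation, numbers produced this way are only comparable with the table above if the same normaliser is applied to references and hypotheses first.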

Usage

import librosa
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

# AutoProcessor resolves to Wav2Vec2ProcessorWithLM (KenLM bundled)
proc  = AutoProcessor.from_pretrained("jyoutir/blas-m3-lm")
model = Wav2Vec2ForCTC.from_pretrained("jyoutir/blas-m3-lm").eval()

wav, _ = librosa.load("audio.wav", sr=16000)        # load as 16 kHz mono
inputs = proc(wav, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():                               # inference only, no gradients
    logits = model(inputs.input_values).logits
print(proc.batch_decode(logits.numpy()).text[0])    # beam search with the KenLM

LM decoding requires pyctcdecode and kenlm: pip install pyctcdecode kenlm (on Windows: conda install -c conda-forge kenlm; Python ≤ 3.12). The KenLM model ships inside the repo's language_model/ folder, so AutoProcessor loads it automatically.
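
To catch a missing dependency before the model is downloaded, a small pre-flight check can help. This is a sketch using only the standard library; the helper name missing_lm_dependencies is hypothetical and not part of this repo or of transformers.

```python
import importlib.util


def missing_lm_dependencies(modules=("pyctcdecode", "kenlm")):
    """Return the names of the listed top-level modules that are not installed."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_lm_dependencies()
if missing:
    print("Missing:", ", ".join(missing), "(try: pip install pyctcdecode kenlm)")
else:
    print("LM decoding dependencies found.")
```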

Model

Acoustic: wav2vec2-XLS-R (300M) fine-tuned for Irish (~1,491h gold + broadcast silver).
LM: 5-gram KenLM built from human Irish text only (DOEGEN gold, Oireachtas ga, gold transcripts; ~20.5k lines). Common Voice test sentences were stripped from the LM text as a leakage guard. Decoding: pyctcdecode with beam_width=100, α=0.5, β=1.5.
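
To make the roles of α and β concrete, here is a simplified sketch of shallow-fusion scoring, in which α weights the language-model log-probability and β adds a per-word bonus. It is not pyctcdecode's actual implementation, which handles further details such as partial words and unknown-token offsets. The function names and all scores below are invented for illustration.

```python
def fused_score(acoustic_logp, lm_logp, n_words, alpha=0.5, beta=1.5):
    """Simplified shallow fusion: acoustic score + alpha * LM score + beta * word count."""
    return acoustic_logp + alpha * lm_logp + beta * n_words


def pick_best(hypotheses, alpha=0.5, beta=1.5):
    """Return the hypothesis with the highest fused score."""
    return max(
        hypotheses,
        key=lambda h: fused_score(
            h["acoustic_logp"], h["lm_logp"], len(h["text"].split()), alpha, beta
        ),
    )


# Made-up scores: the second hypothesis is slightly preferred acoustically,
# but the language model finds it much less likely.
hypotheses = [
    {"text": "tá an aimsir go maith", "acoustic_logp": -12.0, "lm_logp": -8.0},
    {"text": "tá an aimsir go mhaith", "acoustic_logp": -11.5, "lm_logp": -14.0},
]

print(pick_best(hypotheses)["text"])             # LM weight on: first hypothesis wins
print(pick_best(hypotheses, alpha=0.0)["text"])  # LM weight off: second hypothesis wins
```

With the bundled processor, comparable parameters can be passed at decode time, for example proc.batch_decode(logits, beam_width=100, alpha=0.5, beta=1.5); whether the repo's defaults already match the values above is not stated here.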

Roadmap

A fully on-device / in-browser build - distilling the LM into the acoustic weights so the same quality runs offline on a phone or laptop with zero dependencies - is in progress.

License & citation

Apache-2.0. Part of the Blas Voice / BlasBench Irish ASR project.

@misc{blasbench2026, title={BlasBench: an open Irish ASR benchmark},
  author={Jyoutir Raj and John Conway}, year={2026},
  howpublished={\url{https://github.com/jyoutir/blasbench}}}
