Byrne-ASR-English
Model will be ungated for open download once I am done with the base..
A tiny (~12M parameter) character-level CTC English speech recognizer. Runs comfortably on CPU.
audio -> log-mel(80) -> 4x conv subsample -> 8 FFT blocks
(RMSNorm + RoPE/QK-Norm + Conv-SwiGLU + DERF value-gate + HRM refine) -> CTC head
Model
The shipped asr_swa.pt is the EMA (exponential-moving-average) checkpoint from a multispeaker
run on LibriTTS-R + LJSpeech, trained with EMA self-distillation + SpecAugment consistency.
The EMA weights are an online weight-average (akin to SWA), which gave the best held-out accuracy.
- Held-out (LibriTTS-R dev-clean) CER ≈ 0.085 (greedy); lower with the n-gram decoder.
- Vocab: blank + space + a–z + apostrophe (29). Lowercase, no digits/punctuation.
Usage
Self-contained — needs byrne_asr.py, asr_swa.pt, and data/:
from byrne_asr import ByrneASR
asr = ByrneASR("asr_swa.pt", device="cpu")
print(asr.transcribe("clip.wav")) # default: lexicon + bigram LM
print(asr.transcribe("clip.wav", lm="ngram")) # + pure-Python 3-gram LM (data/lm3.arpa.gz)
print(asr.transcribe("clip.wav", lm="greedy")) # raw CTC argmax
CLI: python byrne_asr.py --wav clip.wav --device cpu --lm ngram
Decoder
Lexicon-constrained CTC beam search; each completed word is scored by a language model:
lm="bigram"(default):0.4·zipf(word) + 0.3·log10(1+count(prev,word)) − 4.0(word penalty)lm="ngram": a pure-Python ARPA n-gram LM (data/lm3.arpa.gz, 3-gram with Kneser-Ney), trained on a 5M-sentence English corpus (news 2018–2020 + Wikipedia + web). No compiled deps.lm="unigram": frequency only.lm="greedy": no LM.
Lexicon: wordfreq top-120k words (data/lexicon_freq.tsv). Bigram counts: data/bigram.tsv.
The n-gram path falls back to bigram if data/lm3.arpa.gz is absent.
Honest limitations
- Domain: best on clean, wideband (≥16 kHz) English. 8 kHz narrowband / telephone audio is out of domain (the upper mel bands are empty) and transcribes poorly regardless of decoder.
- Long audio: trained on short utterances; quality degrades over very long single passes — chunk long/streaming audio into ~5–10 s windows.
- Proper names: an unseen name may map to a homophone (e.g. Byrne → "burn"). Everyday words including the classic pangram ("the quick brown fox jumps over the lazy dog") transcribe correctly.
The acoustic model is the ceiling; the LM closes the spelling/word-choice gap, not acoustic gaps.
License
MIT.