Byrne-ASR-English

The model will be ungated for open download once I am done with the base.

A tiny (~12M parameter) character-level CTC English speech recognizer. Runs comfortably on CPU.

audio -> log-mel(80) -> 4x conv subsample -> 8 FFT blocks
         (RMSNorm + RoPE/QK-Norm + Conv-SwiGLU + DERF value-gate + HRM refine) -> CTC head

Model

The shipped asr_swa.pt is the EMA (exponential-moving-average) checkpoint from a multispeaker run on LibriTTS-R + LJSpeech, trained with EMA self-distillation + SpecAugment consistency. The EMA weights are an online weight average (akin to SWA, stochastic weight averaging) and gave the best held-out accuracy of the checkpoints from that run.
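For readers unfamiliar with EMA checkpoints, the update rule can be sketched as below. This is a generic illustration and not the training code from this repository: the function name ema_update, the decay value, and the scalar "weights" are all assumptions made for the example.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """Return new EMA weights: decay * ema + (1 - decay) * current.

    Hypothetical helper; the decay actually used for asr_swa.pt is not stated.
    """
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * model_weights[name]
        for name in ema_weights
    }


# Toy example with scalar "weights" (a real checkpoint holds tensors).
ema = {"w": 0.0}
for current_value in [1.0, 1.0, 1.0]:
    ema = ema_update(ema, {"w": current_value}, decay=0.5)
print(ema["w"])  # 0.875
```

Because the average is updated after every optimizer step, it smooths out step-to-step noise in the raw weights, which is the sense in which it resembles SWA.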

  • Held-out (LibriTTS-R dev-clean) character error rate (CER) ≈ 0.085, i.e. about 8.5%, with greedy decoding; lower with the n-gram decoder.
  • Vocab: blank + space + a–z + apostrophe (29). Lowercase, no digits/punctuation.
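The 29-symbol vocabulary and the greedy (raw CTC argmax) decoding rule can be sketched as follows. The index order used here (blank first, then space, a–z, apostrophe) is an assumption for illustration; the ordering inside byrne_asr.py may differ, and greedy_ctc_decode is a hypothetical name.

```python
import string

# Assumed index order: 0 = blank, 1 = space, 2..27 = a-z, 28 = apostrophe.
BLANK = 0
VOCAB = ["<blank>", " "] + list(string.ascii_lowercase) + ["'"]


def greedy_ctc_decode(frame_ids):
    """Collapse consecutive repeated ids, then drop blanks."""
    chars = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            chars.append(VOCAB[idx])
        prev = idx
    return "".join(chars)


# Per-frame argmax ids for "hello": the blank between the two l's is what
# lets CTC emit a doubled letter.
h, e, l, o = (VOCAB.index(c) for c in "helo")
frames = [h, h, BLANK, e, l, l, BLANK, l, o]
print(len(VOCAB))                 # 29
print(greedy_ctc_decode(frames))  # hello
```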

Usage

Self-contained — needs byrne_asr.py, asr_swa.pt, and data/:

from byrne_asr import ByrneASR

asr = ByrneASR("asr_swa.pt", device="cpu")
print(asr.transcribe("clip.wav"))                 # default: lexicon + bigram LM
print(asr.transcribe("clip.wav", lm="ngram"))     # + pure-Python 3-gram LM (data/lm3.arpa.gz)
print(asr.transcribe("clip.wav", lm="greedy"))    # raw CTC argmax

CLI: python byrne_asr.py --wav clip.wav --device cpu --lm ngram

Decoder

Lexicon-constrained CTC beam search; each completed word is scored by a language model:

  • lm="bigram" (default): 0.4·zipf(word) + 0.3·log10(1+count(prev,word)) − 4.0 (word penalty)
  • lm="ngram": a pure-Python ARPA n-gram LM (data/lm3.arpa.gz, 3-gram with Kneser-Ney), trained on a 5M-sentence English corpus (news 2018–2020 + Wikipedia + web). No compiled deps.
  • lm="unigram": frequency only. lm="greedy": no LM.
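The default bigram word score listed above can be written out as a small function. This is a sketch of the formula only, not the decoder's actual code: the function name, the dictionary-based inputs, and the choice to treat unseen words as zipf 0 are assumptions.

```python
import math


def bigram_word_score(word, prev_word, zipf, bigram_counts,
                      w_zipf=0.4, w_bigram=0.3, word_penalty=4.0):
    """0.4*zipf(word) + 0.3*log10(1 + count(prev, word)) - 4.0."""
    z = zipf.get(word, 0.0)                       # assumed: unseen word -> 0
    c = bigram_counts.get((prev_word, word), 0)   # unseen pair -> count 0
    return w_zipf * z + w_bigram * math.log10(1 + c) - word_penalty


# Made-up numbers for illustration (not taken from data/lexicon_freq.tsv
# or data/bigram.tsv).
zipf = {"fox": 4.5}
bigram_counts = {("brown", "fox"): 99}
seen = bigram_word_score("fox", "brown", zipf, bigram_counts)
unseen = bigram_word_score("fox", "lazy", zipf, bigram_counts)
print(seen > unseen)  # True
```

The constant word penalty discourages the beam from splitting audio into many short words, while the bigram term rewards word pairs that were actually observed together.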

Lexicon: wordfreq top-120k words (data/lexicon_freq.tsv). Bigram counts: data/bigram.tsv. The n-gram path falls back to bigram if data/lm3.arpa.gz is absent.
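The fallback behaviour described above could look roughly like this. resolve_lm is a hypothetical helper written for illustration; the actual check inside byrne_asr.py is not shown in this card.

```python
import os


def resolve_lm(requested, data_dir="data"):
    """Return the LM mode to use, falling back from 'ngram' to 'bigram'
    when the ARPA file is missing. Other modes pass through unchanged."""
    arpa_path = os.path.join(data_dir, "lm3.arpa.gz")
    if requested == "ngram" and not os.path.exists(arpa_path):
        return "bigram"
    return requested


mode = resolve_lm("ngram", data_dir="no_such_directory_for_demo")
print(mode)  # bigram
```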

Honest limitations

  • Domain: best on clean, wideband (≥16 kHz) English. 8 kHz narrowband / telephone audio is out of domain (the upper mel bands are empty) and transcribes poorly regardless of decoder.
  • Long audio: trained on short utterances; quality degrades over very long single passes — chunk long/streaming audio into ~5–10 s windows.
  • Proper names: an unseen name may map to a homophone (e.g. Byrne → "burn"). Everyday words including the classic pangram ("the quick brown fox jumps over the lazy dog") transcribe correctly.
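For the long-audio limitation, one simple way to compute window boundaries is sketched below. The helper name chunk_bounds, the 8 s default, and the non-overlapping split are assumptions; the card only recommends windows of roughly 5–10 s. Whether ByrneASR.transcribe accepts anything other than a file path is not stated here, so each window may need to be written to its own WAV file first.

```python
def chunk_bounds(num_samples, sample_rate=16000, window_s=8.0):
    """Return (start, end) sample indices for consecutive,
    non-overlapping windows covering the whole clip."""
    window = int(window_s * sample_rate)
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


bounds = chunk_bounds(num_samples=20 * 16000)  # a 20 s clip at 16 kHz
print(bounds)  # [(0, 128000), (128000, 256000), (256000, 320000)]
```

Fixed boundaries can cut a word in half; splitting at pauses instead, or overlapping the windows slightly, is a common refinement.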

The acoustic model is the ceiling; the LM closes the spelling/word-choice gap, not acoustic gaps.

License

MIT.
