MMS-1B Itelmen ASR (Phase 8B Full FT) + Character N-gram LM

Automatic Speech Recognition model for Itelmen (itl, Cyrillic orthography), an endangered Chukotko-Kamchatkan language spoken on the Kamchatka Peninsula, Russia.

Built by full fine-tuning of facebook/mms-1b-all on 313 transcribed utterances, with a character-level 7-gram language model used for beam-search decoding.

Performance (313 samples, 3-fold CV best fold)

| Decoding | CER | WER |
| --- | --- | --- |
| Greedy (no LM) | 19.20% | 56.45% |
| Beam Search + 7-gram LM (α=1.5, β=1.0) | 14.88% | 41.01% |

LM improvement: -4.32pp CER / -15.44pp WER

For reference, the prior adapter-only baseline reached 24.32% CER with greedy decoding and 19.84% CER with the LM. Full fine-tuning alone lowers greedy CER by 5.12pp (24.32% → 19.20%), and combined with the tuned LM configuration the total reduction is 9.44pp (24.32% → 14.88%).

What the LM fixes

The LM is most effective at restoring vowels dropped by CTC decoding:

Reference:  танаӄ ли гайма скљављатӄу
Greedy:     тнӄ ли гйм с кљвљятӄу        (CER 28.0%)
With LM:    танаӄ ли гайма скљављатӄу    (CER 0.0%)

Reference:  нэн музан хтаанк нтләмстӄзаԓкичэн
Greedy:     нэн нузн хтнк нтлмстӄзԓкичэн       (CER 18.2%)
With LM:    нэн музан хтаанк нтләмстӄзаԓкичэн  (CER 0.0%)
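
The CER figures in these examples are consistent with a character-level Levenshtein distance divided by the reference length, with spaces counted as characters. The sketch below is a minimal, dependency-free way to reproduce them; the function names are illustrative and are not taken from this repo's scripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate; spaces count as characters."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


ref = "танаӄ ли гайма скљављатӄу"
greedy = "тнӄ ли гйм с кљвљятӄу"
print(f"{100 * cer(ref, greedy):.1f}%")  # 28.0%
```

For the first example this gives 7 edits over 25 reference characters: four dropped vowels, one more dropped vowel and one а → я substitution in the last word, and one inserted space.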

Usage

Basic (Greedy Decoding)

from transformers import Wav2Vec2ForCTC, AutoProcessor
import librosa
import torch

model_id = "sut0/mms-1b-itelmen-phase8b"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# The model expects 16 kHz mono audio; librosa resamples on load.
speech, _ = librosa.load("audio.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

# Forward pass without gradients, then pick the most likely token per frame.
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = torch.argmax(logits, dim=-1)

# batch_decode collapses repeated tokens and removes CTC blanks.
text = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
print(text)  # Itelmen Cyrillic text
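
To make the decoding step concrete, here is a minimal sketch of what greedy CTC decoding does with the per-frame argmax ids: consecutive repeats are merged, then blank tokens are dropped. The toy vocabulary and the blank id of 0 are assumptions for illustration only; the real tokenizer has 51 tokens and its own id mapping.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive repeated ids, then drop the CTC blank."""
    merged = [key for key, _ in groupby(ids)]
    return [i for i in merged if i != blank_id]


# Toy vocabulary (hypothetical; not the model's actual token ids).
vocab = {0: "<pad>", 1: "т", 2: "а", 3: "н", 4: "ӄ"}

# Frame-level argmax ids, as produced by torch.argmax(logits, dim=-1).
frame_ids = [1, 1, 0, 2, 0, 3, 3, 2, 2, 0, 4]
text = "".join(vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(text)  # танаӄ
```

If the frames of a short vowel are predicted as blank instead, the vowel vanishes from the output, which is the kind of error the language model is reported to repair.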

With LM (Recommended)

Download inference.py and char_ngram_lm.py from this repo, then:

from inference import ItelmenASR

asr = ItelmenASR("sut0/mms-1b-itelmen-phase8b", use_lm=True)
result = asr.transcribe("audio.wav")

print(result["text"])        # Cyrillic: танаӄ ли гайма скљављатӄу
print(result["text_ipa"])    # IPA: tanaq li gajma sklʲawlʲatqu
print(result["method"])      # beam_search_lm

LM Parameters

asr = ItelmenASR(
    "sut0/mms-1b-itelmen-phase8b",
    use_lm=True,
    alpha=1.5,       # LM weight (0.0 = no LM, higher = stronger LM)
    beta=1.0,        # Word insertion bonus
    beam_width=30,   # Beam search width
)
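
As a rough illustration of how alpha and beta interact, the sketch below scores beam hypotheses with the conventional shallow-fusion formula: acoustic log-probability plus alpha times LM log-probability plus beta times the word count. This is an assumption about the general technique; inference.py may combine the terms differently, and the log-probabilities here are made-up numbers.

```python
import math


def fused_score(ctc_logp, lm_logp, text, alpha=1.5, beta=1.0):
    """Combine acoustic and LM evidence for one beam hypothesis."""
    n_words = len(text.split())
    return ctc_logp + alpha * lm_logp + beta * n_words


def pick_best(candidates, alpha=1.5, beta=1.0):
    """Return the text of the highest-scoring hypothesis."""
    return max(
        candidates,
        key=lambda c: fused_score(c["ctc_logp"], c["lm_logp"], c["text"], alpha, beta),
    )["text"]


# Hypothetical scores: the acoustic model prefers the vowel-less form,
# the LM prefers the form with vowels restored.
candidates = [
    {"text": "тнӄ ли", "ctc_logp": math.log(0.6), "lm_logp": math.log(0.01)},
    {"text": "танаӄ ли", "ctc_logp": math.log(0.3), "lm_logp": math.log(0.2)},
]

print(pick_best(candidates, alpha=0.0))  # тнӄ ли
print(pick_best(candidates, alpha=1.5))  # танаӄ ли
```

With alpha at 0.0 the ranking follows the acoustic score alone, which matches the "0.0 = no LM" note in the parameter comments.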

The defaults α=1.5, β=1.0 were selected by sweeping α∈[0.3,2.0], β∈[0.0,1.0] on the 313-sample test set and minimizing CER.
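
A sweep of this kind can be sketched as a plain grid search. The step sizes below and the `toy_cer` stand-in are assumptions; in practice the scoring callback would decode the evaluation audio with the given alpha and beta and return the measured CER.

```python
from itertools import product


def sweep(evaluate_cer, alphas, betas):
    """Return (best_cer, alpha, beta) over the full grid."""
    return min((evaluate_cer(a, b), a, b) for a, b in product(alphas, betas))


# Hypothetical grid; the step sizes actually used are not documented here.
alphas = [round(0.3 + 0.1 * i, 1) for i in range(18)]  # 0.3 ... 2.0
betas = [round(0.25 * i, 2) for i in range(5)]         # 0.0 ... 1.0


def toy_cer(alpha, beta):
    """Stand-in error surface with its minimum placed at (1.5, 1.0)."""
    return (alpha - 1.5) ** 2 + (beta - 1.0) ** 2


best_cer, best_alpha, best_beta = sweep(toy_cer, alphas, betas)
print(best_alpha, best_beta)  # 1.5 1.0
```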

CLI

python inference.py audio.wav --use_lm
python inference.py audio_dir/ --use_lm --alpha 1.5 --beta 1.0

Language Model Details

| Property | Value |
| --- | --- |
| Type | Character-level 7-gram (Kneser-Ney smoothing) |
| Training corpus | 747,744 chars / 23,261 lines / 33,588 unique words |
| Vocabulary | 89 characters (Cyrillic + Itelmen-specific) |
| Unique 7-grams | 405,616 |
| Format | JSON (pure Python, no KenLM dependency) |
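
To show the general shape of a character n-gram model, here is a small self-contained sketch. It deliberately swaps in add-k smoothing for brevity, so it is not the Kneser-Ney model shipped in char_ngram_lm.py, and the class name, padding symbols, and two-line training corpus are all hypothetical.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram LM with add-k smoothing (a simplification, not Kneser-Ney)."""

    def __init__(self, n=7, k=0.1):
        self.n, self.k = n, k
        self.ngrams, self.contexts, self.vocab = Counter(), Counter(), set()

    def _pad(self, text):
        # "^" marks the start context, "$" marks end of line.
        return "^" * (self.n - 1) + text + "$"

    def fit(self, lines):
        for line in lines:
            padded = self._pad(line)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.ngrams[context + padded[i]] += 1
                self.contexts[context] += 1
        return self

    def logprob(self, text):
        padded = self._pad(text)
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            num = self.ngrams[context + padded[i]] + self.k
            den = self.contexts[context] + self.k * len(self.vocab)
            total += math.log(num / den)
        return total


lm = CharNgramLM(n=3).fit(["танаӄ ли", "танаӄ гайма"])
print(lm.logprob("танаӄ") > lm.logprob("тнӄ"))  # True
```

Even this toy model prefers the spelling with vowels, because its character sequences were seen in training while those of the vowel-less form were not.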

Corpus Sources

| Source | Lines | Characters | Description |
| --- | --- | --- | --- |
| DH-North Dictionary | 10,975 | 471,700 | Headwords + example sentences |
| Bogoras 1901 Notebooks | 1,834 | 74,275 | Historical texts (normalized) |
| EsxLenin PDF | 1,120 | 56,012 | Literary text |
| Harvard IAVD Dictionary | 7,425 | 52,303 | Dictionary entries |
| Rasskazy/Rezepty/Skazki | 1,457 | 71,566 | Stories, recipes, folktales |
| ELAR Transcriptions | 829 | 27,266 | Natural speech |
| ASR Metadata | 313 | 15,002 | Training transcriptions |
| Shared Corpus | 353 | 15,309 | Dialogue text |
| Russian Supplement | 240 | 11,587 | Mixed-speech markers (1.55%) |

LM Correction Patterns

| Pattern | Description | Frequency |
| --- | --- | --- |
| Vowel restoration | Restores CTC-dropped vowels (к → ка, тн → тан) | Most common |
| Word boundary | Splits merged words (кикзуњ → и кзуњ) | Common |
| Morpheme correction | Restores suffix forms (ӄзэн → ӄзукнэн) | Common |
| Consonant correction | Fixes similar consonants (т → к, с → ӽ) | Moderate |
| Long vowel restoration | Short to long vowel (тӽинк → тӽиинк) | Less common |

Model Architecture

  • Base model: facebook/mms-1b-all (Wav2Vec2, 1B parameters)
  • Fine-tuning: Full fine-tuning (all ~965M parameters)
  • Hyperparameters: lr=1e-5, effective batch=16 (per_device=2 × grad_accum=8), fp16, gradient checkpointing
  • Vocabulary: 51 tokens (Itelmen Cyrillic)
  • Training data: 313 utterances, 3-fold cross-validation
  • 3-fold CV mean CER: 27.73% (vs Adapter-only 31.71%)
  • Best fold (this checkpoint): fold 2, eval CER 26.94%, best epoch 23 of 80, early-stopped at epoch 26
  • CTC loss: Standard CTC with greedy decoding baseline
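
For readers who want to set up a comparable 3-fold protocol, the sketch below partitions 313 utterance indices into three near-equal folds. The shuffling, the seed, and the function name are assumptions; the split actually used for this checkpoint is not documented here.

```python
import random


def kfold_indices(n_items, k=3, seed=0):
    """Yield (train_idx, eval_idx) pairs for k shuffled, near-equal folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)  # seed is a hypothetical choice
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


for fold, (train, held_out) in enumerate(kfold_indices(313)):
    print(fold, len(train), len(held_out))
# 0 208 105
# 1 209 104
# 2 209 104
```

Each fold trains on roughly 208 utterances and evaluates on the remaining 104 or 105.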

Files

| File | Description |
| --- | --- |
| model.safetensors | Model weights (3.9GB) |
| inference.py | Inference module with LM integration |
| char_ngram_lm.py | Character n-gram LM implementation |
| lm/itelmen_char_7gram.json | 7-gram LM (best, 33MB) |
| lm/itelmen_char_5gram.json | 5-gram LM (lighter, 6MB) |

Limitations

  • Russian mixed speech: The LM is trained primarily on Itelmen text. For utterances that are mostly Russian, the LM may degrade accuracy.
  • Severely corrupted output: When the greedy CTC output is very far from the reference, the LM may push it further from the correct answer.
  • Training data size: Only 313 utterances were used for acoustic model training. Performance improves with LM but remains limited by the small training set.

Citation

If you use this model in your research, please cite:

@misc{itelmen-asr-2025,
  title={MMS-1B Itelmen ASR with Character N-gram Language Model},
  author={sut0},
  year={2025},
  url={https://huggingface.co/sut0/mms-1b-itelmen-phase8b}
}

Acknowledgments

Corpus sources include materials from:

  • DH-North Comprehensive Itelmen Dictionary
  • ELAR Archive (Endangered Languages Archive)
  • Harvard IAVD (Itelmen Audio-Visual Dictionary)
  • Bogoras 1901 Itelmen Notebooks