MMS-1B Itelmen ASR (Phase 8B Full FT) + Character N-gram LM
Automatic Speech Recognition model for Itelmen (itl, Cyrillic orthography), an endangered Chukotko-Kamchatkan language spoken on the Kamchatka Peninsula, Russia.
Built by full fine-tuning of facebook/mms-1b-all on 313 transcribed utterances, with a character-level 7-gram language model used for beam search decoding.
Performance (313 samples, 3-fold CV best fold)
| Decoding | CER | WER |
|---|---|---|
| Greedy (no LM) | 19.20% | 56.45% |
| Beam Search + 7-gram LM (α=1.5, β=1.0) | 14.88% | 41.01% |
LM improvement: -4.32pp CER / -15.44pp WER
For reference, the prior Adapter-only baseline reached 24.32% CER with greedy decoding and 19.84% with the LM. Full fine-tuning alone lowers greedy CER by 5.12pp relative to Adapter-only (24.32% → 19.20%), and combined with the tuned LM configuration the total reduction is 9.44pp CER (24.32% → 14.88%).
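The deltas quoted above are absolute percentage-point differences. As a quick check, the arithmetic can be reproduced from the reported values alone; the sketch below also derives the corresponding relative reductions. The helper names `pp_delta` and `relative_reduction` are illustrative and not part of this repo.

```python
# CER / WER values copied from the performance table (percent).
greedy = {"cer": 19.20, "wer": 56.45}
with_lm = {"cer": 14.88, "wer": 41.01}
adapter_greedy_cer = 24.32  # prior Adapter-only baseline, greedy decoding


def pp_delta(before, after):
    """Absolute change in percentage points (negative means improvement)."""
    return round(after - before, 2)


def relative_reduction(before, after):
    """Reduction expressed as a percentage of the starting error rate."""
    return round(100 * (before - after) / before, 1)


for metric in ("cer", "wer"):
    print(
        metric.upper(),
        pp_delta(greedy[metric], with_lm[metric]), "pp,",
        relative_reduction(greedy[metric], with_lm[metric]), "% relative",
    )

print(
    "Adapter-only greedy -> Full FT + LM:",
    pp_delta(adapter_greedy_cer, with_lm["cer"]), "pp CER",
)
```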
What the LM fixes
The LM is most effective at restoring vowels dropped by CTC decoding:
Reference: танаӄ ли гайма скљављатӄу
Greedy: тнӄ ли гйм с кљвљятӄу (CER 28.0%)
With LM: танаӄ ли гайма скљављатӄу (CER 0.0%)
Reference: нэн музан хтаанк нтләмстӄзаԓкичэн
Greedy: нэн нузн хтнк нтлмстӄзԓкичэн (CER 18.2%)
With LM: нэн музан хтаанк нтләмстӄзаԓкичэн (CER 0.0%)
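The per-utterance CER figures above can be reproduced with a plain Levenshtein distance, assuming the usual definition of CER (character edit distance divided by reference length). This is a minimal sketch; the function names are illustrative, and the evaluation code actually used for this model may differ in details such as text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits / reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: the same distance over whitespace-split words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


ref_1 = "танаӄ ли гайма скљављатӄу"
greedy_1 = "тнӄ ли гйм с кљвљятӄу"
ref_2 = "нэн музан хтаанк нтләмстӄзаԓкичэн"
greedy_2 = "нэн нузн хтнк нтлмстӄзԓкичэн"

print(f"{cer(ref_1, greedy_1):.1%}")  # 28.0% (7 edits / 25 characters)
print(f"{cer(ref_2, greedy_2):.1%}")  # 18.2% (6 edits / 33 characters)
```

Most of the edits in both examples are vowel deletions, which is why restoring vowels accounts for nearly all of the gain.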
Usage
Basic (Greedy Decoding)
from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch
import librosa

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
model_id = "sut0/mms-1b-itelmen-phase8b"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load the audio, resampled to the 16 kHz rate the model expects
speech, _ = librosa.load("audio.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy decoding: most likely token per frame, then decode to text
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
print(text)  # Itelmen Cyrillic text
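What the argmax-and-decode step does can be sketched in a few lines: take the best token per frame, merge consecutive repeats, then drop the CTC blank. The toy vocabulary, the frame sequence, and `blank_id=0` below are made up for illustration; in the real tokenizer the blank is the padding token, and `batch_decode` performs the collapse internally.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax IDs: merge repeats, then remove blanks."""
    collapsed = []
    prev = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != prev and token_id != blank_id:
            collapsed.append(token_id)
        prev = token_id
    return collapsed


# Hypothetical 4-symbol vocabulary (ID 0 is the blank).
toy_vocab = {0: "<blank>", 1: "т", 2: "а", 3: "н"}

# One ID per audio frame, as torch.argmax(logits, dim=-1) would produce.
frames = [1, 1, 0, 2, 2, 0, 0, 3, 3, 2]

ids = ctc_greedy_collapse(frames)
print("".join(toy_vocab[i] for i in ids))  # тана
```

A character is emitted only if it wins the argmax in at least one frame, so short or weakly articulated sounds can be absent from the greedy output.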
With LM (Recommended)
Download inference.py and char_ngram_lm.py from this repo, then:
from inference import ItelmenASR
asr = ItelmenASR("sut0/mms-1b-itelmen-phase8b", use_lm=True)
result = asr.transcribe("audio.wav")
print(result["text"]) # Cyrillic: танаӄ ли гайма скљављатӄу
print(result["text_ipa"]) # IPA: tanaq li gajma sklʲawlʲatqu
print(result["method"]) # beam_search_lm
LM Parameters
asr = ItelmenASR(
    "sut0/mms-1b-itelmen-phase8b",
    use_lm=True,
    alpha=1.5,      # LM weight (0.0 = no LM, higher = stronger LM)
    beta=1.0,       # Word insertion bonus
    beam_width=30,  # Beam search width
)
The defaults α=1.5, β=1.0 were selected by sweeping α∈[0.3,2.0], β∈[0.0,1.0] on the 313-sample test set and minimizing CER.
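For orientation, LM-fused CTC beam search conventionally scores a hypothesis as log P_CTC + α · log P_LM + β · (word count). Whether inference.py uses exactly this form is an assumption. The sketch below pairs that score with a plain grid search over the ranges quoted above; `fused_score`, `sweep`, the caller-supplied `evaluate_cer`, and the grid step sizes are all hypothetical.

```python
import itertools


def fused_score(ctc_logp, lm_logp, n_words, alpha, beta):
    """Shallow-fusion score (assumed form): acoustic + weighted LM + word bonus."""
    return ctc_logp + alpha * lm_logp + beta * n_words


def sweep(evaluate_cer, alphas, betas):
    """Return (best_cer, alpha, beta) over the full grid.

    evaluate_cer(alpha, beta) must decode the evaluation set with those
    settings and return its CER; it is supplied by the caller.
    """
    return min(
        (evaluate_cer(a, b), a, b)
        for a, b in itertools.product(alphas, betas)
    )


# Grid covering the quoted ranges; the 0.1 and 0.25 steps are assumptions.
alphas = [round(0.3 + 0.1 * i, 1) for i in range(18)]  # 0.3 ... 2.0
betas = [round(0.25 * i, 2) for i in range(5)]         # 0.0 ... 1.0
```

With a real `evaluate_cer` that wraps `ItelmenASR(..., alpha=a, beta=b)`, calling `sweep(evaluate_cer, alphas, betas)` returns the lowest-CER pair on whatever data the callable scores.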
CLI
python inference.py audio.wav --use_lm
python inference.py audio_dir/ --use_lm --alpha 1.5 --beta 1.0
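The directory form of the CLI can be approximated from Python as well. The sketch below relies only on the `transcribe()` call and the `"text"` key shown earlier; the helper name `transcribe_dir` and the `*.wav` pattern are assumptions, and the CLI itself may select files differently.

```python
from pathlib import Path


def transcribe_dir(asr, audio_dir, pattern="*.wav"):
    """Transcribe every matching file in audio_dir; returns {filename: text}."""
    results = {}
    for path in sorted(Path(audio_dir).glob(pattern)):
        # transcribe() is assumed to accept a file path and return a dict
        # containing a "text" entry, as in the usage example above.
        results[path.name] = asr.transcribe(str(path))["text"]
    return results
```

Typical use would be `transcribe_dir(ItelmenASR("sut0/mms-1b-itelmen-phase8b", use_lm=True), "audio_dir/")`.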
Language Model Details
| Property | Value |
|---|---|
| Type | Character-level 7-gram (Kneser-Ney smoothing) |
| Training corpus | 747,744 chars / 23,261 lines / 33,588 unique words |
| Vocabulary | 89 characters (Cyrillic + Itelmen-specific) |
| Unique 7-grams | 405,616 |
| Format | JSON (Pure Python, no KenLM dependency) |
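To illustrate how a character n-gram LM scores text, here is a toy model. It is not the implementation in char_ngram_lm.py: it uses add-k smoothing in place of Kneser-Ney, an order of 3 instead of 7, and a two-line training set, and the class name `ToyCharNgramLM` is hypothetical.

```python
import math
from collections import Counter


class ToyCharNgramLM:
    """Character n-gram LM with add-k smoothing (a stand-in for Kneser-Ney)."""

    def __init__(self, order=3, k=0.1):
        self.order = order
        self.k = k
        self.ngrams = Counter()    # (context, next_char) -> count
        self.contexts = Counter()  # context -> count
        self.vocab = set()

    def _pad(self, text):
        # "^" marks the start of a line and "$" its end.
        return "^" * (self.order - 1) + text + "$"

    def train(self, lines):
        for line in lines:
            text = self._pad(line)
            self.vocab.update(text)
            for i in range(self.order - 1, len(text)):
                ctx = text[i - self.order + 1:i]
                self.ngrams[(ctx, text[i])] += 1
                self.contexts[ctx] += 1

    def logprob(self, text):
        """Natural-log probability of text under the smoothed model."""
        padded = self._pad(text)
        vocab_size = len(self.vocab)
        total = 0.0
        for i in range(self.order - 1, len(padded)):
            ctx = padded[i - self.order + 1:i]
            num = self.ngrams[(ctx, padded[i])] + self.k
            den = self.contexts[ctx] + self.k * vocab_size
            total += math.log(num / den)
        return total


lm = ToyCharNgramLM(order=3)
lm.train(["танаӄ ли гайма скљављатӄу", "нэн музан хтаанк нтләмстӄзаԓкичэн"])

# The fully vowelled form scores higher than the vowel-less one,
# even though it is longer.
print(lm.logprob("танаӄ") > lm.logprob("тнӄ"))  # True
```

This preference for character sequences seen in the corpus is what lets beam search favour hypotheses with the vowels in place.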
Corpus Sources
| Source | Lines | Characters | Description |
|---|---|---|---|
| DH-North Dictionary | 10,975 | 471,700 | Headwords + example sentences |
| Bogoras 1901 Notebooks | 1,834 | 74,275 | Historical texts (normalized) |
| EsxLenin PDF | 1,120 | 56,012 | Literary text |
| Harvard IAVD Dictionary | 7,425 | 52,303 | Dictionary entries |
| Rasskazy/Rezepty/Skazki | 1,457 | 71,566 | Stories, recipes, folktales |
| ELAR Transcriptions | 829 | 27,266 | Natural speech |
| ASR Metadata | 313 | 15,002 | Training transcriptions |
| Shared Corpus | 353 | 15,309 | Dialogue text |
| Russian Supplement | 240 | 11,587 | Mixed-speech markers (1.55%) |
LM Correction Patterns
| Pattern | Description | Frequency |
|---|---|---|
| Vowel restoration | Restores CTC-dropped vowels (к->ка, тн->тан) | Most common |
| Word boundary | Splits merged words (кикзуњ -> и кзуњ) | Common |
| Morpheme correction | Restores suffix forms (ӄзэн -> ӄзукнэн) | Common |
| Consonant correction | Fixes similar consonants (т->к, с->ӽ) | Moderate |
| Long vowel restoration | Short to long vowel (тӽинк -> тӽиинк) | Less common |
Model Architecture
- Base model: facebook/mms-1b-all (Wav2Vec2, 1B parameters)
- Fine-tuning: Full fine-tuning (all ~965M parameters)
- Hyperparameters: lr=1e-5, effective batch=16 (per_device=2 × grad_accum=8), fp16, gradient checkpointing
- Vocabulary: 51 tokens (Itelmen Cyrillic)
- Training data: 313 utterances, 3-fold cross-validation
- 3-fold CV mean CER: 27.73% (vs Adapter-only 31.71%)
- Best fold (this checkpoint): fold 2, eval CER 26.94%, best epoch 23 of 80, early-stopped at epoch 26
- Loss: standard CTC; greedy decoding serves as the baseline
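The batch and fold figures above imply very few optimizer updates per epoch. A back-of-the-envelope sketch follows, assuming a single device, folds split as evenly as possible, and a partial final batch counted as one update. None of these details are stated in this card, and the fold numbering in the output is arbitrary.

```python
import math

N_UTTERANCES = 313
N_FOLDS = 3
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8


def fold_sizes(n_items, n_folds):
    """Held-out fold sizes when items are split as evenly as possible."""
    base, extra = divmod(n_items, n_folds)
    return [base + (1 if i < extra else 0) for i in range(n_folds)]


# Effective batch = per-device batch x accumulation steps (single device assumed).
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM

for fold, held_out in enumerate(fold_sizes(N_UTTERANCES, N_FOLDS)):
    train = N_UTTERANCES - held_out
    updates = math.ceil(train / effective_batch)
    print(f"fold {fold}: train={train}, eval={held_out}, ~{updates} updates/epoch")
```

Under these assumptions each fold trains on 208 or 209 utterances, which works out to roughly 13 to 14 updates per epoch.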
Files
| File | Description |
|---|---|
| model.safetensors | Model weights (3.9GB) |
| inference.py | Inference module with LM integration |
| char_ngram_lm.py | Character n-gram LM implementation |
| lm/itelmen_char_7gram.json | 7-gram LM (best, 33MB) |
| lm/itelmen_char_5gram.json | 5-gram LM (lighter, 6MB) |
Limitations
- Russian mixed speech: The LM is trained primarily on Itelmen text. For utterances that are mostly Russian, the LM may degrade accuracy.
- Severely corrupted output: When the greedy CTC output is very far from the reference, the LM may push it further from the correct answer.
- Training data size: Only 313 utterances were used for acoustic model training. Performance improves with LM but remains limited by the small training set.
Citation
If you use this model in your research, please cite:
@misc{itelmen-asr-2025,
title={MMS-1B Itelmen ASR with Character N-gram Language Model},
author={sut0},
year={2025},
url={https://huggingface.co/sut0/mms-1b-itelmen-phase8b}
}
Acknowledgments
Corpus sources include materials from:
- DH-North Comprehensive Itelmen Dictionary
- ELAR Archive (Endangered Languages Archive)
- Harvard IAVD (Itelmen Audio-Visual Dictionary)
- Bogoras 1901 Itelmen Notebooks