MMS-1B Itelmen ASR (Phase 8B Full FT) + Character N-gram LM

Automatic Speech Recognition model for Itelmen (itl, Cyrillic orthography), an endangered Chukotko-Kamchatkan language spoken on the Kamchatka Peninsula, Russia.

Built by full fine-tuning of facebook/mms-1b-all on 313 transcribed utterances, with a character-level 7-gram language model used for beam search decoding.

Performance (313 samples, 3-fold CV best fold)

Decoding	CER	WER
Greedy (no LM)	19.20%	56.45%
Beam Search + 7-gram LM (α=1.5, β=1.0)	14.88%	41.01%

LM improvement: -4.32pp CER / -15.44pp WER

For reference, the earlier adapter-only baseline scored 24.32% CER with greedy decoding and 19.84% CER with the LM. Full fine-tuning alone improves greedy CER by 5.12pp over adapter-only (24.32% → 19.20%), and combined with the tuned LM configuration it gives a total improvement of 9.44pp CER (24.32% → 14.88%).

What the LM fixes

The LM is most effective at restoring vowels dropped by CTC decoding:

Reference:  танаӄ ли гайма скљављатӄу
Greedy:     тнӄ ли гйм с кљвљятӄу     (CER 28.0%)
With LM:    танаӄ ли гайма скљављатӄу   (CER 0.0%)

Reference:  нэн музан хтаанк нтләмстӄзаԓкичэн
Greedy:     нэн нузн хтнк нтлмстӄзԓкичэн    (CER 18.2%)
With LM:    нэн музан хтаанк нтләмстӄзаԓкичэн  (CER 0.0%)
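The per-example CER figures above are consistent with a character-level edit distance that counts spaces and divides by the reference length. The following is a minimal sketch under that assumption; the function names are illustrative and are not taken from this repo, and the card does not state whether any text normalization is applied before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


ref = "танаӄ ли гайма скљављатӄу"
greedy = "тнӄ ли гйм с кљвљятӄу"
print(round(cer(ref, greedy) * 100, 1))  # 28.0 (7 edits over 25 characters)
```

Under this definition the first example needs 7 edits against a 25-character reference, which matches the 28.0% shown for the greedy output.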

Usage

Basic (Greedy Decoding)

from transformers import Wav2Vec2ForCTC, AutoProcessor
import librosa
import torch

model_id = "sut0/mms-1b-itelmen-phase8b"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load the audio resampled to 16 kHz, the rate passed to the processor below
speech, _ = librosa.load("audio.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy decoding: most likely token per frame, then decode to text
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
print(text)  # Itelmen Cyrillic text
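For readers unfamiliar with CTC, the step after `argmax` can be sketched as follows: consecutive repeated ids are merged, then blank ids are dropped. This is a simplified illustration, not the tokenizer's actual code; the function name and `blank_id=0` are assumptions, and the real blank id depends on the model's vocabulary.

```python
import itertools


def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive duplicate ids, then remove blanks (assumed id 0)."""
    merged = [key for key, _ in itertools.groupby(ids)]
    return [i for i in merged if i != blank_id]


# Frame-level predictions: a blank between two 5s keeps them as separate symbols
frames = [0, 5, 5, 0, 5, 7, 7, 0]
print(ctc_greedy_collapse(frames))  # [5, 5, 7]
```

Because each frame is decided independently, a symbol that never wins the per-frame argmax is dropped entirely. That is one way short vowels can go missing in greedy output.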

With LM (Recommended)

Download inference.py and char_ngram_lm.py from this repo, then:

from inference import ItelmenASR

asr = ItelmenASR("sut0/mms-1b-itelmen-phase8b", use_lm=True)
result = asr.transcribe("audio.wav")

print(result["text"])        # Cyrillic: танаӄ ли гайма скљављатӄу
print(result["text_ipa"])    # IPA: tanaq li gajma sklʲawlʲatqu
print(result["method"])      # beam_search_lm

LM Parameters

asr = ItelmenASR(
    "sut0/mms-1b-itelmen-phase8b",
    use_lm=True,
    alpha=1.5,       # LM weight (0.0 = no LM, higher = stronger LM)
    beta=1.0,        # Word insertion bonus
    beam_width=30,   # Beam search width
)

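The defaults α=1.5, β=1.0 were selected by a grid sweep over α∈[0.3, 2.0] and β∈[0.0, 1.0], choosing the pair with the lowest CER on the 313-sample test set. These are the same 313 utterances used for acoustic model training and cross-validation.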
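To illustrate how α and β might enter decoding and how such a sweep can be run, here is a minimal sketch. It assumes the common shallow-fusion form (acoustic log-probability plus α times LM log-probability plus β times word count); the scoring formula actually used in `inference.py` may differ. All function names, the toy hypotheses, the grid values, and the stand-in `toy_cer` are hypothetical.

```python
import itertools


def fused_score(ctc_logprob, lm_logprob, n_words, alpha, beta):
    """Assumed shallow-fusion score: acoustic + alpha * LM + beta * word count."""
    return ctc_logprob + alpha * lm_logprob + beta * n_words


def rescore(hyps, alpha, beta):
    """Return the text of the best (text, ctc_logprob, lm_logprob) candidate."""
    best = max(
        hyps,
        key=lambda h: fused_score(h[1], h[2], len(h[0].split()), alpha, beta),
    )
    return best[0]


def sweep(evaluate, alphas, betas):
    """Return (alpha, beta, score) with the lowest evaluate(alpha, beta)."""
    grid = itertools.product(alphas, betas)
    return min(((a, b, evaluate(a, b)) for a, b in grid), key=lambda t: t[2])


# Toy candidates with made-up log-probabilities
hyps = [("тнӄ ли", -2.0, -9.0), ("танаӄ ли", -2.5, -4.0)]
print(rescore(hyps, alpha=0.0, beta=0.0))  # тнӄ ли (acoustic score only)
print(rescore(hyps, alpha=1.5, beta=1.0))  # танаӄ ли (LM outweighs acoustics)


def toy_cer(alpha, beta):
    """Stand-in for a real CER evaluation over the evaluation set."""
    return (alpha - 1.5) ** 2 + (beta - 1.0) ** 2


print(sweep(toy_cer, [0.3, 0.5, 1.0, 1.5, 2.0], [0.0, 0.5, 1.0]))  # (1.5, 1.0, 0.0)
```

In practice `evaluate` would decode every utterance with the given pair and return the aggregate CER, so the cost of the sweep grows with the grid size times the decoding time.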

CLI

python inference.py audio.wav --use_lm
python inference.py audio_dir/ --use_lm --alpha 1.5 --beta 1.0

Language Model Details

Property	Value
Type	Character-level 7-gram (Kneser-Ney smoothing)
Training corpus	747,744 chars / 23,261 lines / 33,588 unique words
Vocabulary	89 characters (Cyrillic + Itelmen-specific)
Unique 7-grams	405,616
Format	JSON (Pure Python, no KenLM dependency)
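As a rough picture of what a pure-Python character n-gram LM stored as JSON can look like, here is a minimal sketch. It deliberately swaps in add-k smoothing in place of Kneser-Ney to stay short, so it is not the model described in the table. The class name, padding symbols, and JSON layout are hypothetical and do not reflect `char_ngram_lm.py` or the format of the shipped LM files.

```python
import json
import math
from collections import Counter, defaultdict


class CharNgramLM:
    """Character n-gram LM with add-k smoothing (simplified stand-in for Kneser-Ney)."""

    def __init__(self, order=7, k=0.1):
        self.order = order
        self.k = k
        self.counts = defaultdict(Counter)  # context string -> next-char counts
        self.vocab = set()

    def _pad(self, text):
        # "^" pads the start, "$" marks the end (both are assumed symbols)
        return "^" * (self.order - 1) + text + "$"

    def train(self, lines):
        for line in lines:
            text = self._pad(line)
            self.vocab.update(text)
            for i in range(self.order - 1, len(text)):
                context = text[i - self.order + 1:i]
                self.counts[context][text[i]] += 1

    def logprob(self, context, char):
        """log P(char | last order-1 characters of context)."""
        context = context[-(self.order - 1):] if self.order > 1 else ""
        seen = self.counts.get(context, Counter())
        num = seen[char] + self.k
        den = sum(seen.values()) + self.k * len(self.vocab)
        return math.log(num / den)

    def score(self, text):
        """Total log-probability of text, including the end marker."""
        padded = self._pad(text)
        return sum(
            self.logprob(padded[:i], padded[i])
            for i in range(self.order - 1, len(padded))
        )

    def to_json(self):
        payload = {
            "order": self.order,
            "k": self.k,
            "vocab": sorted(self.vocab),
            "counts": {ctx: dict(nxt) for ctx, nxt in self.counts.items()},
        }
        return json.dumps(payload, ensure_ascii=False)


lm = CharNgramLM(order=3)
lm.train(["танаӄ ли", "танаӄ"])
# The form with vowels scores higher than the vowel-dropped form
print(lm.score("танаӄ") > lm.score("тнӄ"))  # True
```

A model like this prefers character sequences it has seen in the corpus, which is how an LM can favor a hypothesis with its vowels restored during beam search.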

Corpus Sources

Source	Lines	Characters	Description
DH-North Dictionary	10,975	471,700	Headwords + example sentences
Bogoras 1901 Notebooks	1,834	74,275	Historical texts (normalized)
EsxLenin PDF	1,120	56,012	Literary text
Harvard IAVD Dictionary	7,425	52,303	Dictionary entries
Rasskazy/Rezepty/Skazki	1,457	71,566	Stories, recipes, folktales
ELAR Transcriptions	829	27,266	Natural speech
ASR Metadata	313	15,002	Training transcriptions
Shared Corpus	353	15,309	Dialogue text
Russian Supplement	240	11,587	Mixed-speech markers (1.55%)

LM Correction Patterns

Pattern	Description	Frequency
Vowel restoration	Restores CTC-dropped vowels (к->ка, тн->тан)	Most common
Word boundary	Splits merged words (кикзуњ -> и кзуњ)	Common
Morpheme correction	Restores suffix forms (ӄзэн -> ӄзукнэн)	Common
Consonant correction	Fixes similar consonants (т->к, с->ӽ)	Moderate
Long vowel restoration	Short to long vowel (тӽинк -> тӽиинк)	Less common

Model Architecture

Base model: facebook/mms-1b-all (Wav2Vec2, 1B parameters)
Fine-tuning: Full fine-tuning (all ~965M parameters)
Hyperparameters: lr=1e-5, effective batch=16 (per_device=2 × grad_accum=8), fp16, gradient checkpointing
Vocabulary: 51 tokens (Itelmen Cyrillic)
Training data: 313 utterances, 3-fold cross-validation
3-fold CV mean CER: 27.73% (vs Adapter-only 31.71%)
Best fold (this checkpoint): fold 2, CER 26.94% (eval), best epoch 23 of 80, early-stopped at epoch 26
CTC loss: Standard CTC with greedy decoding baseline
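The data split and batch arithmetic above can be sketched in plain Python. This is only an illustration: the card does not say how folds were assigned, so the shuffling, the seed, and the function name are assumptions.

```python
import random


def k_fold_indices(n_items, k=3, seed=0):
    """Return a list of (train_indices, eval_indices) pairs for k-fold CV."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumed: random fold assignment
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, folds[i]))
    return splits


splits = k_fold_indices(313, k=3)
for fold_no, (train, held_out) in enumerate(splits, start=1):
    print(fold_no, len(train), len(held_out))

# Effective batch size from the listed hyperparameters
per_device_batch = 2
grad_accum_steps = 8
print(per_device_batch * grad_accum_steps)  # 16
```

With 313 utterances and three folds, each model trains on roughly 208 utterances and is evaluated on roughly 104 to 105.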

Files

File	Description
`model.safetensors`	Model weights (3.9GB)
`inference.py`	Inference module with LM integration
`char_ngram_lm.py`	Character n-gram LM implementation
`lm/itelmen_char_7gram.json`	7-gram LM (best, 33MB)
`lm/itelmen_char_5gram.json`	5-gram LM (lighter, 6MB)

Limitations

Russian mixed speech: The LM is trained primarily on Itelmen text. For utterances that are mostly Russian, the LM may degrade accuracy.
Severely corrupted output: When the greedy CTC output is very far from the reference, the LM may push it further from the correct answer.
Training data size: Only 313 utterances were used for acoustic model training. Performance improves with LM but remains limited by the small training set.

Citation

If you use this model in your research, please cite:

@misc{itelmen-asr-2025,
  title={MMS-1B Itelmen ASR with Character N-gram Language Model},
  author={sut0},
  year={2025},
  url={https://huggingface.co/sut0/mms-1b-itelmen-phase8b}
}

Acknowledgments

Corpus sources include materials from:

DH-North Comprehensive Itelmen Dictionary
ELAR Archive (Endangered Languages Archive)
Harvard IAVD (Itelmen Audio-Visual Dictionary)
Bogoras 1901 Itelmen Notebooks
