Amharic Whisper (Small) – Custom Tokenizer

Whisper-SAM (Whisper Small Amharic) is a Whisper Small model fine-tuned on ~210 hours of Amharic speech, using a custom character-level tokenizer to fully utilize the 448-token context window and eliminate byte-level inefficiencies.


The Problem

Whisper’s default tokenizer is byte-level, so:

  • English: 1 letter ≈ 1 byte
  • Amharic: 1 letter ≈ 3 bytes
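The byte inflation is easy to verify with plain Python, assuming UTF-8 encoding (which Whisper's byte-level BPE operates over):

```python
# Compare UTF-8 byte counts: Latin vs. Ethiopic (Amharic) characters.
english = "hello"
amharic = "ሰላም"  # "selam" ("hello") -- 3 Ethiopic characters

print(len(english), len(english.encode("utf-8")))  # 5 characters -> 5 bytes
print(len(amharic), len(amharic.encode("utf-8")))  # 3 characters -> 9 bytes
```

Every character in the Ethiopic block (U+1200–U+137F) takes three bytes in UTF-8, so a byte-level tokenizer needs roughly three tokens where a character-level tokenizer needs one.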

That means for Amharic the model ends up predicting multiple byte tokens per character, hitting the decoder's limits faster, and often producing broken byte artifacts such as:

� (the Unicode replacement character, emitted when UTF-8 decoding fails).

Also, Whisper’s decoder has a maximum text token length of ~448 tokens, so truncation happens early if every character takes 3 tokens instead of 1.
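The truncation arithmetic is straightforward, assuming ~3 byte-level tokens per Amharic character as described above:

```python
# Rough decoder capacity under each tokenization scheme
# (assumes ~3 byte-level tokens per Amharic character).
MAX_TOKENS = 448

byte_level_chars = MAX_TOKENS // 3  # ~149 Amharic characters before truncation
char_level_chars = MAX_TOKENS // 1  # 448 Amharic characters

print(byte_level_chars, char_level_chars)  # 149 448
```

With byte-level tokenization, a transcript longer than ~149 Amharic characters already overruns the decoder; with character-level tokenization the full 448-character budget is available.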

Finally, Whisper processes audio in fixed 30-second chunks, converted to a log-mel spectrogram with 80 mel bins as its standard encoder input.
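That fixed 30-second chunk corresponds to a fixed-size spectrogram. A sketch of the frame arithmetic, assuming Whisper's standard 16 kHz sample rate and 160-sample STFT hop (these are Whisper defaults, not stated in this card):

```python
# Whisper's fixed spectrogram geometry (standard Whisper defaults,
# assumed here; not specified in the model card).
SAMPLE_RATE = 16_000  # Hz
CHUNK_SECONDS = 30
HOP_LENGTH = 160      # samples between successive STFT frames
N_MELS = 80           # mel frequency bins

n_frames = SAMPLE_RATE * CHUNK_SECONDS // HOP_LENGTH
print((N_MELS, n_frames))  # (80, 3000) log-mel input
```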


What Changed

  • Character tokenizer: 1 Amharic character = 1 token
  • The 448-token limit now corresponds to 448 actual Amharic characters
  • No artificial byte inflation
  • No broken tokens
  • Fewer tokens to generate per utterance

This means the decoder no longer spends its budget spamming byte tokens and can use its full 448-token context window effectively.
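The idea can be sketched as a minimal character-level tokenizer. This is a hypothetical illustration, not the actual Whisper-SAM tokenizer (whose vocabulary and special tokens are not shown in this card):

```python
# Minimal character-level tokenizer sketch (hypothetical; the real
# Whisper-SAM tokenizer and its special tokens may differ).
class CharTokenizer:
    def __init__(self, corpus, specials=("<|startoftranscript|>", "<|endoftext|>")):
        chars = sorted(set("".join(corpus)))
        self.id_to_token = list(specials) + chars
        self.token_to_id = {t: i for i, t in enumerate(self.id_to_token)}

    def encode(self, text):
        # One token per character -- no byte inflation for Ethiopic script.
        return [self.token_to_id[c] for c in text]

    def decode(self, ids):
        return "".join(self.id_to_token[i] for i in ids)

tok = CharTokenizer(["ሰላም ለዓለም"])
ids = tok.encode("ሰላም")
print(len(ids))         # 3 tokens for 3 characters
print(tok.decode(ids))  # ሰላም
```

Because every character maps to exactly one ID, decoding can never produce a partial UTF-8 sequence, which is why the replacement-character artifacts disappear.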


Result

  • Faster convergence during training
  • Faster inference (fewer tokens to predict)
  • Better token efficiency
  • Better accuracy for the same small model size
  • Cleaner outputs with no incomplete UTF-8 artifacts

Note: Timestamp prediction is not supported in Whisper-SAM. The original Whisper timestamp tokens were not included in the custom tokenizer, so the model cannot generate or infer timestamp outputs.


Credit: If you use Whisper-SAM, credit to Nekodimos would be appreciated.
