Amharic Whisper (Small) – Custom Tokenizer
Whisper-SAM (Whisper Small Amharic) is a Whisper Small model fine-tuned on ~210 hours of Amharic speech, using a custom character-level tokenizer to fully utilize the 448-token context window and eliminate byte-level inefficiencies.
The Problem
Whisper’s default tokenizer is byte-level, so:
- English: 1 letter ≈ 1 byte
- Amharic: 1 letter ≈ 3 bytes
That means for Amharic you end up predicting multiple byte tokens per character, hitting the decoder’s limits faster, and often getting broken byte artifacts like � (the Unicode replacement character that appears when UTF-8 decoding fails).
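The inflation is easy to verify: every character in the Ethiopic Unicode block occupies three bytes in UTF-8, so a byte-level tokenizer sees at least three units per Amharic letter.

```python
# UTF-8 byte counts: Amharic inflates 3x relative to ASCII.
english = "selam"  # 5 ASCII letters
amharic = "ሰላም"    # 3 Ethiopic letters ("selam" = hello)

print(len(english), len(english.encode("utf-8")))  # 5 characters -> 5 bytes
print(len(amharic), len(amharic.encode("utf-8")))  # 3 characters -> 9 bytes
```

Truncate one of those 9 bytes mid-character and the decoded text contains exactly the � artifact described above.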
Also, Whisper’s decoder has a maximum text length of 448 tokens, so truncation kicks in early when every character costs 3 tokens instead of 1.
Finally, Whisper’s encoder always consumes a fixed 30-second audio chunk, represented as an 80-channel log-mel spectrogram — so everything spoken in those 30 seconds has to fit inside that 448-token text budget.
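A back-of-envelope calculation shows how much of the budget byte-level tokenization eats (special tokens are ignored here for simplicity):

```python
# Rough character budget of Whisper's 448-token decoder window.
MAX_TOKENS = 448
BYTES_PER_AMHARIC_CHAR = 3  # UTF-8 width of Ethiopic characters

byte_level_budget = MAX_TOKENS // BYTES_PER_AMHARIC_CHAR
char_level_budget = MAX_TOKENS  # with a character-level tokenizer

print(byte_level_budget)  # ~149 Amharic characters fit
print(char_level_budget)  # 448 Amharic characters fit
```

Roughly 149 characters is not much for 30 seconds of speech, which is why long utterances get cut off with the default tokenizer.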
What Changed
- Character tokenizer: 1 Amharic character = 1 token
- The 448-token limit now corresponds to 448 real Amharic characters
- No artificial byte inflation
- No broken � tokens from partial UTF-8 sequences
- Fewer tokens to generate per utterance
This means the decoder isn’t busy spamming byte tokens and can use its full context window (448 text-token capacity) effectively.
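A character-level tokenizer of this kind can be sketched in a few lines. Note that the vocabulary construction and special tokens below are illustrative assumptions, not the actual Whisper-SAM vocabulary:

```python
# Minimal character-level tokenizer sketch (illustrative only; the real
# Whisper-SAM vocabulary and special tokens may differ).
ETHIOPIC = [chr(cp) for cp in range(0x1200, 0x1380)]  # Ethiopic Unicode block
SPECIALS = ["<|startoftranscript|>", "<|am|>", "<|transcribe|>", "<|endoftext|>"]

vocab = {tok: i for i, tok in enumerate(SPECIALS + ETHIOPIC + [" "])}
inv_vocab = {i: tok for tok, i in vocab.items()}

def encode(text: str) -> list[int]:
    """One Amharic character -> one token id."""
    return [vocab[ch] for ch in text]

def decode(ids: list[int]) -> str:
    return "".join(inv_vocab[i] for i in ids)

ids = encode("ሰላም")
print(len(ids))     # 3 tokens for 3 characters (byte-level would need 9)
print(decode(ids))  # round-trips back to "ሰላም"
```

The decoder’s embedding table then maps each Amharic character to a single learned vector, instead of forcing the model to learn which 3-byte sequences form valid characters.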
Result
- Faster convergence during training
- Faster inference (fewer tokens to predict)
- Better token efficiency
- Better accuracy for the same small model size
- Cleaner outputs with no incomplete UTF-8 artifacts
Note: Timestamp prediction is not supported in Whisper-SAM. The original Whisper timestamp tokens were not included in the custom tokenizer, so the model cannot generate or infer timestamp outputs.
Credit: If you use Whisper-SAM, credit to Nekodimos would be appreciated.
Model tree for Nekodimos/Whisper-SAM
Base model: openai/whisper-small