Amharic Whisper (Small) – Custom Tokenizer
Whisper-SAM (Whisper Small Amharic) is a Whisper Small model fine-tuned on ~210 hours of Amharic speech, using a custom character-level tokenizer to fully utilize the 448-token context window and eliminate byte-level inefficiencies.
The Problem
Whisper’s default tokenizer is byte-level, so:
- English: 1 letter ≈ 1 byte
- Amharic: 1 letter ≈ 3 bytes
That means for Amharic you end up predicting multiple byte tokens per character, hitting the decoder’s limits faster, and often getting broken byte artifacts like � (the Unicode replacement character that appears when UTF-8 decoding fails).
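The inflation is easy to verify: every character in the Ethiopic Unicode block occupies three bytes in UTF-8, so a byte-level tokenizer sees at least three units per Amharic letter.

```python
# UTF-8 byte counts: Amharic inflates 3x relative to ASCII.
english = "selam"  # 5 ASCII letters
amharic = "ሰላም"    # 3 Ethiopic letters ("selam" = hello)

print(len(english), len(english.encode("utf-8")))  # 5 characters -> 5 bytes
print(len(amharic), len(amharic.encode("utf-8")))  # 3 characters -> 9 bytes
```

Truncate one of those 9 bytes mid-character and the decoded text contains exactly the � artifact described above.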
Also, Whisper’s decoder has a maximum text length of 448 tokens, so truncation kicks in early when every character costs 3 tokens instead of 1.
Finally, Whisper’s encoder always consumes a fixed 30-second audio chunk, represented as an 80-channel log-mel spectrogram — so everything spoken in those 30 seconds has to fit inside that 448-token text budget.
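A back-of-envelope calculation shows how much of the budget byte-level tokenization eats (special tokens are ignored here for simplicity):

```python
# Rough character budget of Whisper's 448-token decoder window.
MAX_TOKENS = 448
BYTES_PER_AMHARIC_CHAR = 3  # UTF-8 width of Ethiopic characters

byte_level_budget = MAX_TOKENS // BYTES_PER_AMHARIC_CHAR
char_level_budget = MAX_TOKENS  # with a character-level tokenizer

print(byte_level_budget)  # ~149 Amharic characters fit
print(char_level_budget)  # 448 Amharic characters fit
```

Roughly 149 characters is not much for 30 seconds of speech, which is why long utterances get cut off with the default tokenizer.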
What Changed
- Character tokenizer: 1 Amharic character = 1 token
- The 448-token limit now corresponds to 448 real Amharic characters
- No artificial byte inflation
- No broken � tokens from partial UTF-8 sequences
- Fewer tokens to generate per utterance
This means the decoder isn’t busy spamming byte tokens and can use its full context window (448 text-token capacity) effectively.
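A character-level tokenizer of this kind can be sketched in a few lines. Note that the vocabulary construction and special tokens below are illustrative assumptions, not the actual Whisper-SAM vocabulary:

```python
# Minimal character-level tokenizer sketch (illustrative only; the real
# Whisper-SAM vocabulary and special tokens may differ).
ETHIOPIC = [chr(cp) for cp in range(0x1200, 0x1380)]  # Ethiopic Unicode block
SPECIALS = ["<|startoftranscript|>", "<|am|>", "<|transcribe|>", "<|endoftext|>"]

vocab = {tok: i for i, tok in enumerate(SPECIALS + ETHIOPIC + [" "])}
inv_vocab = {i: tok for tok, i in vocab.items()}

def encode(text: str) -> list[int]:
    """One Amharic character -> one token id."""
    return [vocab[ch] for ch in text]

def decode(ids: list[int]) -> str:
    return "".join(inv_vocab[i] for i in ids)

ids = encode("ሰላም")
print(len(ids))     # 3 tokens for 3 characters (byte-level would need 9)
print(decode(ids))  # round-trips back to "ሰላም"
```

The decoder’s embedding table then maps each Amharic character to a single learned vector, instead of forcing the model to learn which 3-byte sequences form valid characters.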
Result
- Faster convergence during training
- Faster inference (fewer tokens to predict)
- Better token efficiency
- Better accuracy for the same small model size
- Cleaner outputs with no incomplete UTF-8 artifacts
Note: Timestamp prediction is not supported in Whisper-SAM. The original Whisper timestamp tokens were not included in the custom tokenizer, so the model cannot generate or infer timestamp outputs.
Credit: If you use Whisper-SAM, credit to Nekodimos would be appreciated.
Model tree for Nekodimos/Whisper-SAM
Base model: openai/whisper-small