
Amharic Whisper (Small) – Custom Tokenizer

Whisper-SAM (Whisper Small Amharic) is a Whisper Small model fine-tuned on ~210 hours of Amharic speech, using a custom character-level tokenizer to fully utilize the 448-token context window and eliminate byte-level inefficiencies.


The Problem

Whisper’s default tokenizer is byte-level, so:

  • English: 1 letter ≈ 1 byte
  • Amharic: 1 letter ≈ 3 bytes
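The byte cost is easy to verify in Python: Amharic (Ge'ez script) code points sit in the U+1200–U+137F range, which UTF-8 encodes with three bytes each, so a byte-level tokenizer pays roughly 3× per character.

```python
# Byte cost per character under a byte-level scheme (illustrative):
# each UTF-8 byte becomes (at least) one token in the worst case.
for ch in ["a", "ሰ", "ላ", "ም"]:
    print(f"{ch!r}: {len(ch.encode('utf-8'))} byte(s)")

assert len("a".encode("utf-8")) == 1  # ASCII: 1 byte
assert len("ሰ".encode("utf-8")) == 3  # Ge'ez: 3 bytes
```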

That means for Amharic you end up predicting multiple byte tokens per character, hitting the decoder's limits faster, and often getting broken byte artifacts such as � (the Unicode replacement character, which appears when UTF-8 decoding fails on an incomplete byte sequence).

Also, Whisper’s decoder has a maximum text token length of ~448 tokens, so truncation happens early if every character takes 3 tokens instead of 1.
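The capacity arithmetic makes the cost concrete. Assuming ~3 byte tokens per Amharic character, the 448-token budget shrinks to roughly 149 usable characters:

```python
# Rough decoder-capacity arithmetic (assumes ~3 byte-tokens per Amharic char).
MAX_TOKENS = 448
BYTES_PER_AMHARIC_CHAR = 3

chars_byte_level = MAX_TOKENS // BYTES_PER_AMHARIC_CHAR  # ≈ 149 characters
chars_char_level = MAX_TOKENS                            # 448 characters
print(chars_byte_level, chars_char_level)  # 149 448
```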

Finally, Whisper processes audio in fixed 30-second chunks, converting each chunk into a log-mel spectrogram with 80 mel bins as its standard input.
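The standard front-end numbers (16 kHz sampling, 160-sample hop, 80 mel bins) pin down the input shape for a 30-second chunk:

```python
# Whisper's fixed audio front-end (standard values for the Small model).
SAMPLE_RATE = 16_000  # Hz
CHUNK_SECONDS = 30
N_MELS = 80
HOP_LENGTH = 160      # samples between spectrogram frames

n_samples = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per chunk
n_frames = n_samples // HOP_LENGTH       # 3,000 mel frames per chunk
print(f"spectrogram shape: ({N_MELS}, {n_frames})")  # (80, 3000)
```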


What Changed

  • Character tokenizer: 1 Amharic character = 1 token
  • The full 448-token limit now corresponds to 448 real Amharic characters
  • No artificial byte inflation
  • No broken tokens
  • Fewer tokens to generate per utterance

This means the decoder no longer spends its budget emitting byte fragments and can use its full 448-token context window effectively.
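A minimal sketch of the idea, assuming a vocabulary built from the training corpus plus a few special tokens (the class and token names here are hypothetical, not Whisper-SAM's actual tokenizer):

```python
# Minimal character-level tokenizer sketch: one Amharic character = one token ID.
class CharTokenizer:
    def __init__(self, corpus, specials=("<|sot|>", "<|eot|>", "<|pad|>")):
        # Special tokens first, then every distinct character in the corpus.
        self.vocab = {tok: i for i, tok in enumerate(specials)}
        for ch in sorted(set("".join(corpus))):
            self.vocab[ch] = len(self.vocab)
        self.inv = {i: t for t, i in self.vocab.items()}

    def encode(self, text):
        return [self.vocab[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.inv[i] for i in ids)


tok = CharTokenizer(["ሰላም ለዓለም"])
ids = tok.encode("ሰላም")
assert len(ids) == 3              # one token per character, not three bytes each
assert tok.decode(ids) == "ሰላም"   # lossless round-trip, no UTF-8 artifacts
```

Decoding is a straight table lookup per ID, so broken byte sequences (and the � artifacts they cause) are impossible by construction.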


Result

  • Faster convergence during training
  • Faster inference (fewer tokens to predict)
  • Better token efficiency
  • Better accuracy for the same small model size
  • Cleaner outputs with no incomplete UTF-8 artifacts

Note: Timestamp prediction is not supported in Whisper-SAM. The original Whisper timestamp tokens were not included in the custom tokenizer, so the model cannot generate or infer timestamp outputs.


Credit: If you use Whisper-SAM, credit to Nekodimos would be appreciated.

Model size: 0.2B parameters (F32, Safetensors)