msingiai/sauti-asr

This card describes the current Track A production candidate from the repo. The public-facing training-data summary omits restricted internal sources. That omission applies to the summary only: it does not mean the checkpoint was retrained from scratch on a clean, public-only data mix.

Model Summary

This model is a fine-tuned version of microsoft/paza-whisper-large-v3-turbo for Swahili automatic speech recognition. It comes from the Sauti ASR Track A pipeline in the sauti-asr repository.

Intended Release Type

  • Release profile: current-preview
  • Intended use: Public research and product evaluation

Evaluation Snapshot

The current repo Track A checkpoint was evaluated on 500 held-out Kenyan Swahili samples.

| Metric          | Value  |
|-----------------|--------|
| WER             | 13.72% |
| CER             | 3.88%  |
| Reference words | 10395  |
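
For readers who want to sanity-check the figures above, here is a minimal sketch of how corpus-level WER and CER are commonly computed: total edit distance divided by total reference length. The function names `edit_distance` and `error_rate` are illustrative and are not taken from the sauti-asr repository, and the card does not say which text normalization was applied before scoring, so this sketch applies none.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(pairs, unit="word"):
    """Corpus-level error rate: summed edits / summed reference units.

    pairs is an iterable of (reference, hypothesis) strings.
    unit is "word" for WER or "char" for CER.
    """
    edits = 0
    total = 0
    for ref, hyp in pairs:
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        edits += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return edits / total


# Toy example (not taken from the evaluation set): one substituted word.
pairs = [("habari ya asubuhi", "habari za asubuhi")]
wer = error_rate(pairs, unit="word")   # 1 edit over 3 reference words
cer = error_rate(pairs, unit="char")   # 1 edit over 17 reference characters
```

If the reported WER is corpus-level in this sense, 13.72% of 10395 reference words corresponds to roughly 1,426 word-level edits across the 500 samples.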

Training Data

The release flow in this repository tracks the following dataset mix:

| Dataset                   | License                    | Notes                         |
|---------------------------|----------------------------|-------------------------------|
| mozilla-common-voice      | Common Voice (CC0)         | Used in repo Track A pipeline |
| google-fleurs             | FLEURS (CC-BY-4.0)         | Used in repo Track A pipeline |
| alffa-swahili-news        | ALFFA / OpenSLR (MIT)      | Used in repo Track A pipeline |
| keystats-swahili-asr-data | KeyStats (Apache-2.0)      | Used in repo Track A pipeline |

Known Limitations

  • Performance is weaker on code-switched Swahili/English speech.
  • Named entities, abbreviations, and numbers remain difficult.
  • Long-form transcription should use chunking instead of a single-pass decode.
  • The checkpoint is useful for Swahili ASR evaluation and product prototyping, but the dataset list published with this card covers fewer sources than the full historical training mix in the repo.
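
To illustrate the chunking advice above, here is a small sketch of how a long recording can be split into overlapping windows before decoding. The helper name `chunk_windows` and the 5-second overlap are assumptions made for illustration. In practice the `transformers` pipeline performs its own chunking when `chunk_length_s` is set, as in the Usage section, so this only shows the idea and is not the pipeline's internal logic.

```python
def chunk_windows(duration_s, chunk_s=25.0, overlap_s=5.0):
    """Return (start, end) times in seconds that cover a recording.

    Consecutive windows share overlap_s seconds so that words cut at a
    boundary appear whole in at least one window.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    step = chunk_s - overlap_s
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the last window already reaches the end of the audio
        start += step
    return windows


# A 60-second recording with 25-second windows and 5 seconds of overlap.
windows = chunk_windows(60.0)
```

Each window would then be decoded separately and the overlapping transcripts merged. The merge step is the harder part and is left to the pipeline here.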

Usage

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "msingiai/sauti-asr"

# Use the first GPU with half precision when available, otherwise CPU with float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the fine-tuned Whisper checkpoint from the Hub.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=25,  # decode long audio in 25-second chunks
)

# Transcribe a local audio file and print the recognized text.
result = pipe("audio.wav")
print(result["text"])
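
When evaluating on more than one recording, a thin wrapper around the pipeline can keep transcripts keyed by file. The helper below is a hypothetical sketch and not part of the sauti-asr repository. It assumes only that `pipe` is a callable that takes a file path and returns a dict with a `"text"` key, which is how the pipeline is used in the snippet above.

```python
def transcribe_files(pipe, paths):
    """Run an ASR callable over several files.

    Returns a dict mapping each path (as a string) to its transcript
    with surrounding whitespace removed.
    """
    transcripts = {}
    for path in paths:
        result = pipe(str(path))
        transcripts[str(path)] = result["text"].strip()
    return transcripts
```

With the `pipe` object built above, a call such as `transcribe_files(pipe, ["clip1.wav", "clip2.wav"])` would return one transcript per file. The file names here are placeholders.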

Source Repository

The training, evaluation, and serving code lives in:

  • Msingi-AI/sauti-asr

Responsible Use

This model transcribes speech. Users are responsible for obtaining rights and consent for audio they process, especially for clinical, customer-support, or other sensitive recordings.

Model size: 0.8B params (Safetensors, tensor type F32)