msingiai/sauti-asr

This card describes the current Track A production candidate from the sauti-asr repository. The public training-data summary below omits restricted internal sources. That omission is a metadata choice only: the checkpoint was not retrained from scratch on a clean public-only dataset mix.

Model Summary

This model is a fine-tuned version of microsoft/paza-whisper-large-v3-turbo for Swahili automatic speech recognition. It comes from the Sauti ASR Track A pipeline in the sauti-asr repository.

Intended Release Type

Release profile: current-preview
Intended use: Public research and product evaluation

Evaluation Snapshot

The current repo Track A checkpoint was evaluated on 500 held-out Kenyan Swahili samples.

Metric	Value
WER	13.72%
CER	3.88%
Reference words	10395
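
For readers who want to reproduce the metrics on their own transcripts, the sketch below shows how corpus-level WER and CER are conventionally computed: total edit distance divided by total reference length. It is a minimal illustration, not the repository's evaluation script; the function names `edit_distance` and `corpus_error_rate` are hypothetical, and any text normalization the repo applies before scoring (casing, punctuation) is not reproduced here. As a sanity check on the table, 13.72% of 10395 reference words corresponds to roughly 1426 word-level errors.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_error_rate(references, hypotheses, unit="word"):
    """Total edits over total reference units; unit is 'word' (WER) or 'char' (CER)."""
    total_errors = 0
    total_units = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_errors += edit_distance(ref_units, hyp_units)
        total_units += len(ref_units)
    if total_units == 0:
        raise ValueError("references contain no units to score")
    return total_errors / total_units


refs = ["habari ya asubuhi"]
hyps = ["habari za asubuhi"]
wer = corpus_error_rate(refs, hyps, unit="word")
cer = corpus_error_rate(refs, hyps, unit="char")
print(f"WER={wer:.2%} CER={cer:.2%}")
```

Aggregating errors over the whole corpus, rather than averaging per-utterance rates, keeps short utterances from dominating the score.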

Training Data

The release flow in this repository tracks the following dataset mix:

Dataset	License	Notes
`mozilla-common-voice`	Common Voice (CC0)	Used in repo Track A pipeline
`google-fleurs`	FLEURS (CC-BY-4.0)	Used in repo Track A pipeline
`alffa-swahili-news`	ALFFA / OpenSLR (MIT)	Used in repo Track A pipeline
`keystats-swahili-asr-data`	KeyStats (Apache-2.0)	Used in repo Track A pipeline

Known Limitations

Performance is weaker on code-switched Swahili/English speech.
Named entities, abbreviations, and numbers remain difficult.
Long-form transcription should use chunking instead of a single-pass decode.
The checkpoint is useful for Swahili ASR evaluation and product prototyping, but the public dataset list on this card covers fewer sources than the full historical training mix in the repo.
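
To make the long-form limitation above concrete, here is a minimal sketch of how a long recording can be cut into overlapping windows before decoding. The helper name `chunk_spans` and the 5-second overlap are assumptions for illustration; the 16 kHz sample rate is the rate Whisper models expect, and the 25-second window matches the `chunk_length_s` value in the Usage section. The `transformers` pipeline performs its own windowing and merging internally, so this only shows the idea.

```python
def chunk_spans(num_samples, sample_rate=16000, chunk_s=25.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return spans


# A 60-second recording at 16 kHz yields three overlapping 25-second windows.
for start, end in chunk_spans(60 * 16000):
    print(start / 16000, end / 16000)
```

The overlap gives the decoder context on both sides of each boundary, which helps avoid words being cut in half where windows meet.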

Usage

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "msingiai/sauti-asr"
# Use the first GPU with half precision when available, otherwise CPU with float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the fine-tuned Whisper checkpoint from the Hub.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=25,  # decode long audio in 25-second chunks
)

# Transcribe a local audio file and print the recognized text.
result = pipe("audio.wav")
print(result["text"])
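
When the audio contains code-switched or very short utterances, it can help to pin the decoding language instead of relying on Whisper's automatic language detection. The wrapper below is a hedged sketch: `transcribe_swahili` is a hypothetical helper name, and it assumes this fine-tuned checkpoint keeps Whisper's multilingual language tokens, which the card does not state. `generate_kwargs` and `return_timestamps` are standard call arguments of the `transformers` ASR pipeline.

```python
def transcribe_swahili(asr_pipe, audio_path, with_timestamps=False):
    """Run an ASR pipeline with the language pinned to Swahili.

    asr_pipe is any callable with the transformers ASR pipeline call
    signature, such as the `pipe` object built in the snippet above.
    """
    result = asr_pipe(
        audio_path,
        # Assumption: the checkpoint still accepts Whisper language/task prompts.
        generate_kwargs={"language": "swahili", "task": "transcribe"},
        return_timestamps=with_timestamps,
    )
    return {
        "text": result["text"].strip(),
        # "chunks" is only present when timestamps were requested.
        "chunks": result.get("chunks", []),
    }
```

With the `pipe` object from the snippet above, `transcribe_swahili(pipe, "audio.wav", with_timestamps=True)` would return the transcript together with per-segment timestamps.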

Source Repository

The training, evaluation, and serving code lives in:

Msingi-AI/sauti-asr

Responsible Use

This model transcribes speech. Users are responsible for obtaining rights and consent for audio they process, especially for clinical, customer-support, or other sensitive recordings.


Model size: 0.8B params (Safetensors)
Tensor type: F32

Model tree for msingiai/sauti-asr

Base model: openai/whisper-large-v3
Fine-tuned: openai/whisper-large-v3-turbo
Fine-tuned: microsoft/paza-whisper-large-v3-turbo
Fine-tuned: msingiai/sauti-asr (this model)
