---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
language:
- sw
metrics:
- wer
- cer
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
---
# SALAMA-STT: Swahili Whisper ASR Model
**Developer:** AI4NNOV
**Version:** v1.0
**License:** Apache 2.0
**Model Type:** Automatic Speech Recognition (ASR)
**Base Model:** `openai/whisper-small` (fine-tuned for Swahili)
---
## Overview
**SALAMA-STT** (Speech-to-Text) is the **first module** of the **SALAMA Framework**, a modular end-to-end **speech-to-speech AI system** built for African languages.
This model is fine-tuned from OpenAI's **Whisper-small** architecture for **Swahili speech recognition**, improving performance on African accents and conversational data.
The model converts Swahili audio input into accurate transcriptions and serves as the entry point for downstream LLM and TTS modules.
---
## Model Architecture
SALAMA-STT uses the **Whisper-small** architecture, a **transformer encoder-decoder**, adapted for low-resource Swahili transcription.
The model was fine-tuned on the **Mozilla Common Voice 17.0 Swahili** dataset to improve robustness to diverse accents and recording conditions.
| Parameter | Value |
|------------|--------|
| Base Model | `openai/whisper-small` |
| Fine-Tuning | Full model fine-tuning (fp16 precision) |
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Batch Size | 16 |
| Epochs | 10 |
| Frameworks | Transformers + Datasets + TorchAudio |
| Languages | Swahili (`sw`), English (`en`) |
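The hyperparameters above map directly onto a standard `transformers` sequence-to-sequence fine-tuning setup. Below is a hedged configuration sketch, not the authors' actual training script: the output directory is a placeholder, and AdamW is simply the `Trainer` default optimizer.

```python
from transformers import Seq2SeqTrainingArguments

# Configuration mirroring the hyperparameter table above.
# AdamW is the default optimizer for the Hugging Face Trainer.
training_args = Seq2SeqTrainingArguments(
    output_dir="./salama-stt",       # placeholder path
    per_device_train_batch_size=16,  # Batch Size
    learning_rate=1e-5,              # Learning Rate
    num_train_epochs=10,             # Epochs
    fp16=True,                       # mixed-precision fine-tuning
    predict_with_generate=True,      # decode during evaluation for WER/CER
)
```

A `Seq2SeqTrainer` would then be constructed with these arguments plus the model, processor, and prepared datasets.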
---
## Dataset
| Dataset | Description | Purpose |
|----------|--------------|----------|
| `mozilla-foundation/common_voice_17_0` | 20 hours of Swahili speech data | Supervised fine-tuning |
| Custom local Swahili recordings | Conversational + accent-rich data | Accent robustness |
| Common Voice validation split | 2.3 hours | Evaluation |
---
## Model Capabilities
- Speech-to-text transcription in **Swahili**
- Recognition of **African-accented Swahili**
- Handles short and long-form audio
- Supports integration with **SALAMA-LLM** for full voice assistants
- Provides timestamped segment transcriptions
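The timestamped segment transcriptions mentioned above correspond to passing `return_timestamps=True` to the `transformers` ASR pipeline, which adds a `chunks` list to the result. A minimal sketch of formatting those segments; the sample result dict here is illustrative, not real model output:

```python
def format_segments(result):
    """Render pipeline output chunks as '[start - end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{start:.1f}s - {end:.1f}s] {chunk['text'].strip()}")
    return "\n".join(lines)

# Illustrative shape of pipeline(..., return_timestamps=True) output
sample = {
    "text": "Habari yako leo",
    "chunks": [
        {"timestamp": (0.0, 1.2), "text": " Habari"},
        {"timestamp": (1.2, 2.5), "text": " yako leo"},
    ],
}
print(format_segments(sample))
# [0.0s - 1.2s] Habari
# [1.2s - 2.5s] yako leo
```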
---
## Evaluation Metrics
| Metric | Baseline (Whisper-small) | Fine-tuned (SALAMA-STT) | Improvement |
|---------|---------------------------|---------------------------|--------------|
| **WER (Word Error Rate)** | 1.15 | **0.43** | ↓ 62% |
| **CER (Character Error Rate)** | 0.39 | **0.18** | ↓ 54% |
| **Accuracy** | 85.2% | **95.4%** | +10.2% |
> Evaluation conducted on a 2-hour held-out Swahili validation set from Common Voice.
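WER and CER are edit-distance ratios: the minimum number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length, computed over words and characters respectively. A self-contained sketch (in practice a library such as `jiwer` or `evaluate` is typically used; the sample sentences are made up for illustration):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted word out of three -> WER of 1/3
print(wer("habari ya asubuhi", "habari za asubuhi"))
```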
---
## Usage (Python Example)
Below is a quick example for Swahili speech transcription using this model:
```python
from transformers import pipeline

# Load the Swahili Whisper ASR pipeline
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    chunk_length_s=30,   # chunk long audio into 30 s windows
    device_map="auto",   # use GPU if available
)

# Example audio file (replace with your file)
audio_path = "swahili_audio_sample.wav"

# Transcribe audio
result = asr_pipeline(audio_path)
print("Transcription:")
print(result["text"])
```
**Example Output:**
> *"Karibu kwenye mfumo wa SALAMA unaosaidia kutambua na kuelewa sauti ya Kiswahili kwa usahihi mkubwa."*
> (English: "Welcome to the SALAMA system, which helps recognize and understand Swahili speech with high accuracy.")
---
## Model Performance Summary
| Dataset | Metric | Score |
|----------|---------|-------|
| Common Voice 17.0 (test) | WER | **0.43** |
| Common Voice 17.0 (test) | CER | **0.18** |
| Local Swahili Test Set | Accuracy | **95.4%** |
---
## Key Features
- **Accurate Swahili ASR** trained on diverse voices
- **Adapted for African speech variations and dialects**
- **Lightweight and compatible with SALAMA-LLM**
- **Handles long-form recordings (≥ 30 s)**
- **Fast inference optimized with FP16 precision**
---
## Limitations
- May misinterpret **code-mixed (Swahili-English)** speech
- Background noise and poor microphone quality reduce accuracy
- Domain-specific (medical/legal) terms may be transcribed inaccurately
- Performance may decline on **non-native Swahili speakers**
---
## Related Models
| Model | Description |
|--------|-------------|
| [`EYEDOL/salama-llm`](https://huggingface.co/EYEDOL/SALAMA_LLM) | Swahili instruction-tuned LLM for reasoning and dialogue |
| [`EYEDOL/salama-tts`](https://huggingface.co/EYEDOL/SALAMA_TTS) | Swahili text-to-speech (VITS) model for natural speech synthesis |