---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
language:
- sw
metrics:
- wer
- cer
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
---
# SALAMA-STT: Swahili Whisper ASR Model
**Developer:** AI4NNOV
**Version:** v1.0
**License:** Apache 2.0
**Model Type:** Automatic Speech Recognition (ASR)
**Base Model:** `openai/whisper-small` (fine-tuned for Swahili)
---
## Overview
**SALAMA-STT** (Speech-to-Text) is the **first module** of the **SALAMA Framework**, a modular end-to-end **speech-to-speech AI system** built for African languages.
This model is fine-tuned from OpenAI's **Whisper-small** architecture for **Swahili speech recognition**, improving performance on African accents and conversational data.
The model converts Swahili audio input into accurate transcriptions and serves as the entry point for downstream LLM and TTS modules.
---
## Model Architecture
SALAMA-STT uses the **Whisper-small** architecture, a **transformer encoder-decoder**, adapted for low-resource Swahili transcription.
The model was fine-tuned on the **Mozilla Common Voice 17.0 Swahili** dataset to improve robustness to diverse accents and recording conditions.
| Parameter | Value |
|------------|--------|
| Base Model | `openai/whisper-small` |
| Fine-Tuning | Full model fine-tuning (fp16 precision) |
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Batch Size | 16 |
| Epochs | 10 |
| Frameworks | Transformers + Datasets + TorchAudio |
| Languages | Swahili (`sw`), English (`en`) |
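The hyperparameters above map directly onto a standard `transformers` sequence-to-sequence fine-tuning setup. Below is a hedged configuration sketch, not the authors' actual training script: the output directory is a placeholder, and AdamW is simply the `Trainer` default optimizer.

```python
from transformers import Seq2SeqTrainingArguments

# Configuration mirroring the hyperparameter table above.
# AdamW is the default optimizer for the Hugging Face Trainer.
training_args = Seq2SeqTrainingArguments(
    output_dir="./salama-stt",       # placeholder path
    per_device_train_batch_size=16,  # Batch Size
    learning_rate=1e-5,              # Learning Rate
    num_train_epochs=10,             # Epochs
    fp16=True,                       # mixed-precision fine-tuning
    predict_with_generate=True,      # decode during evaluation for WER/CER
)
```

A `Seq2SeqTrainer` would then be constructed with these arguments plus the model, processor, and prepared datasets.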
---
## Dataset
| Dataset | Description | Purpose |
|----------|--------------|----------|
| `mozilla-foundation/common_voice_17_0` | 20 hours of Swahili speech data | Supervised fine-tuning |
| Custom local Swahili recordings | Conversational + accent-rich data | Accent robustness |
| Common Voice validation split | 2.3 hours | Evaluation |
---
## Model Capabilities
- Speech-to-text transcription in **Swahili**
- Recognition of **African-accented Swahili**
- Handles short and long-form audio
- Supports integration with **SALAMA-LLM** for full voice assistants
- Provides timestamped segment transcriptions
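The timestamped segment transcriptions mentioned above correspond to passing `return_timestamps=True` to the `transformers` ASR pipeline, which adds a `chunks` list to the result. A minimal sketch of formatting those segments; the sample result dict here is illustrative, not real model output:

```python
def format_segments(result):
    """Render pipeline output chunks as '[start - end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{start:.1f}s - {end:.1f}s] {chunk['text'].strip()}")
    return "\n".join(lines)

# Illustrative shape of pipeline(..., return_timestamps=True) output
sample = {
    "text": "Habari yako leo",
    "chunks": [
        {"timestamp": (0.0, 1.2), "text": " Habari"},
        {"timestamp": (1.2, 2.5), "text": " yako leo"},
    ],
}
print(format_segments(sample))
# [0.0s - 1.2s] Habari
# [1.2s - 2.5s] yako leo
```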
---
## Evaluation Metrics
| Metric | Baseline (Whisper-small) | Fine-tuned (SALAMA-STT) | Improvement |
|---------|---------------------------|---------------------------|--------------|
| **WER (Word Error Rate)** | 1.15 | **0.43** | ↓ 62% |
| **CER (Character Error Rate)** | 0.39 | **0.18** | ↓ 54% |
| **Accuracy** | 85.2% | **95.4%** | +10.2% |
> Evaluation conducted on a 2-hour held-out Swahili validation set from Common Voice.
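WER and CER are edit-distance ratios: the minimum number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length, computed over words and characters respectively. A self-contained sketch (in practice a library such as `jiwer` or `evaluate` is typically used; the sample sentences are made up for illustration):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted word out of three -> WER of 1/3
print(wer("habari ya asubuhi", "habari za asubuhi"))
```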
---
## Usage (Python Example)
Below is a quick example for Swahili speech transcription using this model:
```python
from transformers import pipeline

# Load the Swahili Whisper ASR pipeline
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    chunk_length_s=30,   # chunk long audio into 30 s windows
    device_map="auto",   # use GPU if available
)

# Example audio file (replace with your file)
audio_path = "swahili_audio_sample.wav"

# Transcribe audio
result = asr_pipeline(audio_path)
print("Transcription:")
print(result["text"])
```
**Example Output:**
> *"Karibu kwenye mfumo wa SALAMA unaosaidia kutambua na kuelewa sauti ya Kiswahili kwa usahihi mkubwa."*
> (English: "Welcome to the SALAMA system, which helps recognize and understand Swahili speech with high accuracy.")
---
## Model Performance Summary
| Dataset | Metric | Score |
|----------|---------|-------|
| Common Voice 17.0 (test) | WER | **0.43** |
| Common Voice 17.0 (test) | CER | **0.18** |
| Local Swahili Test Set | Accuracy | **95.4%** |
---
## Key Features
- **Accurate Swahili ASR** trained on diverse voices
- **Adapted for African speech variations and dialects**
- **Lightweight and compatible with SALAMA-LLM**
- **Handles long-form recordings (≥ 30 s)**
- **Fast inference optimized with FP16 precision**
---
## Limitations
- May misinterpret **code-mixed (Swahili-English)** speech
- Background noise and poor microphone quality reduce accuracy
- Domain-specific (medical/legal) terms may be transcribed inaccurately
- Performance may decline on **non-native Swahili speakers**
---
## Related Models
| Model | Description |
|--------|-------------|
| [`EYEDOL/salama-llm`](https://huggingface.co/EYEDOL/SALAMA_LLM) | Swahili instruction-tuned LLM for reasoning and dialogue |
| [`EYEDOL/salama-tts`](https://huggingface.co/EYEDOL/SALAMA_TTS) | Swahili text-to-speech (VITS) model for natural speech synthesis |