---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
language:
- sw
metrics:
- wer
- cer
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
---


# πŸ—£οΈ SALAMA-STT β€” Swahili Whisper ASR Model

**Developer:** AI4NNOV  
**Authors:** AI4NNOV  
**Version:** v1.0  
**License:** Apache 2.0  
**Model Type:** Automatic Speech Recognition (ASR)  
**Base Model:** `openai/whisper-small` (fine-tuned for Swahili)  

---
## 🌍 Overview

**SALAMA-STT** (Speech-to-Text) is the **first module** of the **SALAMA Framework** — a modular end-to-end **speech-to-speech AI system** built for African languages.  
This model fine-tunes OpenAI’s **Whisper-small** architecture for **Swahili speech recognition**, improving performance on African accents and conversational audio.  

The model converts Swahili audio input into accurate transcriptions and serves as the entry point for downstream LLM and TTS modules.

---

## 🧱 Model Architecture

SALAMA-STT uses the **Whisper-small** **transformer encoder-decoder** architecture, fine-tuned for low-resource Swahili transcription.  
The model was fine-tuned on the **Mozilla Common Voice 17.0 Swahili** dataset, improving robustness to diverse accents and recording conditions.

| Parameter | Value |
|------------|--------|
| Base Model | `openai/whisper-small` |
| Fine-Tuning | Full model fine-tuning (fp16 precision) |
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Batch Size | 16 |
| Epochs | 10 |
| Frameworks | Transformers + Datasets + TorchAudio |
| Languages | Swahili (`sw`), English (`en`) |
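
The hyperparameters in the table map naturally onto a `Seq2SeqTrainingArguments` configuration. The sketch below is an illustrative reconstruction, not the authors' exact training script; in particular, `output_dir` and `predict_with_generate` are assumptions (the default optimizer in `transformers` is AdamW, matching the table):

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical fine-tuning configuration mirroring the table above.
# output_dir and predict_with_generate are illustrative assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="./salama-stt-ft",    # assumption: any local path works
    learning_rate=1e-5,              # from the table
    per_device_train_batch_size=16,  # from the table
    num_train_epochs=10,             # from the table
    fp16=True,                       # fp16 precision (requires a CUDA GPU)
    predict_with_generate=True,      # decode text during eval for WER/CER
)
```

These arguments would then be passed to a `Seq2SeqTrainer` together with the model, processor, and prepared Common Voice splits.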

---
## 📚 Dataset

| Dataset | Description | Purpose |
|----------|--------------|----------|
| `mozilla-foundation/common_voice_17_0` | 20 hours of Swahili speech data | Supervised fine-tuning |
| Custom local Swahili recordings | Conversational + accent-rich data | Accent robustness |
| Common Voice validation split | 2.3 hours | Evaluation |
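
Whisper models expect 16 kHz mono input, while Common Voice ships 48 kHz clips, so a resampling step sits between the dataset and the model. A minimal numpy sketch using linear interpolation (a simplification; real pipelines would use `torchaudio` or `datasets`' `Audio(sampling_rate=16000)` cast, which also anti-alias):

```python
import numpy as np

def resample_linear(samples: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample a mono waveform by linear interpolation (sketch, not anti-aliased)."""
    duration = len(samples) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(samples), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, samples)

# One second of 48 kHz audio becomes 16,000 samples at 16 kHz
clip = np.zeros(48000)
print(len(resample_linear(clip, 48000)))  # 16000
```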

---
## 🧠 Model Capabilities

- Speech-to-text transcription in **Swahili**  
- Recognition of **African-accented Swahili**  
- Handles short and long-form audio  
- Supports integration with **SALAMA-LLM** for full voice assistants  
- Provides timestamped segment transcriptions  

---
## 📊 Evaluation Metrics

| Metric | Baseline (Whisper-small) | Fine-tuned (SALAMA-STT) | Improvement |
|---------|---------------------------|---------------------------|--------------|
| **WER (Word Error Rate)** | 1.15 | **0.43** | 🔻 62% |
| **CER (Character Error Rate)** | 0.39 | **0.18** | 🔻 54% |
| **Accuracy** | 85.2% | **95.4%** | +10.2% |

> Evaluation conducted on a 2-hour held-out Swahili validation set from Common Voice.
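
Both WER and CER are edit-distance metrics: the minimum number of insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the reference length (in words for WER, characters for CER). A self-contained sketch of how such scores are computed (in practice the `evaluate` or `jiwer` libraries are the usual tools):

```python
def edit_distance(ref, hyp):
    # Classic Levenshtein dynamic program over token sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    r, h = reference.split(), hypothesis.split()
    return edit_distance(r, h) / len(r)

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One substituted word out of three -> WER = 1/3
print(round(wer("habari ya asubuhi", "habari za asubuhi"), 3))  # 0.333
```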

---
## βš™οΈ Usage (Python Example)

Below is a quick example for Swahili speech transcription using this model:

```python
from transformers import pipeline

# Load the Swahili Whisper ASR pipeline
# (device_map="auto" needs the `accelerate` package; use device=0 for a single GPU)
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    chunk_length_s=30,  # split long recordings into 30-second windows
    device_map="auto",
)

# Example audio file (replace with your file; 16 kHz mono WAV works best)
audio_path = "swahili_audio_sample.wav"

# Transcribe audio; return_timestamps=True adds segment-level timestamps
result = asr_pipeline(audio_path, return_timestamps=True)

print("🗣️ Transcription:")
print(result["text"])

# Timestamped segments are available under result["chunks"]
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```

**Example Output:**

> *“Karibu kwenye mfumo wa SALAMA unaosaidia kutambua na kuelewa sauti ya Kiswahili kwa usahihi mkubwa.”*  
> (English: “Welcome to the SALAMA system, which helps recognize and understand Swahili speech with high accuracy.”)

---
## πŸ” Model Performance Summary

| Dataset | Metric | Score |
|----------|---------|-------|
| Common Voice 17.0 (test) | WER | **0.43** |
| Common Voice 17.0 (test) | CER | **0.18** |
| Local Swahili Test Set | Accuracy | **95.4%** |

---
## ⚡ Key Features

- πŸŽ™οΈ **Accurate Swahili ASR** trained on diverse voices  
- 🌍 **Adapted for African speech variations and dialects**  
- 🧩 **Lightweight and compatible with SALAMA-LLM**  
- πŸ”Š **Handles long-form recordings (β‰₯30s)**  
- πŸš€ **Fast inference optimized with FP16 precision**  
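
Long-form support comes from the chunked inference strategy shown in the usage example (`chunk_length_s=30`): the waveform is cut into fixed windows, each window is transcribed independently, and the texts are stitched back together. A minimal sketch of the windowing step, with a hypothetical helper name (the real `transformers` pipeline additionally overlaps chunks with a stride to avoid cutting words at boundaries):

```python
import numpy as np

def chunk_waveform(samples: np.ndarray, sr: int = 16000, chunk_length_s: int = 30):
    """Split a long waveform into fixed-length windows (no overlap; sketch only)."""
    step = sr * chunk_length_s
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# A 70 s recording at 16 kHz yields three windows: 30 s, 30 s, and 10 s
audio = np.zeros(70 * 16000)
chunks = chunk_waveform(audio)
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 10.0]
```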

---
## 🚫 Limitations

- May misinterpret **code-mixed (Swahili-English)** speech  
- Background noise and poor microphone quality reduce accuracy  
- Domain-specific (medical/legal) terms may be transcribed inaccurately  
- Performance may decline on **non-native Swahili speakers**  

---
## 🔗 Related Models

| Model | Description |
|--------|-------------|
| [`EYEDOL/SALAMA_LLM`](https://huggingface.co/EYEDOL/SALAMA_LLM) | Swahili instruction-tuned LLM for reasoning and dialogue |
| [`EYEDOL/SALAMA_TTS`](https://huggingface.co/EYEDOL/SALAMA_TTS) | Swahili text-to-speech (VITS) model for natural speech synthesis |