---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
language:
- sw
metrics:
- wer
- cer
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
---
# 🗣️ SALAMA-STT: Swahili Whisper ASR Model
**Developer:** AI4NNOV
**Authors:** AI4NNOV
**Version:** v1.0
**License:** Apache 2.0
**Model Type:** Automatic Speech Recognition (ASR)
**Base Model:** `openai/whisper-small` (fine-tuned for Swahili)
---
## 🌍 Overview
**SALAMA-STT** (Speech-to-Text) is the **first module** of the **SALAMA Framework** β€” a modular end-to-end **speech-to-speech AI system** built for African languages.
This model is fine-tuned from OpenAI’s **Whisper-small** architecture for **Swahili speech recognition**, enhancing performance on African accents and conversational data.
The model converts Swahili audio input into accurate transcriptions and serves as the entry point for downstream LLM and TTS modules.
---
## 🧱 Model Architecture
SALAMA-STT leverages the **Whisper-small** architecture with a **transformer encoder-decoder** optimized for low-resource Swahili audio transcription tasks.
The model was fine-tuned on the **Mozilla Common Voice 17.0 Swahili** dataset, which exposes it to a range of speaker accents and recording quality.
| Parameter | Value |
|------------|--------|
| Base Model | `openai/whisper-small` |
| Fine-Tuning | Full model fine-tuning (fp16 precision) |
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Batch Size | 16 |
| Epochs | 10 |
| Frameworks | Transformers + Datasets + TorchAudio |
| Languages | Swahili (`sw`), English (`en`) |
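
The hyperparameters in the table could be expressed as keyword arguments for `Seq2SeqTrainingArguments` from Transformers. This is a minimal sketch, not the authors' training script: the output directory is hypothetical, and reading "Batch Size 16" as a per-device batch size and "AdamW" as the `adamw_torch` optimizer are assumptions.

```python
# Hyperparameters from the table above, keyed by Seq2SeqTrainingArguments
# argument names. Values marked "assumed" are not stated in the model card.
TRAINING_CONFIG = {
    "output_dir": "./salama-stt-checkpoints",  # hypothetical path
    "per_device_train_batch_size": 16,         # assumed: table lists "Batch Size 16"
    "learning_rate": 1e-5,
    "num_train_epochs": 10,
    "fp16": True,                              # fp16 training needs a supported GPU
    "optim": "adamw_torch",                    # assumed mapping of "AdamW"
}


def build_training_args(config=None):
    """Build Seq2SeqTrainingArguments from the config dictionary."""
    # Imported lazily so the config can be inspected without Transformers installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**(config or TRAINING_CONFIG))
```

Calling `build_training_args()` on a GPU machine would return an object that can be passed to `Seq2SeqTrainer` together with the model, datasets, and a data collator.
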
---
## 📚 Dataset
| Dataset | Description | Purpose |
|----------|--------------|----------|
| `mozilla-foundation/common_voice_17_0` | 20 hours of Swahili speech data | Supervised fine-tuning |
| Custom local Swahili recordings | Conversational + accent-rich data | Accent robustness |
| Common Voice validation split | 2.3 hours | Evaluation |
---
## 🧠 Model Capabilities
- Speech-to-text transcription in **Swahili**
- Recognition of **African-accented Swahili**
- Handles short and long-form audio
- Supports integration with **SALAMA-LLM** for full voice assistants
- Provides timestamped segment transcriptions
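
Timestamped segments can be requested from the Transformers ASR pipeline with `return_timestamps=True`, which adds a `chunks` list to the result. The sketch below assumes the model ID used elsewhere in this card; `transcribe_with_timestamps` and `format_segments` are hypothetical helper names introduced for illustration.

```python
def transcribe_with_timestamps(audio_path, model_id="EYEDOL/salama-stt"):
    """Run the ASR pipeline and return text plus timestamped chunks."""
    # Imported lazily so format_segments can be used without Transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,
    )
    # The result has the form:
    # {"text": ..., "chunks": [{"timestamp": (start, end), "text": ...}, ...]}
    return asr(audio_path, return_timestamps=True)


def format_segments(result):
    """Turn the pipeline's chunk list into readable '[start - end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end time of the final chunk can be None.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_label}] {chunk['text'].strip()}")
    return lines
```

For example, `print("\n".join(format_segments(transcribe_with_timestamps("swahili_audio_sample.wav"))))` would print one line per segment.
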
---
## 📊 Evaluation Metrics
| Metric | Baseline (Whisper-small) | Fine-tuned (SALAMA-STT) | Improvement |
|---------|---------------------------|---------------------------|--------------|
| **WER (Word Error Rate)** | 1.15 | **0.43** | 🔻 62% |
| **CER (Character Error Rate)** | 0.39 | **0.18** | 🔻 54% |
| **Accuracy** | 85.2% | **95.4%** | +10.2 pp |
> Evaluation conducted on the held-out Swahili validation split from Common Voice (about 2.3 hours; see the Dataset table).
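
For reference, WER and CER are edit distances normalized by the reference length, at the word and character level respectively. The sketch below is a plain-Python illustration of those definitions, not the evaluation script used for this model; text normalization (casing, punctuation) is left out and would affect real scores. The `relative_reduction` helper shows how an improvement percentage can be derived from two scores.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(baseline, finetuned):
    """Relative error reduction, e.g. 0.5 means the error was halved."""
    return (baseline - finetuned) / baseline


print(wer("karibu kwenye mfumo wa salama", "karibu kwenye mfumo salama"))  # 0.2
print(wer("a", "b c d"))  # 3.0: insertions can push WER above 1
print(round(relative_reduction(1.15, 0.43), 3))
print(round(relative_reduction(0.39, 0.18), 3))
```

Because insertions count as errors, WER can exceed 1.0, which is how a baseline WER of 1.15 is possible. Applied to the rounded scores in the table, `relative_reduction` gives roughly 0.63 for WER and 0.54 for CER, close to the reported 62% and 54%.
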
---
## ⚙️ Usage (Python Example)
Below is a quick example for Swahili speech transcription using this model:
```python
from transformers import pipeline

# Load the Swahili Whisper ASR pipeline.
# device_map="auto" requires the `accelerate` package.
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    chunk_length_s=30,  # split long recordings into 30-second chunks
    device_map="auto",
)

# Example audio file (replace with your own file)
audio_path = "swahili_audio_sample.wav"

# Transcribe the audio and print the text
result = asr_pipeline(audio_path)
print("🗣️ Transcription:")
print(result["text"])
```
**Example Output:**
> *“Karibu kwenye mfumo wa SALAMA unaosaidia kutambua na kuelewa sauti ya Kiswahili kwa usahihi mkubwa.”*
---
## 🔍 Model Performance Summary
| Dataset | Metric | Score |
|----------|---------|-------|
| Common Voice 17.0 (test) | WER | **0.43** |
| Common Voice 17.0 (test) | CER | **0.18** |
| Local Swahili Test Set | Accuracy | **95.4%** |
---
## ⚑ Key Features
- πŸŽ™οΈ **Accurate Swahili ASR** trained on diverse voices
- 🌍 **Adapted for African speech variations and dialects**
- 🧩 **Lightweight and compatible with SALAMA-LLM**
- πŸ”Š **Handles long-form recordings (β‰₯30s)**
- πŸš€ **Fast inference optimized with FP16 precision**
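
Long-form recordings are handled in the usage example through `chunk_length_s=30`, which splits the audio into overlapping windows that are transcribed separately and then merged. The sketch below is a simplified illustration of how such windows can be laid out, not the library's actual implementation; the 5-second stride on each side is an assumed value and `chunk_windows` is a hypothetical helper.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) windows in seconds covering an audio clip.

    Consecutive windows overlap by 2 * stride_s so that words cut at a
    window boundary appear whole in at least one window.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if chunk_length_s <= 2 * stride_s:
        raise ValueError("chunk_length_s must be larger than twice stride_s")

    step = chunk_length_s - 2 * stride_s  # how far each new window advances
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


print(chunk_windows(70.0))  # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

A 70-second clip is covered by three 30-second windows, each sharing 10 seconds with its neighbor; clips shorter than 30 seconds fit in a single window.
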
---
## 🚫 Limitations
- May misinterpret **code-mixed (Swahili-English)** speech
- Background noise and poor microphone quality reduce accuracy
- Domain-specific (medical/legal) terms may be transcribed inaccurately
- Performance may decline on **non-native Swahili speakers**
---
## 🔗 Related Models
| Model | Description |
|--------|-------------|
| [`EYEDOL/salama-llm`](https://huggingface.co/EYEDOL/SALAMA_LLM) | Swahili instruction-tuned LLM for reasoning and dialogue |
| [`EYEDOL/salama-tts`](https://huggingface.co/EYEDOL/SALAMA_TTS) | Swahili text-to-speech (VITS) model for natural speech synthesis |