---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
language:
- sw
metrics:
- wer
- cer
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
---

# SALAMA-STT: Swahili Whisper ASR Model

- **Developer:** AI4NNOV
- **Authors:** AI4NNOV
- **Version:** v1.0
- **License:** Apache 2.0
- **Model Type:** Automatic Speech Recognition (ASR)
- **Base Model:** `openai/whisper-small` (fine-tuned for Swahili)


---

## Overview

**SALAMA-STT** (Speech-to-Text) is the **first module** of the **SALAMA Framework**, a modular end-to-end **speech-to-speech AI system** built for African languages.
It is fine-tuned from OpenAI's **Whisper-small** model for **Swahili speech recognition**, with the goal of improving recognition of African-accented and conversational speech.

The model converts Swahili audio into text and serves as the entry point for the downstream LLM and TTS modules.


---

## Model Architecture

SALAMA-STT uses the **Whisper-small** **transformer encoder-decoder** architecture, adapted for low-resource Swahili transcription.
The model was fine-tuned on the **Mozilla Common Voice 17.0 Swahili** dataset to improve robustness to diverse accents and varying speech clarity.

| Parameter | Value |
|------------|--------|
| Base Model | `openai/whisper-small` |
| Fine-Tuning | Full model fine-tuning (fp16 precision) |
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Batch Size | 16 |
| Epochs | 10 |
| Frameworks | Transformers + Datasets + TorchAudio |
| Languages | Swahili (`sw`), English (`en`) |

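The hyperparameters in the table above could map onto a Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. This is a minimal sketch, not the authors' training script: the `HYPERPARAMS` dict, the helper names, and the `output_dir` path are hypothetical, and the batch size of 16 is assumed to be per device.

```python
# Values copied from the hyperparameter table; everything else is an assumption.
HYPERPARAMS = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,  # assumes the table's batch size is per device
    "num_train_epochs": 10,
    "fp16": True,            # mixed precision; needs a CUDA-capable GPU at training time
    "optim": "adamw_torch",  # AdamW
}


def build_training_args(output_dir="./salama-stt-finetune"):
    """Build training arguments from the table values (hypothetical helper)."""
    # Imported lazily so the dict above can be inspected without transformers installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **HYPERPARAMS)


def steps_per_epoch(num_examples, batch_size=HYPERPARAMS["per_device_train_batch_size"]):
    """Optimizer steps per epoch on one device, without gradient accumulation."""
    return -(-num_examples // batch_size)  # ceiling division
```

`steps_per_epoch` is only a convenience for estimating run length from the number of training clips; the actual schedule depends on the number of devices and on any gradient accumulation, neither of which is stated in this card.
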

---

## Dataset

| Dataset | Description | Purpose |
|----------|--------------|----------|
| `mozilla-foundation/common_voice_17_0` | 20 hours of Swahili speech data | Supervised fine-tuning |
| Custom local Swahili recordings | Conversational + accent-rich data | Accent robustness |
| Common Voice validation split | 2.3 hours | Evaluation |


---

## Model Capabilities

- Speech-to-text transcription in **Swahili**
- Recognition of **African-accented Swahili**
- Handles short- and long-form audio
- Supports integration with **SALAMA-LLM** for full voice assistants
- Provides timestamped segment transcriptions

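Timestamped segments, as listed above, can typically be requested from the Transformers ASR pipeline with `return_timestamps=True`. The sketch below assumes the published checkpoint's generation config supports Whisper timestamp tokens; the helper names `transcribe_with_timestamps` and `format_chunks` are illustrative and not part of this model's API.

```python
def transcribe_with_timestamps(audio_path, model_id="EYEDOL/salama-stt"):
    """Return a list of {"timestamp": (start, end), "text": ...} segments."""
    # Imported lazily so format_chunks can be used without loading the model.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # split long recordings into 30-second windows
    )
    return asr(audio_path, return_timestamps=True)["chunks"]


def _fmt(seconds):
    # Whisper may leave the final end time undefined (None).
    return "?" if seconds is None else f"{seconds:.2f}s"


def format_chunks(chunks):
    """Render pipeline segments as '[start -> end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        lines.append(f"[{_fmt(start)} -> {_fmt(end)}] {chunk['text'].strip()}")
    return lines
```

For example, `format_chunks([{"timestamp": (0.0, 2.5), "text": " Habari"}])` returns `["[0.00s -> 2.50s] Habari"]`. The formatting step is kept separate from the model call so it can be reused on stored pipeline outputs.
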

---

## Evaluation Metrics

| Metric | Baseline (Whisper-small) | Fine-tuned (SALAMA-STT) | Improvement |
|---------|---------------------------|---------------------------|--------------|
| **WER (Word Error Rate)** | 1.15 | **0.43** | 62% lower (relative) |
| **CER (Character Error Rate)** | 0.39 | **0.18** | 54% lower (relative) |
| **Accuracy** | 85.2% | **95.4%** | +10.2 percentage points |

> Evaluation was conducted on an approximately 2-hour held-out Swahili validation set from Common Voice.

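For readers who want to reproduce metrics of this kind, the sketch below shows how WER and CER are conventionally defined: the edit distance between reference and hypothesis, divided by the reference length in words or characters. It is a plain-Python illustration under that standard definition, not the evaluation script used for this model; text normalization (casing, punctuation) is omitted and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(baseline, finetuned):
    """Fractional drop in an error rate relative to the baseline."""
    return (baseline - finetuned) / baseline
```

Because insertions count as errors, WER can exceed 1.0 when a hypothesis contains many extra words, which is how a baseline value such as 1.15 can arise. Applied to the rounded table values, `relative_reduction(1.15, 0.43)` is about 0.63 and `relative_reduction(0.39, 0.18)` is about 0.54.
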

---

## Usage (Python Example)

Below is a quick example for Swahili speech transcription using this model:

```python
from transformers import pipeline

# Load the Swahili Whisper ASR pipeline.
# device_map="auto" requires the `accelerate` package to be installed.
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    chunk_length_s=30,  # process long recordings in 30-second windows
    device_map="auto",
)

# Example audio file (replace with your file)
audio_path = "swahili_audio_sample.wav"

# Transcribe audio
result = asr_pipeline(audio_path)

print("Transcription:")
print(result["text"])
```

**Example Output:**

> *"Karibu kwenye mfumo wa SALAMA unaosaidia kutambua na kuelewa sauti ya Kiswahili kwa usahihi mkubwa."*
>
> (English: "Welcome to the SALAMA system, which helps recognize and understand Swahili speech with great accuracy.")


---

## Model Performance Summary

| Dataset | Metric | Score |
|----------|---------|-------|
| Common Voice 17.0 (test) | WER | **0.43** |
| Common Voice 17.0 (test) | CER | **0.18** |
| Local Swahili Test Set | Accuracy | **95.4%** |


---

## Key Features

- **Accurate Swahili ASR** trained on diverse voices
- **Adapted for African speech variations and dialects**
- **Lightweight and compatible with SALAMA-LLM**
- **Handles long-form recordings (≥30 s)**
- **Fast inference optimized with FP16 precision**


---

## Limitations

- May misinterpret **code-mixed (Swahili-English)** speech
- Background noise and poor microphone quality reduce accuracy
- Domain-specific (medical/legal) terms may be transcribed inaccurately
- Performance may decline for **non-native Swahili speakers**


---

## Related Models

| Model | Description |
|--------|-------------|
| [`EYEDOL/salama-llm`](https://huggingface.co/EYEDOL/SALAMA_LLM) | Swahili instruction-tuned LLM for reasoning and dialogue |
| [`EYEDOL/salama-tts`](https://huggingface.co/EYEDOL/SALAMA_TTS) | Swahili text-to-speech (VITS) model for natural speech synthesis |