EYEDOL/SALAMA_SM_ASR

 base_model:
 - openai/whisper-small
 pipeline_tag: automatic-speech-recognition
+---
+# 🗣️ SALAMA-STT — Swahili Whisper ASR Model
+**Developer:** DressMatic AI Labs / EYEDOL Research
+**Authors:** Israel Adegoke et al.
+**Version:** v1.0
+**License:** Apache 2.0
+**Model Type:** Automatic Speech Recognition (ASR)
+**Base Model:** `openai/whisper-small` (fine-tuned for Swahili)
+---
+## 🌍 Overview
+**SALAMA-STT** (Speech-to-Text) is the **first module** of the **SALAMA Framework** — a modular end-to-end **speech-to-speech AI system** built for African languages.
+This model is fine-tuned from OpenAI’s **Whisper-small** architecture for **Swahili speech recognition**, with the aim of improving recognition of African accents and conversational speech.
+The model converts Swahili audio input into accurate transcriptions and serves as the entry point for downstream LLM and TTS modules.
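
The hand-off to the downstream modules can be pictured with a minimal sketch. None of the names below are part of the SALAMA framework: `SpeechToSpeechPipeline`, the three stage types, and the stub stages are all hypothetical, and they only illustrate the STT -> LLM -> TTS ordering described above.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions, not SALAMA APIs).
Transcriber = Callable[[str], str]    # audio path -> Swahili transcript
Responder = Callable[[str], str]      # transcript -> reply text
Synthesizer = Callable[[str], bytes]  # reply text -> audio bytes


@dataclass
class SpeechToSpeechPipeline:
    """Chains STT -> LLM -> TTS; each stage can be swapped independently."""

    transcribe: Transcriber
    respond: Responder
    synthesize: Synthesizer

    def run(self, audio_path: str) -> bytes:
        transcript = self.transcribe(audio_path)  # STT is the entry point
        reply = self.respond(transcript)
        return self.synthesize(reply)


# Stub stages so the sketch runs without downloading any model.
demo = SpeechToSpeechPipeline(
    transcribe=lambda path: "habari yako",
    respond=lambda text: f"Umesema: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
audio_out = demo.run("swahili_audio_sample.wav")
```

In a real deployment the `transcribe` stage would presumably wrap the ASR pipeline shown in the usage section, while the other two stages would wrap the related LLM and TTS models.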
+---
+## 🧱 Model Architecture
+SALAMA-STT uses the **Whisper-small** **transformer encoder-decoder** architecture, fine-tuned for low-resource Swahili audio transcription.
+The model was fine-tuned on the **Mozilla Common Voice 17.0 Swahili** dataset to improve robustness across diverse accents and varying recording clarity.
+| Parameter | Value |
+|------------|--------|
+| Base Model | `openai/whisper-small` |
+| Fine-Tuning | Full model fine-tuning (fp16 precision) |
+| Optimizer | AdamW |
+| Learning Rate | 1e-5 |
+| Batch Size | 16 |
+| Epochs | 10 |
+| Frameworks | Transformers + Datasets + TorchAudio |
+| Languages | Swahili (`sw`), English (`en`) |
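
As a rough sketch, the hyperparameters in the table could be expressed as keyword arguments for `Seq2SeqTrainingArguments` from Transformers. The original training script is not shown here, so the output path, the mapping of "Batch Size" to a per-device value, and the choice of the `adamw_torch` optimizer variant are assumptions.

```python
# Values copied from the table above; key names follow the
# Seq2SeqTrainingArguments fields in Hugging Face Transformers.
training_kwargs = {
    "output_dir": "./salama-stt-checkpoints",  # hypothetical path
    "per_device_train_batch_size": 16,         # assumes 16 is per device
    "learning_rate": 1e-5,
    "num_train_epochs": 10,
    "fp16": True,                              # mixed precision, needs a GPU
    "optim": "adamw_torch",                    # assumed AdamW variant
}

# With transformers installed and a GPU available, these could be used as:
#   from transformers import Seq2SeqTrainingArguments
#   args = Seq2SeqTrainingArguments(**training_kwargs)
```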
+---
+## 📚 Dataset
+| Dataset | Description | Purpose |
+|----------|--------------|----------|
+| `mozilla-foundation/common_voice_17_0` | 20 hours of Swahili speech data | Supervised fine-tuning |
+| Custom local Swahili recordings | Conversational + accent-rich data | Accent robustness |
+| Common Voice validation split | 2.3 hours | Evaluation |
+---
+## 🧠 Model Capabilities
+- Speech-to-text transcription in **Swahili**
+- Recognition of **African-accented Swahili**
+- Handles short and long-form audio
+- Supports integration with **SALAMA-LLM** for full voice assistants
+- Provides timestamped segment transcriptions
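
For timestamped segments, the Transformers ASR pipeline accepts `return_timestamps=True`, in which case the result carries a `"chunks"` list whose items hold a `"timestamp"` tuple and a `"text"` string. The sketch below only formats such a list; the helper names and the sample data are illustrative, and timestamp quality for this particular checkpoint has not been verified here.

```python
from typing import Optional


def format_timestamp(seconds: Optional[float]) -> str:
    """Render seconds as MM:SS.ss; a missing boundary is shown as '??:??'."""
    if seconds is None:
        return "??:??"
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:05.2f}"


def format_segments(chunks: list[dict]) -> list[str]:
    """Turn pipeline chunks into '[start - end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_timestamp(start)} - {format_timestamp(end)}] {text}")
    return lines


# Made-up data shaped like result["chunks"] from
#   result = asr_pipeline(audio_path, return_timestamps=True)
sample_chunks = [
    {"timestamp": (0.0, 4.5), "text": " Karibu kwenye mfumo wa SALAMA."},
    {"timestamp": (4.5, None), "text": " Asante."},
]
formatted = format_segments(sample_chunks)
```

The `None` case is handled because the final chunk's end time can be missing when the audio is cut off mid-segment.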
+---
+## 📊 Evaluation Metrics
+| Metric | Baseline (Whisper-small) | Fine-tuned (SALAMA-STT) | Improvement |
+|---------|---------------------------|---------------------------|--------------|
+| **WER (Word Error Rate)** | 1.15 | **0.43** | 🔻 62% |
+| **CER (Character Error Rate)** | 0.39 | **0.18** | 🔻 54% |
+| **Accuracy** | 85.2% | **95.4%** | +10.2 pp |
+> Evaluation was conducted on a held-out Swahili validation set of roughly 2 hours from Common Voice.
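
To make the metrics concrete, here is a minimal pure-Python sketch of WER and CER as edit distance divided by reference length, plus the relative reduction used in the "Improvement" column. It is not the evaluation script used for this model (which is not shown), and it applies no text normalization. Because insertions count as errors, WER can exceed 1, which is consistent with the 1.15 baseline being a ratio rather than a percentage.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(baseline: float, finetuned: float) -> float:
    """Fractional error reduction relative to the baseline."""
    return (baseline - finetuned) / baseline


# One substitution (kwenye -> kwenya) and one deletion (wa) over 5 words.
example_wer = wer("karibu kwenye mfumo wa salama", "karibu kwenya mfumo salama")

wer_gain = round(relative_reduction(1.15, 0.43) * 100, 1)  # table values
cer_gain = round(relative_reduction(0.39, 0.18) * 100, 1)
```

With the table's figures this gives about 62.6% for WER and 53.8% for CER, in line with the rounded values reported above.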
+---
+## ⚙️ Usage (Python Example)
+Below is a quick example for Swahili speech transcription using this model:
+```python
+from transformers import pipeline
+# Load Swahili Whisper ASR
+asr_pipeline = pipeline(
+    "automatic-speech-recognition",
+    model="EYEDOL/salama-stt",
+    chunk_length_s=30,
+    device_map="auto"  # requires the accelerate package
+)
+# Example audio file (replace with your file)
+audio_path = "swahili_audio_sample.wav"
+# Transcribe audio
+result = asr_pipeline(audio_path)
+print("🗣️ Transcription:")
+print(result["text"])
+```
+**Example Output:**
+> *“Karibu kwenye mfumo wa SALAMA unaosaidia kutambua na kuelewa sauti ya Kiswahili kwa usahihi mkubwa.”*
+---
+## 🔍 Model Performance Summary
+| Dataset | Metric | Score |
+|----------|---------|-------|
+| Common Voice 17.0 (test) | WER | **0.43** |
+| Common Voice 17.0 (test) | CER | **0.18** |
+| Local Swahili Test Set | Accuracy | **95.4%** |
+---
+## ⚡ Key Features
+- 🎙️ **Accurate Swahili ASR** trained on diverse voices
+- 🌍 **Adapted for African speech variations and dialects**
+- 🧩 **Lightweight and compatible with SALAMA-LLM**
+- 🔊 **Handles long-form recordings (≥30s)**
+- 🚀 **Fast inference optimized with FP16 precision**
+---
+## 🚫 Limitations
+- May misinterpret **code-mixed (Swahili-English)** speech
+- Background noise and poor microphone quality reduce accuracy
+- Domain-specific (medical/legal) terms may be transcribed inaccurately
+- Performance may decline for **non-native Swahili speakers**
+---
+## 🔗 Related Models
+| Model | Description |
+|--------|-------------|
+| [`EYEDOL/salama-llm`](https://huggingface.co/EYEDOL/salama-llm) | Swahili instruction-tuned LLM for reasoning and dialogue |
+| [`EYEDOL/salama-tts`](https://huggingface.co/EYEDOL/salama-tts) | Swahili text-to-speech (VITS) model for natural speech synthesis |