---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
language:
- sw
metrics:
- wer
- cer
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
---

# 🗣️ SALAMA-STT: Swahili Whisper ASR Model

**Developer:** AI4NNOV  
**Authors:** AI4NNOV  
**Version:** v1.0  
**License:** Apache 2.0  
**Model Type:** Automatic Speech Recognition (ASR)  
**Base Model:** `openai/whisper-small` (fine-tuned for Swahili)

---

## 🌍 Overview

**SALAMA-STT** (Speech-to-Text) is the **first module** of the **SALAMA Framework**, a modular end-to-end **speech-to-speech AI system** built for African languages.
This model is fine-tuned from OpenAI's **Whisper-small** architecture for **Swahili speech recognition**, with the aim of improving performance on African accents and conversational data.

The model converts Swahili audio input into text transcriptions and serves as the entry point for the downstream LLM and TTS modules.

---

## 🧱 Model Architecture

SALAMA-STT uses the **Whisper-small** architecture, a **transformer encoder-decoder**, adapted here for low-resource Swahili audio transcription.
The model was fine-tuned on the **Mozilla Common Voice 17.0 Swahili** dataset to improve robustness to diverse accents and varying speech clarity.

| Parameter | Value |
|------------|--------|
| Base Model | `openai/whisper-small` |
| Fine-Tuning | Full model fine-tuning (fp16 precision) |
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Batch Size | 16 |
| Epochs | 10 |
| Frameworks | Transformers + Datasets + TorchAudio |
| Languages | Swahili (`sw`), English (`en`) |

---

## 📚 Dataset

| Dataset | Description | Purpose |
|----------|--------------|----------|
| `mozilla-foundation/common_voice_17_0` | 20 hours of Swahili speech data | Supervised fine-tuning |
| Custom local Swahili recordings | Conversational + accent-rich data | Accent robustness |
| Common Voice validation split | 2.3 hours | Evaluation |

---

## 🧠 Model Capabilities

- Speech-to-text transcription in **Swahili**
- Recognition of **African-accented Swahili**
- Handles short and long-form audio
- Supports integration with **SALAMA-LLM** for full voice assistants
- Provides timestamped segment transcriptions

---

## 📊 Evaluation Metrics

| Metric | Baseline (Whisper-small) | Fine-tuned (SALAMA-STT) | Improvement |
|---------|---------------------------|---------------------------|--------------|
| **WER (Word Error Rate)** | 1.15 | **0.43** | 🔻 62% (relative) |
| **CER (Character Error Rate)** | 0.39 | **0.18** | 🔻 54% (relative) |
| **Accuracy** | 85.2% | **95.4%** | +10.2 percentage points |

> Evaluation was conducted on a held-out Swahili validation set of about 2 hours from Common Voice (listed as 2.3 hours in the dataset table above).

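
For readers who want to reproduce this kind of comparison, the sketch below shows one common way to compute WER and CER: Levenshtein edit distance over words or characters, divided by the reference length. It is a minimal illustration and not the authors' evaluation script. The function names and the sample sentences are hypothetical, and no text normalization (casing, punctuation) is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(baseline, finetuned):
    """Relative error reduction, e.g. 0.5 means the error was halved."""
    return (baseline - finetuned) / baseline


if __name__ == "__main__":
    # Hypothetical reference/hypothesis pair, for illustration only.
    reference = "habari ya asubuhi"
    hypothesis = "habari za asubuhi"
    print(f"WER: {wer(reference, hypothesis):.3f}")
    print(f"CER: {cer(reference, hypothesis):.3f}")

    # Relative reductions implied by the scores reported in the table above.
    print(f"WER reduction: {relative_reduction(1.15, 0.43):.1%}")
    print(f"CER reduction: {relative_reduction(0.39, 0.18):.1%}")
```

Because inserted words count as errors, WER can exceed 1.0 when a hypothesis contains many more words than the reference, which is how a baseline score such as 1.15 can arise. Applied to the reported scores, the relative reductions work out to roughly 62.6% for WER and 53.8% for CER, in line with the table.
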
---

## ⚙️ Usage (Python Example)

Below is a quick example of Swahili speech transcription using this model:

```python
from transformers import pipeline

# Load the Swahili Whisper ASR pipeline
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    chunk_length_s=30,   # process long audio in 30-second chunks
    device_map="auto",   # automatic device placement (requires the accelerate package)
)

# Example audio file (replace with your file)
audio_path = "swahili_audio_sample.wav"

# Transcribe audio
result = asr_pipeline(audio_path)
print("🗣️ Transcription:")
print(result["text"])
```

**Example Output:**

> *"Karibu kwenye mfumo wa SALAMA unaosaidia kutambua na kuelewa sauti ya Kiswahili kwa usahihi mkubwa."*
> (English: "Welcome to the SALAMA system, which helps recognize and understand Swahili speech with great accuracy.")

---

## 🔍 Model Performance Summary

| Dataset | Metric | Score |
|----------|---------|-------|
| Common Voice 17.0 (test) | WER | **0.43** |
| Common Voice 17.0 (test) | CER | **0.18** |
| Local Swahili Test Set | Accuracy | **95.4%** |

---

## ⚡ Key Features

- 🎙️ **Accurate Swahili ASR** trained on diverse voices
- 🌍 **Adapted for African speech variations and dialects**
- 🧩 **Lightweight and compatible with SALAMA-LLM**
- 🔊 **Handles long-form recordings (≥30s)**
- 🚀 **Fast inference optimized with FP16 precision**

---

## 🚫 Limitations

- May misinterpret **code-mixed (Swahili-English)** speech
- Background noise and poor microphone quality reduce accuracy
- Domain-specific (medical/legal) terms may be transcribed inaccurately
- Performance may decline for **non-native Swahili speakers**

---

## 🔗 Related Models

| Model | Description |
|--------|-------------|
| [`EYEDOL/salama-llm`](https://huggingface.co/EYEDOL/SALAMA_LLM) | Swahili instruction-tuned LLM for reasoning and dialogue |
| [`EYEDOL/salama-tts`](https://huggingface.co/EYEDOL/SALAMA_TTS) | Swahili text-to-speech (VITS) model for natural speech synthesis |