	---
	license: apache-2.0
	datasets:
	- mozilla-foundation/common_voice_17_0
	language:
	- sw
	metrics:
	- wer
	- cer
	base_model:
	- openai/whisper-small
	pipeline_tag: automatic-speech-recognition
	---


	# 🗣️ SALAMA-STT — Swahili Whisper ASR Model

	Developer: AI4NNOV
	Authors: AI4NNOV
	Version: v1.0
	License: Apache 2.0
	Model Type: Automatic Speech Recognition (ASR)
	Base Model: `openai/whisper-small` (fine-tuned for Swahili)

	---
	## 🌍 Overview

	SALAMA-STT (Speech-to-Text) is the first module of the SALAMA Framework — a modular end-to-end speech-to-speech AI system built for African languages.
	This model is fine-tuned from OpenAI’s Whisper-small architecture for Swahili speech recognition, enhancing performance on African accents and conversational data.

	The model converts Swahili audio input into accurate transcriptions and serves as the entry point for downstream LLM and TTS modules.

	---

	## 🧱 Model Architecture

	SALAMA-STT uses the Whisper-small transformer encoder-decoder, fully fine-tuned for low-resource Swahili audio transcription.
	Fine-tuning used the Swahili subset of Mozilla Common Voice 17.0, with the goal of robustness to diverse accents and varying speech clarity.

	| Parameter | Value |
	|------------|--------|
	| Base Model | `openai/whisper-small` |
	| Fine-Tuning | Full model fine-tuning (fp16 precision) |
	| Optimizer | AdamW |
	| Learning Rate | 1e-5 |
	| Batch Size | 16 |
	| Epochs | 10 |
	| Frameworks | Transformers + Datasets + TorchAudio |
	| Languages | Swahili (`sw`), English (`en`) |
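
The hyperparameters in the table map naturally onto Hugging Face `Seq2SeqTrainingArguments`. The sketch below is one possible way to express them, not the authors' actual training script. Only the five values in `TRAINING_CONFIG` come from the table; the output directory, `predict_with_generate`, and the reading of "Batch Size" as a per-device value are assumptions.

```python
# Values taken from the table above; every other setting is an assumption.
TRAINING_CONFIG = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,  # assumes the table's batch size is per device
    "num_train_epochs": 10,
    "fp16": True,  # half-precision training
    "optim": "adamw_torch",  # AdamW as implemented in PyTorch
}


def build_training_args(output_dir="salama-stt-finetune"):
    """Build Seq2SeqTrainingArguments from TRAINING_CONFIG.

    The import is deferred so the config can be inspected without
    transformers installed. `output_dir` is a hypothetical path.
    """
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(
        output_dir=output_dir,
        predict_with_generate=True,  # assumption: decode text during evaluation for WER/CER
        **TRAINING_CONFIG,
    )
```

Calling `build_training_args()` needs `transformers` installed and, because `fp16=True`, typically a CUDA-capable GPU.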

	---
	## 📚 Dataset

	| Dataset | Description | Purpose |
	|----------|--------------|----------|
	| `mozilla-foundation/common_voice_17_0` | 20 hours of Swahili speech data | Supervised fine-tuning |
	| Custom local Swahili recordings | Conversational + accent-rich data | Accent robustness |
	| Common Voice validation split | 2.3 hours | Evaluation |
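
For supervised fine-tuning, each audio/transcript pair has to be turned into encoder features and decoder labels. The card does not describe the preprocessing that was used, so the function below is a hedged sketch of the common Whisper recipe. It assumes Common Voice style rows with an `audio` dict and a `sentence` transcript, and a `processor` object that behaves like a `transformers` `WhisperProcessor`; `prepare_example` is a hypothetical helper name.

```python
def prepare_example(example, processor):
    """Convert one Common Voice style row into Whisper training inputs.

    `example` must contain an "audio" dict (with "array" and
    "sampling_rate") and a "sentence" transcript. `processor` is expected
    to expose `feature_extractor` and `tokenizer` like a WhisperProcessor.
    """
    audio = example["audio"]
    # Log-Mel features for the encoder; Whisper expects 16 kHz input audio.
    features = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Token ids of the reference transcript, used as decoder labels.
    labels = processor.tokenizer(example["sentence"]).input_ids
    return {"input_features": features, "labels": labels}
```

With the `datasets` library, a function like this would usually be applied through `dataset.map(prepare_example, fn_kwargs={"processor": processor})` after casting the audio column to 16 kHz.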

	---
	## 🧠 Model Capabilities

	- Speech-to-text transcription in Swahili
	- Recognition of African-accented Swahili
	- Handles short and long-form audio
	- Supports integration with SALAMA-LLM for full voice assistants
	- Provides timestamped segment transcriptions

	---
	## 📊 Evaluation Metrics

	| Metric | Baseline (Whisper-small) | Fine-tuned (SALAMA-STT) | Improvement |
	|---------|---------------------------|---------------------------|--------------|
	| WER (Word Error Rate) | 1.15 | 0.43 | 🔻 62% (relative) |
	| CER (Character Error Rate) | 0.39 | 0.18 | 🔻 54% (relative) |
	| Accuracy | 85.2% | 95.4% | +10.2 percentage points |

	> Evaluation was conducted on a held-out Swahili validation set from Common Voice (about 2 hours of audio).
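
To make the metrics concrete, here is a minimal, dependency-free sketch of how WER, CER, and the relative reductions in the table can be computed. The card does not say which tool produced the reported scores, so the function names are illustrative and no text normalization is applied. Note that WER can exceed 1 (as in the 1.15 baseline) when the hypothesis contains more errors, including inserted words, than the reference has words.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count (non-empty reference)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length (non-empty reference)."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(baseline, finetuned):
    """Fraction by which an error rate dropped relative to the baseline."""
    return (baseline - finetuned) / baseline


print(wer("habari ya asubuhi", "habari za asubuhi"))  # one substituted word out of three
print(f"{relative_reduction(1.15, 0.43):.1%}")  # WER row of the table
print(f"{relative_reduction(0.39, 0.18):.1%}")  # CER row of the table
```

Applied to the table's rounded values, the relative reductions come out at roughly 62.6% for WER and 53.8% for CER, in line with the reported 62% and 54%.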

	---
	## ⚙️ Usage (Python Example)

	Below is a quick example for Swahili speech transcription using this model:

	```python
	from transformers import pipeline

	# Load Swahili Whisper ASR
	asr_pipeline = pipeline(
	    "automatic-speech-recognition",
	    model="EYEDOL/salama-stt",
	    chunk_length_s=30,  # split long recordings into 30-second windows
	    device_map="auto",  # requires the accelerate package
	)

	# Example audio file (replace with your file)
	audio_path = "swahili_audio_sample.wav"

	# Transcribe audio
	result = asr_pipeline(audio_path)

	print("🗣️ Transcription:")
	print(result["text"])
	```

	Example Output:

	> “Karibu kwenye mfumo wa SALAMA unaosaidia kutambua na kuelewa sauti ya Kiswahili kwa usahihi mkubwa.”
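
The capabilities list mentions timestamped segment transcriptions. One way to obtain them, sketched here under assumptions, is the `return_timestamps=True` option of the `transformers` ASR pipeline, which adds a `chunks` list to the result. The helper names `format_segments` and `transcribe_with_timestamps` are hypothetical, and the model id is simply the one used in the example above.

```python
def format_segments(chunks):
    """Render pipeline chunks as '[start - end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # The last chunk's end time may be None if no closing timestamp was predicted.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_label}] {chunk['text'].strip()}")
    return lines


def transcribe_with_timestamps(audio_path, model_id="EYEDOL/salama-stt"):
    """Transcribe a file and return one formatted line per timestamped segment."""
    from transformers import pipeline  # deferred: only needed when a model is loaded

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,
    )
    result = asr(audio_path, return_timestamps=True)
    return format_segments(result["chunks"])
```

For example, `print("\n".join(transcribe_with_timestamps("swahili_audio_sample.wav")))` would print one line per segment, with the file name standing in for your own recording.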

	---
	## 🔍 Model Performance Summary

	| Dataset | Metric | Score |
	|----------|---------|-------|
	| Common Voice 17.0 (test) | WER | 0.43 |
	| Common Voice 17.0 (test) | CER | 0.18 |
	| Local Swahili Test Set | Accuracy | 95.4% |

	---
	## ⚡ Key Features

	- 🎙️ Accurate Swahili ASR trained on diverse voices
	- 🌍 Adapted for African speech variations and dialects
	- 🧩 Lightweight and compatible with SALAMA-LLM
	- 🔊 Handles long-form recordings (≥30s)
	- 🚀 Fast inference optimized with FP16 precision

	---
	## 🚫 Limitations

	- May misinterpret code-mixed (Swahili-English) speech
	- Background noise and poor microphone quality reduce accuracy
	- Domain-specific (medical/legal) terms may be transcribed inaccurately
	- Performance may decline for non-native Swahili speakers

	---
	## 🔗 Related Models

	| Model | Description |
	|--------|-------------|
	| [`EYEDOL/salama-llm`](https://huggingface.co/EYEDOL/SALAMA_LLM) | Swahili instruction-tuned LLM for reasoning and dialogue |
	| [`EYEDOL/salama-tts`](https://huggingface.co/EYEDOL/SALAMA_TTS) | Swahili text-to-speech (VITS) model for natural speech synthesis |