---
license: apache-2.0
language:
- sw
base_model:
- facebook/mms-tts
pipeline_tag: text-to-speech
datasets:
- mozilla-foundation/common_voice_17_0
metrics:
- wer
tags:
- text-to-speech
- audio
- speech
- transformers
- vits
- swahili
---
# 🔊 SALAMA-TTS: Swahili Text-to-Speech Model
**Developer:** AI4NNOV
**Version:** v1.0
**License:** Apache 2.0
**Model Type:** Text-to-Speech (TTS)
**Base Model:** `facebook/mms-tts-swh` (fine-tuned)
---
## 🌍 Overview
**SALAMA-TTS** is the **speech synthesis module** of the **SALAMA Framework**, a complete end-to-end **Speech-to-Speech AI system** for African languages.
It generates **natural, high-quality Swahili speech** from text and integrates seamlessly with **SALAMA-LLM** and **SALAMA-STT** for conversational voice assistants.
The model is based on **Meta's MMS (Massively Multilingual Speech)** TTS architecture using the **VITS framework**, fine-tuned for natural prosody, tone, and rhythm in Swahili.
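
The way the three SALAMA modules fit together can be sketched as a simple function chain. This is only an illustration: the interfaces of SALAMA-STT and SALAMA-LLM are not documented in this card, so the callable types `Transcribe`, `Respond`, and `Synthesize`, the function `speech_to_speech`, and the stub stages are all hypothetical names introduced here.

```python
from typing import Callable

# Hypothetical stage signatures; the real SALAMA-STT / SALAMA-LLM APIs
# are not described in this card.
Transcribe = Callable[[bytes], str]   # input audio bytes -> Swahili text
Respond = Callable[[str], str]        # user text -> reply text
Synthesize = Callable[[str], bytes]   # reply text -> output audio bytes


def speech_to_speech(audio_in: bytes,
                     stt: Transcribe,
                     llm: Respond,
                     tts: Synthesize) -> bytes:
    """Chain the three stages: STT -> LLM -> TTS."""
    user_text = stt(audio_in)
    reply_text = llm(user_text)
    return tts(reply_text)


# Stub stages so the sketch runs without any model files.
def fake_stt(audio: bytes) -> str:
    return "habari"


def fake_llm(text: str) -> str:
    return f"Jibu: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = speech_to_speech(b"\x00\x01", fake_stt, fake_llm, fake_tts)
```

In a real deployment the `tts` argument would wrap the synthesis function from the usage section, and the other two stages would wrap whatever the STT and LLM modules expose.
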
---
## 🧱 Model Architecture
SALAMA-TTS is built on the **VITS architecture**, combining the strengths of **variational autoencoders (VAE)** and **GANs** for realistic and expressive speech synthesis.
| Parameter | Value |
|------------|--------|
| Base Model | `facebook/mms-tts-swh` |
| Fine-Tuning | 8-bit quantized, LoRA fine-tuning |
| Optimizer | AdamW |
| Learning Rate | 2e-5 |
| Epochs | 20 |
| Sampling Rate | 16 kHz |
| Frameworks | Transformers + Datasets + PyTorch |
| Language | Swahili (`sw`) |
---
## 📚 Dataset
| Dataset | Description | Purpose |
|----------|--------------|----------|
| `common_voice_17_0` | Swahili voice dataset by Mozilla | Base training |
| Custom Swahili speech corpus | Locally recorded sentences and dialogues | Fine-tuning naturalness |
| `common_voice_17_0` (test split) | Common Voice Swahili test split | Evaluation |
---
## 🧠 Model Capabilities
- Converts **Swahili text to natural-sounding speech**
- Handles **both formal and conversational** tone
- High clarity and prosody for long-form speech
- Seamless integration with **SALAMA-LLM** responses
- Output format: **16-bit PCM WAV**
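
To confirm that a generated file really is 16-bit PCM at 16 kHz, the header can be read back with Python's standard `wave` module. The sketch below is one possible check, not part of the SALAMA tooling: `describe_wav` and `is_16bit_pcm_16khz` are illustrative helper names, and the demo writes a short silent clip in memory so it runs without a model.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Return basic format fields of a WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate": wav.getframerate(),
            "num_frames": wav.getnframes(),
        }


def is_16bit_pcm_16khz(info: dict) -> bool:
    """True if samples are 2 bytes wide (16-bit) and the rate is 16 kHz."""
    return info["sample_width_bytes"] == 2 and info["sample_rate"] == 16000


# Self-contained demo: write 0.1 s of mono silence into an in-memory buffer.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)          # 2 bytes per sample = 16-bit
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 1600)

buffer.seek(0)
info = describe_wav(buffer)
```

For a file produced by the usage example, pass its path instead of the buffer, for instance `describe_wav("tts_outputs/your_file.wav")`.
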
---
## 📊 Evaluation Metrics
| Metric | Score | Description |
|---------|-------|-------------|
| **MOS (Mean Opinion Score)** | **4.05 / 5.0** | Human-rated naturalness |
| **WER (Generated → STT)** | **0.21** | Evaluated by re-transcribing synthesized audio |
> The MOS was evaluated by 12 native Swahili speakers across clarity, tone, and pronunciation.
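
The WER figure compares the input text with an STT transcript of the synthesized audio. The card does not say which STT model or text normalization was used, so the following is only a sketch of the standard definition, WER = (substitutions + deletions + insertions) / reference word count, computed with a word-level edit distance. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("Reference transcript is empty.")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# The STT output drops one of five reference words.
wer = word_error_rate(
    "karibu kwenye mfumo wa salama",
    "karibu kwenye mfumo salama",
)
```

Here `reference` is the text given to the TTS model and `hypothesis` is the transcript of the generated audio; averaging this over a test set gives a corpus-level score comparable in spirit to the one in the table.
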
---
## ⚙️ Usage (Python Example)
```python
# Requirements:
#   pip install onnxruntime soundfile transformers numpy
#   For GPU inference: pip install onnxruntime-gpu (requires a compatible CUDA installation)
import hashlib
import os
from typing import Optional

import numpy as np
import onnxruntime
import soundfile as sf
from transformers import AutoTokenizer, PreTrainedTokenizerBase

TTS_ONNX_MODEL_PATH = "swahili_tts.onnx"   # path to your .onnx file
TTS_TOKENIZER_ID = "facebook/mms-tts-swh"  # or whichever tokenizer you used
OUTPUT_SAMPLE_RATE = 16000
OUT_DIR = "tts_outputs"
os.makedirs(OUT_DIR, exist_ok=True)


def create_onnx_session(onnx_path: str) -> onnxruntime.InferenceSession:
    """Create an ONNX Runtime session using GPU if available, otherwise CPU."""
    # Requesting an unavailable provider does not reliably raise, so check first.
    if "CUDAExecutionProvider" in onnxruntime.get_available_providers():
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    else:
        providers = ["CPUExecutionProvider"]
    sess = onnxruntime.InferenceSession(onnx_path, providers=providers)
    # Report the providers the session actually ended up with.
    print("ONNX Runtime providers:", sess.get_providers())
    return sess


def generate_speech_from_onnx(text: str,
                              onnx_session: onnxruntime.InferenceSession,
                              tokenizer: PreTrainedTokenizerBase,
                              out_path: Optional[str] = None) -> str:
    """
    Synthesize speech from text using an ONNX TTS model.
    Returns the path to a WAV file (16 kHz, 16-bit PCM).
    """
    if not text or not text.strip():
        raise ValueError("Empty text provided.")

    # Tokenize to NumPy inputs (match what the ONNX model expects).
    # NOTE: many TTS tokenizers return {"input_ids": np.array(...)}; adapt if yours differs.
    inputs = tokenizer(text, return_tensors="np", padding=True)

    # Assume the first ONNX input takes the token IDs;
    # inspect onnx_session.get_inputs() if your export has more inputs.
    input_name = onnx_session.get_inputs()[0].name
    ort_inputs = {input_name: inputs["input_ids"].astype(np.int64)}

    # Run inference. In many single-file TTS exports the first output is the waveform.
    ort_outs = onnx_session.run(None, ort_inputs)
    audio_waveform = np.asarray(ort_outs[0]).flatten()  # force a 1-D waveform

    if np.issubdtype(audio_waveform.dtype, np.floating):
        # Float waveform in [-1, 1]: clip, then scale to int16.
        audio_int16 = (np.clip(audio_waveform, -1.0, 1.0) * 32767.0).astype(np.int16)
    else:
        # Already integer samples: cast as a safeguard.
        audio_int16 = audio_waveform.astype(np.int16)

    if out_path is None:
        # hashlib gives a file name that is stable across runs;
        # the built-in hash() of a str is salted per process.
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]
        out_path = os.path.join(OUT_DIR, f"salama_tts_{digest}.wav")

    # Save as 16-bit PCM at 16 kHz.
    sf.write(out_path, audio_int16, samplerate=OUTPUT_SAMPLE_RATE, subtype="PCM_16")
    return out_path


if __name__ == "__main__":
    # Example usage
    sess = create_onnx_session(TTS_ONNX_MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(TTS_TOKENIZER_ID)
    example_text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
    out_wav = generate_speech_from_onnx(example_text, sess, tokenizer)
    print("Saved synthesized audio to:", out_wav)
```
**Example Output:**
> *Audio plays:* “Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili.”
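
The usage example feeds only the token IDs to the first ONNX input. Some exports expect additional inputs such as an attention mask, so it can help to match tokenizer outputs to the session's input names explicitly. The sketch below shows one way to do that; `select_onnx_inputs` is an illustrative helper rather than part of any library, and the input names and values in the demo are stand-ins, not the actual signature of the SALAMA-TTS export.

```python
from typing import Any, Dict, List


def select_onnx_inputs(expected_names: List[str],
                       encoded: Dict[str, Any]) -> Dict[str, Any]:
    """Keep exactly the tokenizer outputs that the ONNX graph asks for."""
    missing = [name for name in expected_names if name not in encoded]
    if missing:
        raise KeyError(f"Tokenizer output lacks ONNX inputs: {missing}")
    return {name: encoded[name] for name in expected_names}


# Stand-in values so the sketch runs without a model. With a real session:
#   expected_names = [i.name for i in onnx_session.get_inputs()]
#   encoded = dict(tokenizer(text, return_tensors="np"))
expected_names = ["input_ids", "attention_mask"]
encoded = {
    "input_ids": [[5, 9, 2]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],  # ignored: the graph does not request it
}
ort_inputs = select_onnx_inputs(expected_names, encoded)
```

With real NumPy arrays, integer inputs may still need a cast to `np.int64` before calling `onnx_session.run`, as in the main example.
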
---
## ⚡ Key Features
- 🗣️ **Natural Swahili speech generation**
- 🌍 **Adapted for African tonal variations**
- 🔉 **High clarity and rhythm**
- ⚙️ **Fast inference with FP16 precision**
- 🔗 **Compatible with SALAMA-STT and SALAMA-LLM**