|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- sw |
|
|
base_model: |
|
|
- facebook/mms-tts-swh
|
|
pipeline_tag: text-to-speech |
|
|
datasets: |
|
|
- mozilla-foundation/common_voice_17_0 |
|
|
metrics: |
|
|
- wer |
|
|
tags: |
|
|
- text-to-speech |
|
|
- audio |
|
|
- speech |
|
|
- transformers |
|
|
- vits |
|
|
- swahili |
|
|
--- |
|
|
|
|
|
|
|
|
# SALAMA-TTS: Swahili Text-to-Speech Model
|
|
|
|
|
**Developer:** AI4NNOV |
|
|
|
|
|
**Version:** v1.0 |
|
|
**License:** Apache 2.0 |
|
|
**Model Type:** Text-to-Speech (TTS) |
|
|
**Base Model:** `facebook/mms-tts-swh` (fine-tuned) |
|
|
|
|
|
--- |
|
|
|
|
|
## Overview
|
|
|
|
|
**SALAMA-TTS** is the **speech synthesis module** of the **SALAMA Framework**, a complete end-to-end **Speech-to-Speech AI system** for African languages. |
|
|
It generates **natural, high-quality Swahili speech** from text and integrates seamlessly with **SALAMA-LLM** and **SALAMA-STT** for conversational voice assistants. |
|
|
|
|
|
The model is based on **Meta's MMS (Massively Multilingual Speech)** TTS architecture using the **VITS framework**, fine-tuned for natural prosody, tone, and rhythm in Swahili.
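
Because SALAMA-TTS follows the standard MMS/VITS interface, a minimal PyTorch sketch with the upstream `facebook/mms-tts-swh` checkpoint and the standard `transformers` VITS API looks like the following; substitute your local SALAMA-TTS checkpoint path if you have one (the ONNX-based pipeline is shown in the Usage section below).

```python
# Minimal sketch: direct PyTorch inference with the upstream MMS-TTS VITS checkpoint.
# Swap the model id for a local SALAMA-TTS checkpoint if available.
import torch
import soundfile as sf
from transformers import VitsModel, AutoTokenizer

model_id = "facebook/mms-tts-swh"
model = VitsModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Habari ya leo?", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # (batch, samples), float in [-1, 1]

sf.write("habari.wav", waveform.squeeze().numpy(), model.config.sampling_rate)
```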
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture
|
|
|
|
|
SALAMA-TTS is built on the **VITS architecture**, an end-to-end model that combines a **conditional variational autoencoder (VAE)** with **adversarial (GAN-style) training** for realistic and expressive speech synthesis.
|
|
|
|
|
| Parameter | Value | |
|
|
|------------|--------| |
|
|
| Base Model | `facebook/mms-tts-swh` | |
|
|
| Fine-Tuning | 8-bit quantized, LoRA fine-tuning | |
|
|
| Optimizer | AdamW | |
|
|
| Learning Rate | 2e-5 | |
|
|
| Epochs | 20 | |
|
|
| Sampling Rate | 16 kHz |
|
|
| Frameworks | Transformers + Datasets + PyTorch | |
|
|
| Language | Swahili (`sw`) | |
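
For readers who want to reproduce a comparable setup, the sketch below illustrates the LoRA configuration from the table using `peft`. It is not the exact SALAMA training script: the LoRA rank, alpha, and `target_modules` names are assumptions and should be checked against the module names in your VITS checkpoint, and true 8-bit loading would additionally require `bitsandbytes`.

```python
# Illustrative LoRA setup (not the exact SALAMA training script).
import torch
from transformers import VitsModel
from peft import LoraConfig, get_peft_model

model = VitsModel.from_pretrained("facebook/mms-tts-swh")

lora_config = LoraConfig(
    r=8,               # assumed rank; not stated in the table
    lora_alpha=16,     # assumed scaling; not stated in the table
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # matches the table above
```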
|
|
|
|
|
--- |
|
|
|
|
|
## Dataset
|
|
|
|
|
| Dataset | Description | Purpose | |
|
|
|----------|--------------|----------| |
|
|
| `common_voice_17_0` | Swahili voice dataset by Mozilla | Base training | |
|
|
| Custom Swahili speech corpus | Locally recorded sentences and dialogues | Fine-tuning naturalness | |
|
|
| Common Voice Swahili (test split) | Held-out test portion of `common_voice_17_0` | Evaluation |
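
As a reference, a minimal sketch for loading the Swahili split of Common Voice 17.0 with `datasets` is shown below. The dataset is gated on the Hugging Face Hub, so you must be authenticated and have accepted its terms; depending on your `datasets` version, `trust_remote_code=True` may also be required.

```python
# Sketch: load the Swahili subset of Common Voice 17.0 and resample to 16 kHz.
# Requires prior authentication (e.g. `huggingface-cli login`) and accepted terms.
from datasets import load_dataset, Audio

cv_sw = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train")
cv_sw = cv_sw.cast_column("audio", Audio(sampling_rate=16_000))  # match the 16 kHz target

print(cv_sw[0]["sentence"])
```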
|
|
|
|
|
--- |
|
|
|
|
|
## Model Capabilities
|
|
|
|
|
- Converts **Swahili text to natural-sounding speech** |
|
|
- Handles **both formal and conversational** registers
|
|
- High clarity and prosody for long-form speech |
|
|
- Seamless integration with **SALAMA-LLM** responses |
|
|
- Output format: **16-bit PCM WAV** |
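
A quick way to confirm the output format is to inspect a generated file's header with `soundfile`; the filename below is hypothetical, so use a file produced by the Usage example further down.

```python
# Verify the synthesized WAV is 16 kHz, 16-bit PCM (filename is hypothetical).
import soundfile as sf

info = sf.info("tts_outputs/salama_tts_example.wav")
print(info.samplerate, info.subtype)  # expect: 16000 PCM_16
```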
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation Metrics
|
|
|
|
|
| Metric | Score | Description | |
|
|
|---------|-------|-------------| |
|
|
| **MOS (Mean Opinion Score)** | **4.05 / 5.0** | Human-rated naturalness | |
|
|
| **WER (TTS → STT)** | **0.21** | Evaluated by re-transcribing synthesized audio |
|
|
|
|
|
> The MOS was rated by 12 native Swahili speakers on clarity, tone, and pronunciation.
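
To reproduce a comparable round-trip WER check, re-transcribe synthesized audio with any Swahili-capable STT model and score it with `jiwer`. The sketch below uses `openai/whisper-small` as a stand-in scorer; this is an assumption, not necessarily the model behind the reported 0.21.

```python
# Round-trip WER sketch: score a re-transcription of synthesized audio with jiwer.
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

reference = "Karibu kwenye mfumo wa SALAMA."
hypothesis = asr("salama_tts_sample.wav")["text"]  # WAV synthesized from `reference`

print("Round-trip WER:", jiwer.wer(reference.lower(), hypothesis.lower()))
```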
|
|
|
|
|
--- |
|
|
|
|
|
## Usage (Python Example)
|
|
|
|
|
```python
# Requirements:
#   pip install onnxruntime soundfile transformers numpy
#   For GPU inference: pip install onnxruntime-gpu (and ensure a compatible CUDA toolkit is installed)

import os
from typing import Optional

import numpy as np
import onnxruntime
import soundfile as sf
from transformers import AutoTokenizer

TTS_ONNX_MODEL_PATH = "swahili_tts.onnx"    # path to your .onnx file
TTS_TOKENIZER_ID = "facebook/mms-tts-swh"   # or whichever tokenizer you used
OUTPUT_SAMPLE_RATE = 16000
OUT_DIR = "tts_outputs"
os.makedirs(OUT_DIR, exist_ok=True)


def create_onnx_session(onnx_path: str) -> onnxruntime.InferenceSession:
    """Create an ONNX Runtime session using the GPU if available, otherwise the CPU."""
    if "CUDAExecutionProvider" in onnxruntime.get_available_providers():
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        print("Using CUDAExecutionProvider for ONNX Runtime.")
    else:
        providers = ["CPUExecutionProvider"]
        print("CUDA not available; using CPUExecutionProvider for ONNX Runtime.")
    return onnxruntime.InferenceSession(onnx_path, providers=providers)


def generate_speech_from_onnx(text: str,
                              onnx_session: onnxruntime.InferenceSession,
                              tokenizer,
                              out_path: Optional[str] = None) -> str:
    """
    Synthesize speech from text using an ONNX TTS model.
    Returns the path to a 16 kHz, 16-bit PCM WAV file.
    """
    if not text:
        raise ValueError("Empty text provided.")

    # Tokenize to NumPy inputs (match what the ONNX model expects).
    # NOTE: many TTS tokenizers return {"input_ids": np.array(...)}; adapt if yours differs.
    inputs = tokenizer(text, return_tensors="np", padding=True)

    # Identify the ONNX input name (assume the first input) and build the feed dict.
    input_name = onnx_session.get_inputs()[0].name
    ort_inputs = {input_name: inputs["input_ids"].astype(np.int64)}

    # Run inference. In many single-file TTS ONNX exports the first output is the
    # raw waveform, or a float array convertible to one.
    ort_outs = onnx_session.run(None, ort_inputs)
    audio_array = ort_outs[0]

    # Flatten in case the output is multi-dimensional, to get a 1-D waveform.
    audio_waveform = audio_array.flatten()

    # If it is a float waveform in [-1, 1], clip and convert to int16;
    # otherwise cast to int16 as a safeguard.
    if np.issubdtype(audio_waveform.dtype, np.floating):
        audio_clip = np.clip(audio_waveform, -1.0, 1.0)
        audio_int16 = (audio_clip * 32767.0).astype(np.int16)
    else:
        audio_int16 = audio_waveform.astype(np.int16)

    # Compose an output filename if none was given.
    if out_path is None:
        out_path = os.path.join(OUT_DIR, f"salama_tts_{abs(hash(text)) & 0xFFFF_FFFF}.wav")

    # Save as 16 kHz, 16-bit PCM WAV.
    sf.write(out_path, audio_int16, samplerate=OUTPUT_SAMPLE_RATE, subtype="PCM_16")
    return out_path


if __name__ == "__main__":
    # Example usage
    sess = create_onnx_session(TTS_ONNX_MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(TTS_TOKENIZER_ID)

    example_text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
    out_wav = generate_speech_from_onnx(example_text, sess, tokenizer)
    print("Saved synthesized audio to:", out_wav)
```
|
|
|
|
|
**Example Output:** |
|
|
> *Audio plays:* "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
|
|
|
|
|
--- |
|
|
|
|
|
## Key Features
|
|
|
|
|
- **Natural Swahili speech generation**

- **Adapted for African tonal variations**

- **High clarity and rhythm**

- **Fast inference with FP16 precision**

- **Compatible with SALAMA-STT and SALAMA-LLM**
|
|
|
|
|
|
|
|
|