---
license: apache-2.0
language:
- sw
base_model:
- facebook/mms-tts-swh
pipeline_tag: text-to-speech
datasets:
- mozilla-foundation/common_voice_17_0
metrics:
- wer
tags:
- text-to-speech
- audio
- speech
- transformers
- vits
- swahili
---

# SALAMA-TTS: Swahili Text-to-Speech Model

**Developer:** AI4NNOV
**Version:** v1.0
**License:** Apache 2.0
**Model Type:** Text-to-Speech (TTS)
**Base Model:** `facebook/mms-tts-swh` (fine-tuned)

---

## Overview

**SALAMA-TTS** is the **speech synthesis module** of the **SALAMA Framework**, a complete end-to-end **speech-to-speech AI system** for African languages.
It generates **natural, high-quality Swahili speech** from text and integrates with **SALAMA-LLM** and **SALAMA-STT** to power conversational voice assistants.

The model is based on **Meta's MMS (Massively Multilingual Speech)** TTS architecture using the **VITS framework**, fine-tuned for natural prosody, tone, and rhythm in Swahili.

---

## Model Architecture

SALAMA-TTS is built on the **VITS architecture**, which combines a **variational autoencoder (VAE)** with **adversarial (GAN) training** for realistic, expressive speech synthesis.

| Parameter | Value |
|------------|--------|
| Base Model | `facebook/mms-tts-swh` |
| Fine-Tuning | 8-bit quantized, LoRA fine-tuning |
| Optimizer | AdamW |
| Learning Rate | 2e-5 |
| Epochs | 20 |
| Sampling Rate | 16 kHz |
| Frameworks | Transformers + Datasets + PyTorch |
| Language | Swahili (`sw`) |
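
The table lists the fine-tuning settings but not the training script itself. A minimal sketch of how such a setup might be assembled with the Hugging Face `peft` and `bitsandbytes` integrations is shown below; the LoRA rank, alpha, and target modules are illustrative assumptions, not values published with this card.

```python
# Illustrative sketch only -- the training script is not published with this card.
# Assumes the Hugging Face peft + bitsandbytes integrations; the LoRA rank and
# target modules below are guesses, not settings from the card.
import torch
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig, VitsModel

# Load the base model with 8-bit weights (the "8-bit quantized" row above)
base = VitsModel.from_pretrained(
    "facebook/mms-tts-swh",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

# Attach LoRA adapters to the attention projections (hypothetical targets)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()

# The card's optimizer settings would then apply to the adapter weights
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```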

---

## Dataset

| Dataset | Description | Purpose |
|----------|--------------|----------|
| `common_voice_17_0` | Swahili voice dataset by Mozilla | Base training |
| Custom Swahili speech corpus | Locally recorded sentences and dialogues | Fine-tuning for naturalness |
| Common Voice Swahili (test split) | Held-out test data | Evaluation |
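
The Common Voice split can be pulled straight from the Hub with the `datasets` library. A minimal sketch, assuming you have accepted the dataset's terms and logged in with `huggingface-cli login` (the Mozilla Common Voice datasets are gated):

```python
# Sketch: loading the Common Voice 17.0 Swahili split used for base training.
from datasets import Audio, load_dataset

cv_sw = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train")

# Resample the audio column to the model's 16 kHz rate
cv_sw = cv_sw.cast_column("audio", Audio(sampling_rate=16_000))

print(cv_sw[0]["sentence"])  # transcript paired with the first audio clip
```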

---

## Model Capabilities

- Converts **Swahili text to natural-sounding speech**
- Handles **both formal and conversational registers**
- High clarity and prosody for long-form speech
- Integrates with **SALAMA-LLM** responses (see the pipeline sketch after this list)
- Output format: **16-bit PCM WAV**
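
In the full SALAMA Framework, this model sits at the end of a speech-to-speech loop. A schematic sketch of that wiring is below; `transcribe` and `generate_reply` are placeholders for SALAMA-STT and SALAMA-LLM, which ship separately, and `generate_speech_from_onnx` is the function defined in the usage example further down.

```python
# Schematic speech-to-speech loop; the two helpers are placeholders for the
# SALAMA-STT and SALAMA-LLM components, which are not part of this repository.

def transcribe(audio_path: str) -> str:
    """Placeholder for SALAMA-STT: audio in, Swahili text out."""
    raise NotImplementedError

def generate_reply(user_text: str) -> str:
    """Placeholder for SALAMA-LLM: user text in, assistant reply out."""
    raise NotImplementedError

def speech_to_speech(audio_path: str, sess, tokenizer) -> str:
    """Run one turn of the assistant loop and return the reply WAV path."""
    user_text = transcribe(audio_path)
    reply_text = generate_reply(user_text)
    return generate_speech_from_onnx(reply_text, sess, tokenizer)
```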

---

## Evaluation Metrics

| Metric | Score | Description |
|---------|-------|-------------|
| **MOS (Mean Opinion Score)** | **4.05 / 5.0** | Human-rated naturalness |
| **WER (Generated → STT)** | **0.21** | Evaluated by re-transcribing the synthesized audio |

> The MOS was rated by 12 native Swahili speakers across clarity, tone, and pronunciation.
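
The round-trip WER above is computed by re-transcribing the generated audio with an STT system and scoring it against the input text. A minimal sketch of that procedure, assuming an ASR wrapper is passed in as `transcribe` and using the `jiwer` package (`pip install jiwer`); `generate_speech_from_onnx` is defined in the usage example below:

```python
# Round-trip WER sketch: synthesize, re-transcribe, compare.
from jiwer import wer

def round_trip_wer(sentences, sess, tokenizer, transcribe) -> float:
    """Average WER between the input sentences and the STT transcripts
    of the corresponding synthesized audio."""
    hypotheses = []
    for text in sentences:
        wav_path = generate_speech_from_onnx(text, sess, tokenizer)
        hypotheses.append(transcribe(wav_path))  # Swahili ASR on the TTS output
    return wer(sentences, hypotheses)  # jiwer accepts lists of refs/hyps
```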

---

## Usage (Python Example)

```python
# Requirements:
# pip install onnxruntime soundfile transformers numpy
# For GPU inference: pip install onnxruntime-gpu (and ensure the CUDA toolkit is available)

import os

import numpy as np
import onnxruntime
import soundfile as sf
from transformers import AutoTokenizer

TTS_ONNX_MODEL_PATH = "swahili_tts.onnx"   # path to your .onnx file
TTS_TOKENIZER_ID = "facebook/mms-tts-swh"  # or whichever tokenizer you used
OUTPUT_SAMPLE_RATE = 16000
OUT_DIR = "tts_outputs"
os.makedirs(OUT_DIR, exist_ok=True)

def create_onnx_session(onnx_path: str) -> onnxruntime.InferenceSession:
    """Create an ONNX Runtime session using CUDA if available, otherwise CPU."""
    providers = ["CPUExecutionProvider"]
    if "CUDAExecutionProvider" in onnxruntime.get_available_providers():
        # Prefer CUDA when the GPU build of onnxruntime is installed
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        print("Using CUDAExecutionProvider for ONNX Runtime.")
    else:
        print("CUDA not available; using CPUExecutionProvider for ONNX Runtime.")
    return onnxruntime.InferenceSession(onnx_path, providers=providers)

def generate_speech_from_onnx(text: str,
                              onnx_session: onnxruntime.InferenceSession,
                              tokenizer,
                              out_path: str | None = None) -> str:
    """
    Synthesize speech from text using an ONNX TTS model.
    Returns the path to a WAV file (16 kHz, int16).
    """
    if not text:
        raise ValueError("Empty text provided.")

    # Tokenize to numpy inputs (match what the ONNX model expects).
    # NOTE: many TTS tokenizers return {"input_ids": ...}; adapt if yours differs.
    inputs = tokenizer(text, return_tensors="np", padding=True)

    # Identify the ONNX input name (assume the first input)
    input_name = onnx_session.get_inputs()[0].name

    # Prepare the inputs dict using the name expected by the ONNX model
    ort_inputs = {input_name: inputs["input_ids"].astype(np.int64)}

    # Run ONNX inference
    ort_outs = onnx_session.run(None, ort_inputs)

    # In many single-file TTS ONNX exports the first output is the raw waveform
    audio_array = ort_outs[0]

    # Flatten in case it is multi-dimensional, to ensure a 1-D waveform
    audio_waveform = audio_array.flatten()

    # If it is a float waveform in [-1, 1], clip and convert to int16;
    # otherwise cast to int16 as a safeguard
    if np.issubdtype(audio_waveform.dtype, np.floating):
        audio_clip = np.clip(audio_waveform, -1.0, 1.0)
        audio_int16 = (audio_clip * 32767.0).astype(np.int16)
    else:
        audio_int16 = audio_waveform.astype(np.int16)

    # Compose an output filename if none was given
    if out_path is None:
        out_path = os.path.join(OUT_DIR, f"salama_tts_{abs(hash(text)) & 0xFFFF_FFFF}.wav")

    # Save with soundfile at 16 kHz, 16-bit PCM
    sf.write(out_path, audio_int16, samplerate=OUTPUT_SAMPLE_RATE, subtype="PCM_16")
    return out_path

if __name__ == "__main__":
    # Example usage
    sess = create_onnx_session(TTS_ONNX_MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(TTS_TOKENIZER_ID)

    example_text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
    out_wav = generate_speech_from_onnx(example_text, sess, tokenizer)
    print("Saved synthesized audio to:", out_wav)
```

**Example Output:**
> *Audio plays:* "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
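
If you have not exported an ONNX file, the fine-tuned weights can also be run directly through the Transformers VITS implementation. A minimal PyTorch-only sketch; `MODEL_ID` below points at the base checkpoint as a placeholder, so substitute this repository's id for the fine-tuned voice:

```python
# PyTorch-only alternative to the ONNX path above.
import soundfile as sf
import torch
from transformers import AutoTokenizer, VitsModel

MODEL_ID = "facebook/mms-tts-swh"  # placeholder: use the fine-tuned repo id

model = VitsModel.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # VitsModel returns the raw waveform, float32 in [-1, 1]
    waveform = model(**inputs).waveform

sf.write("salama_tts_example.wav",
         waveform.squeeze().numpy(),
         samplerate=model.config.sampling_rate)  # 16 kHz for MMS-TTS
```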

---

## Key Features

- **Natural Swahili speech generation**
- **Adapted for African tonal variations**
- **High clarity and rhythm**
- **Fast inference with FP16 precision**
- **Compatible with SALAMA-STT and SALAMA-LLM**