---
license: apache-2.0
language:
- sw
base_model:
- facebook/mms-tts-swh
pipeline_tag: text-to-speech
datasets:
- mozilla-foundation/common_voice_17_0
metrics:
- wer
tags:
- text-to-speech
- audio
- speech
- transformers
- vits
- swahili
---

# SALAMA-TTS: Swahili Text-to-Speech Model

**Developer:** AI4NNOV
**Version:** v1.0
**License:** Apache 2.0
**Model Type:** Text-to-Speech (TTS)
**Base Model:** `facebook/mms-tts-swh` (fine-tuned)

---

## Overview

**SALAMA-TTS** is the **speech synthesis module** of the **SALAMA Framework**, a complete end-to-end **speech-to-speech AI system** for African languages.
It generates **natural, high-quality Swahili speech** from text and integrates with **SALAMA-LLM** and **SALAMA-STT** to power conversational voice assistants.

The model is based on **Meta's MMS (Massively Multilingual Speech)** TTS architecture using the **VITS framework**, fine-tuned for natural prosody, tone, and rhythm in Swahili.

---

## Model Architecture

SALAMA-TTS is built on the **VITS architecture**, which combines a **variational autoencoder (VAE)** with **adversarial (GAN) training** for realistic, expressive speech synthesis.

| Parameter | Value |
|------------|--------|
| Base Model | `facebook/mms-tts-swh` |
| Fine-Tuning | 8-bit quantized, LoRA fine-tuning |
| Optimizer | AdamW |
| Learning Rate | 2e-5 |
| Epochs | 20 |
| Sampling Rate | 16 kHz |
| Frameworks | Transformers + Datasets + PyTorch |
| Language | Swahili (`sw`) |
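
The table lists the fine-tuning settings but not the training script itself. A minimal sketch of how such a setup might be assembled with the Hugging Face `peft` and `bitsandbytes` integrations is shown below; the LoRA rank, alpha, and target modules are illustrative assumptions, not values published with this card.

```python
# Illustrative sketch only -- the training script is not published with this card.
# Assumes the Hugging Face peft + bitsandbytes integrations; the LoRA rank and
# target modules below are guesses, not settings from the card.
import torch
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig, VitsModel

# Load the base model with 8-bit weights (the "8-bit quantized" row above)
base = VitsModel.from_pretrained(
    "facebook/mms-tts-swh",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

# Attach LoRA adapters to the attention projections (hypothetical targets)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()

# The card's optimizer settings would then apply to the adapter weights
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```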

---

## Dataset

| Dataset | Description | Purpose |
|----------|--------------|----------|
| `common_voice_17_0` | Swahili voice dataset by Mozilla | Base training |
| Custom Swahili speech corpus | Locally recorded sentences and dialogues | Fine-tuning for naturalness |
| Common Voice Swahili (test split) | Held-out test data | Evaluation |
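
The Common Voice split can be pulled straight from the Hub with the `datasets` library. A minimal sketch, assuming you have accepted the dataset's terms and logged in with `huggingface-cli login` (the Mozilla Common Voice datasets are gated):

```python
# Sketch: loading the Common Voice 17.0 Swahili split used for base training.
from datasets import Audio, load_dataset

cv_sw = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train")

# Resample the audio column to the model's 16 kHz rate
cv_sw = cv_sw.cast_column("audio", Audio(sampling_rate=16_000))

print(cv_sw[0]["sentence"])  # transcript paired with the first audio clip
```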

---

## Model Capabilities

- Converts **Swahili text to natural-sounding speech**
- Handles **both formal and conversational registers**
- High clarity and prosody for long-form speech
- Integrates with **SALAMA-LLM** responses (see the pipeline sketch after this list)
- Output format: **16-bit PCM WAV**
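
In the full SALAMA Framework, this model sits at the end of a speech-to-speech loop. A schematic sketch of that wiring is below; `transcribe` and `generate_reply` are placeholders for SALAMA-STT and SALAMA-LLM, which ship separately, and `generate_speech_from_onnx` is the function defined in the usage example further down.

```python
# Schematic speech-to-speech loop; the two helpers are placeholders for the
# SALAMA-STT and SALAMA-LLM components, which are not part of this repository.

def transcribe(audio_path: str) -> str:
    """Placeholder for SALAMA-STT: audio in, Swahili text out."""
    raise NotImplementedError

def generate_reply(user_text: str) -> str:
    """Placeholder for SALAMA-LLM: user text in, assistant reply out."""
    raise NotImplementedError

def speech_to_speech(audio_path: str, sess, tokenizer) -> str:
    """Run one turn of the assistant loop and return the reply WAV path."""
    user_text = transcribe(audio_path)
    reply_text = generate_reply(user_text)
    return generate_speech_from_onnx(reply_text, sess, tokenizer)
```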

---

## Evaluation Metrics

| Metric | Score | Description |
|---------|-------|-------------|
| **MOS (Mean Opinion Score)** | **4.05 / 5.0** | Human-rated naturalness |
| **WER (Generated → STT)** | **0.21** | Evaluated by re-transcribing the synthesized audio |

> The MOS was rated by 12 native Swahili speakers across clarity, tone, and pronunciation.
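
The round-trip WER above is computed by re-transcribing the generated audio with an STT system and scoring it against the input text. A minimal sketch of that procedure, assuming an ASR wrapper is passed in as `transcribe` and using the `jiwer` package (`pip install jiwer`); `generate_speech_from_onnx` is defined in the usage example below:

```python
# Round-trip WER sketch: synthesize, re-transcribe, compare.
from jiwer import wer

def round_trip_wer(sentences, sess, tokenizer, transcribe) -> float:
    """Average WER between the input sentences and the STT transcripts
    of the corresponding synthesized audio."""
    hypotheses = []
    for text in sentences:
        wav_path = generate_speech_from_onnx(text, sess, tokenizer)
        hypotheses.append(transcribe(wav_path))  # Swahili ASR on the TTS output
    return wer(sentences, hypotheses)  # jiwer accepts lists of refs/hyps
```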

---

## Usage (Python Example)

```python
# Requirements:
# pip install onnxruntime soundfile transformers numpy
# For GPU inference: pip install onnxruntime-gpu (and ensure the CUDA toolkit is available)

import os

import numpy as np
import onnxruntime
import soundfile as sf
from transformers import AutoTokenizer

TTS_ONNX_MODEL_PATH = "swahili_tts.onnx"   # path to your .onnx file
TTS_TOKENIZER_ID = "facebook/mms-tts-swh"  # or whichever tokenizer you used
OUTPUT_SAMPLE_RATE = 16000
OUT_DIR = "tts_outputs"
os.makedirs(OUT_DIR, exist_ok=True)

def create_onnx_session(onnx_path: str) -> onnxruntime.InferenceSession:
    """Create an ONNX Runtime session using CUDA if available, otherwise CPU."""
    providers = ["CPUExecutionProvider"]
    if "CUDAExecutionProvider" in onnxruntime.get_available_providers():
        # Prefer CUDA when the GPU build of onnxruntime is installed
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        print("Using CUDAExecutionProvider for ONNX Runtime.")
    else:
        print("CUDA not available; using CPUExecutionProvider for ONNX Runtime.")
    return onnxruntime.InferenceSession(onnx_path, providers=providers)

def generate_speech_from_onnx(text: str,
                              onnx_session: onnxruntime.InferenceSession,
                              tokenizer,
                              out_path: str | None = None) -> str:
    """
    Synthesize speech from text using an ONNX TTS model.
    Returns the path to a WAV file (16 kHz, int16).
    """
    if not text:
        raise ValueError("Empty text provided.")

    # Tokenize to numpy inputs (match what the ONNX model expects).
    # NOTE: many TTS tokenizers return {"input_ids": ...}; adapt if yours differs.
    inputs = tokenizer(text, return_tensors="np", padding=True)

    # Identify the ONNX input name (assume the first input)
    input_name = onnx_session.get_inputs()[0].name

    # Prepare the inputs dict using the name expected by the ONNX model
    ort_inputs = {input_name: inputs["input_ids"].astype(np.int64)}

    # Run ONNX inference
    ort_outs = onnx_session.run(None, ort_inputs)

    # In many single-file TTS ONNX exports the first output is the raw waveform
    audio_array = ort_outs[0]

    # Flatten in case it is multi-dimensional, to ensure a 1-D waveform
    audio_waveform = audio_array.flatten()

    # If it is a float waveform in [-1, 1], clip and convert to int16;
    # otherwise cast to int16 as a safeguard
    if np.issubdtype(audio_waveform.dtype, np.floating):
        audio_clip = np.clip(audio_waveform, -1.0, 1.0)
        audio_int16 = (audio_clip * 32767.0).astype(np.int16)
    else:
        audio_int16 = audio_waveform.astype(np.int16)

    # Compose an output filename if none was given
    if out_path is None:
        out_path = os.path.join(OUT_DIR, f"salama_tts_{abs(hash(text)) & 0xFFFF_FFFF}.wav")

    # Save with soundfile at 16 kHz, 16-bit PCM
    sf.write(out_path, audio_int16, samplerate=OUTPUT_SAMPLE_RATE, subtype="PCM_16")
    return out_path

if __name__ == "__main__":
    # Example usage
    sess = create_onnx_session(TTS_ONNX_MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(TTS_TOKENIZER_ID)

    example_text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
    out_wav = generate_speech_from_onnx(example_text, sess, tokenizer)
    print("Saved synthesized audio to:", out_wav)
```

**Example Output:**
> *Audio plays:* "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
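
If you have not exported an ONNX file, the fine-tuned weights can also be run directly through the Transformers VITS implementation. A minimal PyTorch-only sketch; `MODEL_ID` below points at the base checkpoint as a placeholder, so substitute this repository's id for the fine-tuned voice:

```python
# PyTorch-only alternative to the ONNX path above.
import soundfile as sf
import torch
from transformers import AutoTokenizer, VitsModel

MODEL_ID = "facebook/mms-tts-swh"  # placeholder: use the fine-tuned repo id

model = VitsModel.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # VitsModel returns the raw waveform, float32 in [-1, 1]
    waveform = model(**inputs).waveform

sf.write("salama_tts_example.wav",
         waveform.squeeze().numpy(),
         samplerate=model.config.sampling_rate)  # 16 kHz for MMS-TTS
```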

---

## Key Features

- **Natural Swahili speech generation**
- **Adapted for African tonal variations**
- **High clarity and rhythm**
- **Fast inference with FP16 precision**
- **Compatible with SALAMA-STT and SALAMA-LLM**