---
license: apache-2.0
language:
- sw
base_model:
- facebook/mms-tts-swh
pipeline_tag: text-to-speech
datasets:
- mozilla-foundation/common_voice_17_0
metrics:
- wer
tags:
- text-to-speech
- audio
- speech
- transformers
- vits
- swahili
---

# 🔊 SALAMA-TTS: Swahili Text-to-Speech Model

**Developer:** AI4NNOV
**Version:** v1.0
**License:** Apache 2.0
**Model Type:** Text-to-Speech (TTS)
**Base Model:** `facebook/mms-tts-swh` (fine-tuned)

---

## 🌍 Overview

**SALAMA-TTS** is the **speech synthesis module** of the **SALAMA Framework**, a complete end-to-end **Speech-to-Speech AI system** for African languages.
It generates **natural, high-quality Swahili speech** from text and integrates seamlessly with **SALAMA-LLM** and **SALAMA-STT** for conversational voice assistants.

The model is based on **Meta's MMS (Massively Multilingual Speech)** TTS architecture using the **VITS framework**, fine-tuned for natural prosody, tone, and rhythm in Swahili.

---

## 🧱 Model Architecture

SALAMA-TTS is built on the **VITS architecture**, which combines a **variational autoencoder (VAE)** with **adversarial (GAN) training** for realistic and expressive speech synthesis.

| Parameter | Value |
|------------|--------|
| Base Model | `facebook/mms-tts-swh` |
| Fine-Tuning | 8-bit quantized, LoRA fine-tuning |
| Optimizer | AdamW |
| Learning Rate | 2e-5 |
| Epochs | 20 |
| Sampling Rate | 16 kHz |
| Frameworks | Transformers + Datasets + PyTorch |
| Language | Swahili (`sw`) |
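
For quick experimentation, a transformers-compatible VITS checkpoint can also be run directly in PyTorch, without the ONNX export used in the usage section below. A minimal sketch, assuming the checkpoint loads with `transformers`' `VitsModel` (the repo id shown is the base model; substitute the fine-tuned SALAMA-TTS weights where available):

```python
# Minimal PyTorch inference sketch for the VITS architecture described above.
# "facebook/mms-tts-swh" is the base checkpoint, used here as a stand-in.
import torch
import soundfile as sf
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-swh")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-swh")

inputs = tokenizer("Habari ya leo!", return_tensors="pt")
with torch.no_grad():
    # VitsModel returns the waveform directly: (batch, samples), float32 in [-1, 1]
    waveform = model(**inputs).waveform

sf.write("habari.wav", waveform.squeeze().numpy(), model.config.sampling_rate)
```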

---

## 📚 Dataset

| Dataset | Description | Purpose |
|----------|--------------|----------|
| `common_voice_17_0` | Swahili voice dataset by Mozilla | Base training |
| Custom Swahili speech corpus | Locally recorded sentences and dialogues | Fine-tuning naturalness |
| Common Voice Swahili (test split) | Held-out test data | Evaluation |
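
For reference, the Common Voice split can be pulled with the `datasets` library. A short sketch (the dataset is gated on the Hub, so its terms must be accepted and you must be logged in; the resampling matches the model's 16 kHz rate):

```python
# Sketch: load the Swahili ("sw") split of Common Voice 17.0 at 16 kHz.
# Gated dataset -- accept the terms on the Hub and run `huggingface-cli login` first.
from datasets import load_dataset, Audio

cv_sw = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train")
cv_sw = cv_sw.cast_column("audio", Audio(sampling_rate=16_000))  # match the model's rate

print(cv_sw[0]["sentence"])  # transcript paired with the first audio clip
```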

---

## 🧠 Model Capabilities

- Converts **Swahili text to natural-sounding speech**
- Handles **both formal and conversational** tones
- High clarity and prosody for long-form speech
- Seamless integration with **SALAMA-LLM** responses
- Output format: **16-bit PCM WAV**

---

## 📊 Evaluation Metrics

| Metric | Score | Description |
|---------|-------|-------------|
| **MOS (Mean Opinion Score)** | **4.05 / 5.0** | Human-rated naturalness |
| **WER (generated audio → STT)** | **0.21** | Evaluated by re-transcribing synthesized audio |

> The MOS was rated by 12 native Swahili speakers on clarity, tone, and pronunciation.
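
The round-trip WER is obtained by transcribing the synthesized audio with a speech-to-text model and comparing the transcript against the input text. A sketch of that loop using `jiwer` (the Whisper checkpoint and the WAV path are illustrative stand-ins, not the exact setup behind the 0.21 score):

```python
# Round-trip WER sketch: synthesize -> re-transcribe -> compare.
# openai/whisper-small is a stand-in STT model; SALAMA-STT or any Swahili ASR
# system could be used instead. Requires: pip install jiwer
from transformers import pipeline
from jiwer import wer

stt = pipeline("automatic-speech-recognition", model="openai/whisper-small")

reference = "Karibu kwenye mfumo wa SALAMA."                 # text fed to the TTS model
hypothesis = stt("tts_outputs/salama_tts_demo.wav")["text"]  # hypothetical output file

print(f"Round-trip WER: {wer(reference.lower(), hypothesis.lower()):.2f}")
```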

---

## ⚙️ Usage (Python Example)

```python
# Requirements:
#   pip install onnxruntime soundfile transformers numpy
# For GPU inference: pip install onnxruntime-gpu (and ensure the CUDA toolkit is available)

import os

import numpy as np
import onnxruntime
import soundfile as sf
from transformers import AutoTokenizer

TTS_ONNX_MODEL_PATH = "swahili_tts.onnx"   # path to your .onnx file
TTS_TOKENIZER_ID = "facebook/mms-tts-swh"  # or whichever tokenizer you used
OUTPUT_SAMPLE_RATE = 16000
OUT_DIR = "tts_outputs"
os.makedirs(OUT_DIR, exist_ok=True)


def create_onnx_session(onnx_path: str) -> onnxruntime.InferenceSession:
    """Create an ONNX Runtime session using CUDA if available, otherwise CPU."""
    if "CUDAExecutionProvider" in onnxruntime.get_available_providers():
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        print("Using CUDAExecutionProvider for ONNX Runtime.")
    else:
        providers = ["CPUExecutionProvider"]
        print("CUDA not available; using CPUExecutionProvider for ONNX Runtime.")
    return onnxruntime.InferenceSession(onnx_path, providers=providers)


def generate_speech_from_onnx(text: str,
                              onnx_session: onnxruntime.InferenceSession,
                              tokenizer,
                              out_path: str = None) -> str:
    """
    Synthesize speech from text using an ONNX TTS model.
    Returns the path to a WAV file (16 kHz, int16).
    """
    if not text:
        raise ValueError("Empty text provided.")

    # Tokenize to NumPy inputs (match what the ONNX model expects).
    # NOTE: many TTS tokenizers return {"input_ids": np.array(...)} -- adapt if yours differs.
    inputs = tokenizer(text, return_tensors="np", padding=True)

    # Identify the ONNX input name (assume the first input) and build the feed dict.
    input_name = onnx_session.get_inputs()[0].name
    ort_inputs = {input_name: inputs["input_ids"].astype(np.int64)}

    # Run ONNX inference.
    ort_outs = onnx_session.run(None, ort_inputs)

    # The model should return a raw waveform, or a float array convertible to one.
    # In many single-file TTS ONNX exports the first output is the waveform.
    audio_waveform = ort_outs[0].flatten()  # ensure a 1-D waveform

    # If it is a float waveform in [-1, 1], convert to int16; otherwise cast as a safeguard.
    if np.issubdtype(audio_waveform.dtype, np.floating):
        audio_clip = np.clip(audio_waveform, -1.0, 1.0)
        audio_int16 = (audio_clip * 32767.0).astype(np.int16)
    else:
        audio_int16 = audio_waveform.astype(np.int16)

    # Compose the output filename.
    if out_path is None:
        out_path = os.path.join(OUT_DIR, f"salama_tts_{abs(hash(text)) & 0xFFFF_FFFF}.wav")

    # Save with soundfile (16 kHz, 16-bit PCM).
    sf.write(out_path, audio_int16, samplerate=OUTPUT_SAMPLE_RATE, subtype="PCM_16")
    return out_path


if __name__ == "__main__":
    # Example usage
    sess = create_onnx_session(TTS_ONNX_MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(TTS_TOKENIZER_ID)

    # "Welcome to the SALAMA system, which generates natural Swahili speech."
    example_text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
    out_wav = generate_speech_from_onnx(example_text, sess, tokenizer)
    print("Saved synthesized audio to:", out_wav)
```

**Example Output:**
> *Audio plays:* “Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili.” (*"Welcome to the SALAMA system, which generates natural Swahili speech."*)
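
The usage example assumes a `swahili_tts.onnx` file already exists. One way to produce such a file from a transformers VITS checkpoint is a standard `torch.onnx.export`, sketched below; the wrapper, output name, and opset are illustrative choices, not part of this repo, and the stochastic duration predictor may need extra care (e.g. fixed noise scales) in some export setups:

```python
# Sketch: export a transformers VITS checkpoint to ONNX for the usage example above.
# The wrapper returns only the waveform so the exported graph has a single output.
import torch
from transformers import VitsModel, AutoTokenizer

class VitsOnnxWrapper(torch.nn.Module):
    def __init__(self, vits):
        super().__init__()
        self.vits = vits

    def forward(self, input_ids):
        return self.vits(input_ids=input_ids).waveform

model_id = "facebook/mms-tts-swh"  # or your fine-tuned checkpoint
wrapper = VitsOnnxWrapper(VitsModel.from_pretrained(model_id)).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

dummy = tokenizer("habari", return_tensors="pt")
torch.onnx.export(
    wrapper,
    (dummy["input_ids"],),
    "swahili_tts.onnx",
    input_names=["input_ids"],
    output_names=["waveform"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                  "waveform": {0: "batch", 1: "samples"}},
    opset_version=17,
)
```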

---

## ⚡ Key Features

- 🗣️ **Natural Swahili speech generation**
- 🌍 **Adapted for African tonal variations**
- 🔉 **High clarity and rhythm**
- ⚙️ **Fast inference with FP16 precision**
- 🔗 **Compatible with SALAMA-STT and SALAMA-LLM**