|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- sw |
|
|
base_model: |
|
|
- facebook/mms-tts-swh
|
|
pipeline_tag: text-to-speech |
|
|
datasets: |
|
|
- mozilla-foundation/common_voice_17_0 |
|
|
metrics: |
|
|
- wer |
|
|
tags: |
|
|
- text-to-speech |
|
|
- audio |
|
|
- speech |
|
|
- transformers |
|
|
- vits |
|
|
- swahili |
|
|
--- |
|
|
|
|
|
|
|
|
# SALAMA-TTS: Swahili Text-to-Speech Model
|
|
|
|
|
**Developer:** AI4NNOV |
|
|
|
|
|
**Version:** v1.0 |
|
|
**License:** Apache 2.0 |
|
|
**Model Type:** Text-to-Speech (TTS) |
|
|
**Base Model:** `facebook/mms-tts-swh` (fine-tuned) |
|
|
|
|
|
--- |
|
|
|
|
|
## Overview
|
|
|
|
|
**SALAMA-TTS** is the **speech synthesis module** of the **SALAMA Framework**, a complete end-to-end **Speech-to-Speech AI system** for African languages. |
|
|
It generates **natural, high-quality Swahili speech** from text and integrates seamlessly with **SALAMA-LLM** and **SALAMA-STT** for conversational voice assistants. |
|
|
|
|
|
The model is based on **Meta's MMS (Massively Multilingual Speech)** TTS architecture using the **VITS framework**, fine-tuned for natural prosody, tone, and rhythm in Swahili.
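
Because SALAMA-TTS follows the standard MMS/VITS interface, a minimal PyTorch sketch with the upstream `facebook/mms-tts-swh` checkpoint and the standard `transformers` VITS API looks like the following; substitute your local SALAMA-TTS checkpoint path if you have one (the ONNX-based pipeline is shown in the Usage section below).

```python
# Minimal sketch: direct PyTorch inference with the upstream MMS-TTS VITS checkpoint.
# Swap the model id for a local SALAMA-TTS checkpoint if available.
import torch
import soundfile as sf
from transformers import VitsModel, AutoTokenizer

model_id = "facebook/mms-tts-swh"
model = VitsModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Habari ya leo?", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # (batch, samples), float in [-1, 1]

sf.write("habari.wav", waveform.squeeze().numpy(), model.config.sampling_rate)
```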
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture
|
|
|
|
|
SALAMA-TTS is built on the **VITS architecture**, an end-to-end model that combines a **conditional variational autoencoder (VAE)** with **adversarial (GAN-style) training** for realistic and expressive speech synthesis.
|
|
|
|
|
| Parameter | Value | |
|
|
|------------|--------| |
|
|
| Base Model | `facebook/mms-tts-swh` | |
|
|
| Fine-Tuning | 8-bit quantized, LoRA fine-tuning | |
|
|
| Optimizer | AdamW | |
|
|
| Learning Rate | 2e-5 | |
|
|
| Epochs | 20 | |
|
|
| Sampling Rate | 16 kHz |
|
|
| Frameworks | Transformers + Datasets + PyTorch | |
|
|
| Language | Swahili (`sw`) | |
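
For readers who want to reproduce a comparable setup, the sketch below illustrates the LoRA configuration from the table using `peft`. It is not the exact SALAMA training script: the LoRA rank, alpha, and `target_modules` names are assumptions and should be checked against the module names in your VITS checkpoint, and true 8-bit loading would additionally require `bitsandbytes`.

```python
# Illustrative LoRA setup (not the exact SALAMA training script).
import torch
from transformers import VitsModel
from peft import LoraConfig, get_peft_model

model = VitsModel.from_pretrained("facebook/mms-tts-swh")

lora_config = LoraConfig(
    r=8,               # assumed rank; not stated in the table
    lora_alpha=16,     # assumed scaling; not stated in the table
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # matches the table above
```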
|
|
|
|
|
--- |
|
|
|
|
|
## Dataset
|
|
|
|
|
| Dataset | Description | Purpose | |
|
|
|----------|--------------|----------| |
|
|
| `common_voice_17_0` | Swahili voice dataset by Mozilla | Base training | |
|
|
| Custom Swahili speech corpus | Locally recorded sentences and dialogues | Fine-tuning naturalness | |
|
|
| Common Voice Swahili (test split) | Held-out test portion of `common_voice_17_0` | Evaluation |
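
As a reference, a minimal sketch for loading the Swahili split of Common Voice 17.0 with `datasets` is shown below. The dataset is gated on the Hugging Face Hub, so you must be authenticated and have accepted its terms; depending on your `datasets` version, `trust_remote_code=True` may also be required.

```python
# Sketch: load the Swahili subset of Common Voice 17.0 and resample to 16 kHz.
# Requires prior authentication (e.g. `huggingface-cli login`) and accepted terms.
from datasets import load_dataset, Audio

cv_sw = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train")
cv_sw = cv_sw.cast_column("audio", Audio(sampling_rate=16_000))  # match the 16 kHz target

print(cv_sw[0]["sentence"])
```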
|
|
|
|
|
--- |
|
|
|
|
|
## Model Capabilities
|
|
|
|
|
- Converts **Swahili text to natural-sounding speech** |
|
|
- Handles **both formal and conversational** registers
|
|
- High clarity and prosody for long-form speech |
|
|
- Seamless integration with **SALAMA-LLM** responses |
|
|
- Output format: **16-bit PCM WAV** |
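
A quick way to confirm the output format is to inspect a generated file's header with `soundfile`; the filename below is hypothetical, so use a file produced by the Usage example further down.

```python
# Verify the synthesized WAV is 16 kHz, 16-bit PCM (filename is hypothetical).
import soundfile as sf

info = sf.info("tts_outputs/salama_tts_example.wav")
print(info.samplerate, info.subtype)  # expect: 16000 PCM_16
```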
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation Metrics
|
|
|
|
|
| Metric | Score | Description | |
|
|
|---------|-------|-------------| |
|
|
| **MOS (Mean Opinion Score)** | **4.05 / 5.0** | Human-rated naturalness | |
|
|
| **WER (TTS → STT)** | **0.21** | Evaluated by re-transcribing synthesized audio |
|
|
|
|
|
> The MOS was rated by 12 native Swahili speakers on clarity, tone, and pronunciation.
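
To reproduce a comparable round-trip WER check, re-transcribe synthesized audio with any Swahili-capable STT model and score it with `jiwer`. The sketch below uses `openai/whisper-small` as a stand-in scorer; this is an assumption, not necessarily the model behind the reported 0.21.

```python
# Round-trip WER sketch: score a re-transcription of synthesized audio with jiwer.
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

reference = "Karibu kwenye mfumo wa SALAMA."
hypothesis = asr("salama_tts_sample.wav")["text"]  # WAV synthesized from `reference`

print("Round-trip WER:", jiwer.wer(reference.lower(), hypothesis.lower()))
```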
|
|
|
|
|
--- |
|
|
|
|
|
## Usage (Python Example)
|
|
|
|
|
```python
# Requirements:
#   pip install onnxruntime soundfile transformers numpy
#   For GPU inference: pip install onnxruntime-gpu (and ensure a compatible CUDA toolkit is installed)

import os
from typing import Optional

import numpy as np
import onnxruntime
import soundfile as sf
from transformers import AutoTokenizer

TTS_ONNX_MODEL_PATH = "swahili_tts.onnx"    # path to your .onnx file
TTS_TOKENIZER_ID = "facebook/mms-tts-swh"   # or whichever tokenizer you used
OUTPUT_SAMPLE_RATE = 16000
OUT_DIR = "tts_outputs"
os.makedirs(OUT_DIR, exist_ok=True)


def create_onnx_session(onnx_path: str) -> onnxruntime.InferenceSession:
    """Create an ONNX Runtime session using the GPU if available, otherwise the CPU."""
    if "CUDAExecutionProvider" in onnxruntime.get_available_providers():
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        print("Using CUDAExecutionProvider for ONNX Runtime.")
    else:
        providers = ["CPUExecutionProvider"]
        print("CUDA not available; using CPUExecutionProvider for ONNX Runtime.")
    return onnxruntime.InferenceSession(onnx_path, providers=providers)


def generate_speech_from_onnx(text: str,
                              onnx_session: onnxruntime.InferenceSession,
                              tokenizer,
                              out_path: Optional[str] = None) -> str:
    """
    Synthesize speech from text using an ONNX TTS model.
    Returns the path to a 16 kHz, 16-bit PCM WAV file.
    """
    if not text:
        raise ValueError("Empty text provided.")

    # Tokenize to NumPy inputs (match what the ONNX model expects).
    # NOTE: many TTS tokenizers return {"input_ids": np.array(...)}; adapt if yours differs.
    inputs = tokenizer(text, return_tensors="np", padding=True)

    # Identify the ONNX input name (assume the first input) and build the feed dict.
    input_name = onnx_session.get_inputs()[0].name
    ort_inputs = {input_name: inputs["input_ids"].astype(np.int64)}

    # Run inference. In many single-file TTS ONNX exports the first output is the
    # raw waveform, or a float array convertible to one.
    ort_outs = onnx_session.run(None, ort_inputs)
    audio_array = ort_outs[0]

    # Flatten in case the output is multi-dimensional, to get a 1-D waveform.
    audio_waveform = audio_array.flatten()

    # If it is a float waveform in [-1, 1], clip and convert to int16;
    # otherwise cast to int16 as a safeguard.
    if np.issubdtype(audio_waveform.dtype, np.floating):
        audio_clip = np.clip(audio_waveform, -1.0, 1.0)
        audio_int16 = (audio_clip * 32767.0).astype(np.int16)
    else:
        audio_int16 = audio_waveform.astype(np.int16)

    # Compose an output filename if none was given.
    if out_path is None:
        out_path = os.path.join(OUT_DIR, f"salama_tts_{abs(hash(text)) & 0xFFFF_FFFF}.wav")

    # Save as 16 kHz, 16-bit PCM WAV.
    sf.write(out_path, audio_int16, samplerate=OUTPUT_SAMPLE_RATE, subtype="PCM_16")
    return out_path


if __name__ == "__main__":
    # Example usage
    sess = create_onnx_session(TTS_ONNX_MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(TTS_TOKENIZER_ID)

    example_text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
    out_wav = generate_speech_from_onnx(example_text, sess, tokenizer)
    print("Saved synthesized audio to:", out_wav)
```
|
|
|
|
|
**Example Output:** |
|
|
> *Audio plays:* "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
|
|
|
|
|
--- |
|
|
|
|
|
## Key Features
|
|
|
|
|
- **Natural Swahili speech generation**

- **Adapted for African tonal variations**

- **High clarity and rhythm**

- **Fast inference with FP16 precision**

- **Compatible with SALAMA-STT and SALAMA-LLM**
|
|
|
|
|
|
|
|
|