# Spark-TTS for Ugandan Languages (SALT)
A text-to-speech model fine-tuned from Spark-TTS-0.5B for seven languages widely spoken in Uganda: Acholi, Ateso, English (Ugandan accent), Luganda, Lugbara, Runyankore, and Swahili.
## Architecture
- Base Model: Spark-TTS-0.5B (based on Qwen2.5-Instruct)
- Audio Codec: BiCodec tokenizer with a dual-token architecture
  - Global tokens: speaker characteristics and prosody (controllable)
  - Semantic tokens: linguistic content and phonetic structure
- Input: text with a speaker ID prefix, plus reference audio for voice cloning
- Output: discrete audio tokens → 24 kHz waveform via the BiCodec detokenizer
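Concretely, a TTS request is a text prompt and the model's continuation is a stream of discrete audio tokens. The sketch below is illustrative: the prompt template matches the Usage section, but the token IDs are made up and the `<|end_global_token|>`/`<|start_semantic_token|>` separators are taken from the base Spark-TTS format, so treat them as assumptions here.

```python
# Prompt fed to the LLM; the speaker ID ("248" = Luganda) prefixes the text
prompt = (
    "<|task_tts|>"
    "<|start_content|>248: Oli otya?<|end_content|>"
    "<|start_global_token|>"
)

# The model continues with discrete audio tokens, roughly:
#   <|bicodec_global_101|><|bicodec_global_7|>...   # speaker/prosody
#   <|end_global_token|><|start_semantic_token|>
#   <|bicodec_semantic_3021|>...                    # linguistic content
# The BiCodec detokenizer then turns these IDs into a 24 kHz waveform.
```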
## Training
Fine-tuned on the SALT studio dataset across 7 languages with speaker-aware conditioning. Training used full fine-tuning (not LoRA) in float32 precision with the following data mixture:
| Speaker ID | Language | Gender |
|---|---|---|
| 241 | Acholi | Female |
| 242 | Ateso | Female |
| 243 | Runyankore | Female |
| 245 | Lugbara | Female |
| 246 | Swahili | Male |
| 248 | Luganda | Female |
**Training Configuration:**
- Max sequence length: 2048 tokens
- Learning rate: 2e-4
- Optimizer: AdamW 8-bit
- Epochs: 1 (~12k samples)
- Batch size: 4 (gradient accumulation: 2)
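These hyperparameters map directly onto a Hugging Face `TrainingArguments` object; the sketch below is a minimal reconstruction under that assumption (dataset preparation, the trainer, and the collator are omitted, and `output_dir`/`logging_steps` are illustrative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="spark-tts-salt",     # illustrative
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size of 8
    learning_rate=2e-4,
    num_train_epochs=1,
    optim="adamw_8bit",              # AdamW 8-bit via bitsandbytes
    fp16=False,
    bf16=False,                      # training ran in float32
    logging_steps=10,                # illustrative
)
```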
Audio preprocessing included volume normalization, resampling to 24kHz, and Wav2Vec2 feature extraction (layers 11, 14, 16 averaged) for semantic tokenization.
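The layer-averaging step can be sketched as follows. This is illustrative only: the real extraction happens inside BiCodec, and the `facebook/wav2vec2-large-xlsr-53` checkpoint and 16 kHz input rate are assumptions, not confirmed details of this pipeline.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

def semantic_features(wav, sr=16000):
    """Average hidden states from layers 11, 14, and 16, as described above."""
    inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = w2v(**inputs, output_hidden_states=True).hidden_states
    # hidden[0] is the embedding output, so hidden[11] is transformer layer 11
    return torch.stack([hidden[11], hidden[14], hidden[16]]).mean(dim=0)
```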
## Usage

### Requirements
```bash
pip install unsloth transformers torch datasets soundfile librosa
pip install omegaconf einx einops torchaudio

# Clone the Spark-TTS repository for the BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
cd Spark-TTS && pip install -e .
```
The BiCodec tokenizer requires the original Spark-TTS repository and the base model weights (`unsloth/Spark-TTS-0.5B`) for audio encoding/decoding; this repository only contains the fine-tuned LLM weights.
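The base weights can be fetched with `huggingface_hub` (a sketch; `local_dir` just needs to match the path passed to `BiCodecTokenizer` below):

```python
from huggingface_hub import snapshot_download

snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
```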
```python
import torch
import re
import numpy as np
from unsloth import FastModel
from sparktts.models.audio_tokenizer import BiCodecTokenizer

# Load the fine-tuned LLM and its text tokenizer
model, tokenizer = FastModel.from_pretrained(
    "Sunbird/spark-tts-salt",
    max_seq_length=2048,
    dtype=torch.float32,
)
FastModel.for_inference(model)

# Initialize the audio tokenizer (requires the cloned Spark-TTS repo and
# the base weights downloaded to "Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")


@torch.inference_mode()
def generate_speech(text: str, temperature: float = 0.8,
                    top_k: int = 50, top_p: float = 1.0):
    """
    Generate speech from text with speaker control.

    Format: "{speaker_id}: {text}" (e.g., "248: Oli otya?")
    Returns the decoded waveform and the semantic token IDs.
    """
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>",
    ])
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    generated = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Decode only the newly generated tokens, keeping special tokens intact
    output = tokenizer.batch_decode(
        generated[:, inputs.input_ids.shape[1]:],
        skip_special_tokens=False,
    )[0]

    # Parse semantic tokens (linguistic content)
    semantic_ids = torch.tensor([
        int(m) for m in re.findall(r"<\|bicodec_semantic_(\d+)\|>", output)
    ]).long().unsqueeze(0).to("cuda")

    # Parse global tokens (speaker characteristics)
    global_ids = torch.tensor([
        int(m) for m in re.findall(r"<\|bicodec_global_(\d+)\|>", output)
    ]).long().unsqueeze(0).unsqueeze(0).to("cuda")

    # Decode the discrete tokens to a 24 kHz waveform
    wav = audio_tokenizer.detokenize(global_ids, semantic_ids)
    return wav, semantic_ids


# Generate speech (speaker 248, Luganda)
audio, _ = generate_speech("248: Oli otya? Nno nomuwoomera.")
```
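To save the result, write the waveform with `soundfile` (installed above); BiCodec decodes at 24 kHz:

```python
import soundfile as sf

# detokenize may return a (1, N) array; flatten before writing
sf.write("output.wav", np.asarray(audio).squeeze(), samplerate=24000)
```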
### Voice Cloning
To clone a voice from reference audio:
```python
# Extract speaker characteristics (global tokens) from a reference recording
ref_global_ids, _ = audio_tokenizer.tokenize("reference.wav")

# Generate new content, then re-render its semantic tokens with the cloned
# voice: swapping in the reference global tokens changes the speaker while
# keeping the linguistic content
text = "243: This is new content in the cloned voice."
_, semantic_ids = generate_speech(text)  # keep only the semantic tokens
audio = audio_tokenizer.detokenize(ref_global_ids, semantic_ids)
```
## Limitations
- Speaker consistency: Zero-shot voice cloning quality varies with reference audio quality and length
- Language mixing: Code-switching between languages may produce inconsistent prosody
- Out-of-distribution speakers: Performance degrades for voices significantly different from training speakers
- Audio length: Limited to ~8 seconds of audio context during training; longer utterances may be truncated
- Hardware: Requires CUDA for inference in float32 precision (4-bit/8-bit quantization has not been verified)
## Model Details
- Developed by: Sunbird AI
- Model type: Causal language model for discrete audio token prediction
- Parameters: 0.5B, fully fine-tuned (no adapters)
- License: MIT
- Repository: https://github.com/SparkAudio/Spark-TTS