# Spark-TTS for Ugandan Languages (SALT)
A text-to-speech model fine-tuned from Spark-TTS-0.5B for seven languages widely spoken in Uganda: Acholi, Ateso, English (Ugandan accent), Luganda, Lugbara, Runyankore, and Swahili.
## Architecture
- Base Model: Spark-TTS-0.5B (based on Qwen2.5-Instruct)
- Audio Codec: BiCodec tokenizer with a dual-token architecture
  - Global tokens: speaker characteristics and prosody (controllable)
  - Semantic tokens: linguistic content and phonetic structure
- Input: text with a speaker ID prefix, plus reference audio for voice cloning
- Output: discrete audio tokens → 24 kHz waveform via the BiCodec detokenizer
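Concretely, a TTS request is a text prompt and the model's continuation is a stream of discrete audio tokens. The sketch below is illustrative: the prompt template matches the Usage section, but the token IDs are made up and the `<|end_global_token|>`/`<|start_semantic_token|>` separators are taken from the base Spark-TTS format, so treat them as assumptions here.

```python
# Prompt fed to the LLM; the speaker ID ("248" = Luganda) prefixes the text
prompt = (
    "<|task_tts|>"
    "<|start_content|>248: Oli otya?<|end_content|>"
    "<|start_global_token|>"
)

# The model continues with discrete audio tokens, roughly:
#   <|bicodec_global_101|><|bicodec_global_7|>...   # speaker/prosody
#   <|end_global_token|><|start_semantic_token|>
#   <|bicodec_semantic_3021|>...                    # linguistic content
# The BiCodec detokenizer then turns these IDs into a 24 kHz waveform.
```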
## Training
Fine-tuned on the SALT studio dataset across 7 languages with speaker-aware conditioning. Training used full fine-tuning (not LoRA) in float32 precision with the following data mixture:
| Speaker ID | Language | Gender |
|---|---|---|
| 241 | Acholi | Female |
| 242 | Ateso | Female |
| 243 | Runyankore | Female |
| 245 | Lugbara | Female |
| 246 | Swahili | Male |
| 248 | Luganda | Female |
**Training Configuration:**
- Max sequence length: 2048 tokens
- Learning rate: 2e-4
- Optimizer: AdamW 8-bit
- Epochs: 1 (~12k samples)
- Batch size: 4 (gradient accumulation: 2)
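These hyperparameters map directly onto a Hugging Face `TrainingArguments` object; the sketch below is a minimal reconstruction under that assumption (dataset preparation, the trainer, and the collator are omitted, and `output_dir`/`logging_steps` are illustrative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="spark-tts-salt",     # illustrative
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size of 8
    learning_rate=2e-4,
    num_train_epochs=1,
    optim="adamw_8bit",              # AdamW 8-bit via bitsandbytes
    fp16=False,
    bf16=False,                      # training ran in float32
    logging_steps=10,                # illustrative
)
```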
Audio preprocessing included volume normalization, resampling to 24kHz, and Wav2Vec2 feature extraction (layers 11, 14, 16 averaged) for semantic tokenization.
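The layer-averaging step can be sketched as follows. This is illustrative only: the real extraction happens inside BiCodec, and the `facebook/wav2vec2-large-xlsr-53` checkpoint and 16 kHz input rate are assumptions, not confirmed details of this pipeline.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

def semantic_features(wav, sr=16000):
    """Average hidden states from layers 11, 14, and 16, as described above."""
    inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = w2v(**inputs, output_hidden_states=True).hidden_states
    # hidden[0] is the embedding output, so hidden[11] is transformer layer 11
    return torch.stack([hidden[11], hidden[14], hidden[16]]).mean(dim=0)
```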
## Usage

### Requirements
```bash
pip install unsloth transformers torch datasets soundfile librosa
pip install omegaconf einx einops torchaudio

# Clone the Spark-TTS repository for the BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
cd Spark-TTS && pip install -e .
```
The BiCodec tokenizer requires the original Spark-TTS repository and the base model weights (`unsloth/Spark-TTS-0.5B`) for audio encoding/decoding; this repository only contains the fine-tuned LLM weights.
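The base weights can be fetched with `huggingface_hub` (a sketch; `local_dir` just needs to match the path passed to `BiCodecTokenizer` below):

```python
from huggingface_hub import snapshot_download

snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
```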
```python
import torch
import re
import numpy as np
from unsloth import FastModel
from sparktts.models.audio_tokenizer import BiCodecTokenizer

# Load the fine-tuned LLM and its text tokenizer
model, tokenizer = FastModel.from_pretrained(
    "Sunbird/spark-tts-salt",
    max_seq_length=2048,
    dtype=torch.float32,
)
FastModel.for_inference(model)

# Initialize the audio tokenizer (requires the cloned Spark-TTS repo and
# the base weights downloaded to "Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")


@torch.inference_mode()
def generate_speech(text: str, temperature: float = 0.8,
                    top_k: int = 50, top_p: float = 1.0):
    """
    Generate speech from text with speaker control.

    Format: "{speaker_id}: {text}" (e.g., "248: Oli otya?")
    Returns the decoded waveform and the semantic token IDs.
    """
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>",
    ])
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    generated = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Decode only the newly generated tokens, keeping special tokens intact
    output = tokenizer.batch_decode(
        generated[:, inputs.input_ids.shape[1]:],
        skip_special_tokens=False,
    )[0]

    # Parse semantic tokens (linguistic content)
    semantic_ids = torch.tensor([
        int(m) for m in re.findall(r"<\|bicodec_semantic_(\d+)\|>", output)
    ]).long().unsqueeze(0).to("cuda")

    # Parse global tokens (speaker characteristics)
    global_ids = torch.tensor([
        int(m) for m in re.findall(r"<\|bicodec_global_(\d+)\|>", output)
    ]).long().unsqueeze(0).unsqueeze(0).to("cuda")

    # Decode the discrete tokens to a 24 kHz waveform
    wav = audio_tokenizer.detokenize(global_ids, semantic_ids)
    return wav, semantic_ids


# Generate speech (speaker 248, Luganda)
audio, _ = generate_speech("248: Oli otya? Nno nomuwoomera.")
```
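To save the result, write the waveform with `soundfile` (installed above); BiCodec decodes at 24 kHz:

```python
import soundfile as sf

# detokenize may return a (1, N) array; flatten before writing
sf.write("output.wav", np.asarray(audio).squeeze(), samplerate=24000)
```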
### Voice Cloning
To clone a voice from reference audio:
```python
# Extract speaker characteristics (global tokens) from a reference recording
ref_global_ids, _ = audio_tokenizer.tokenize("reference.wav")

# Generate new content, then re-render its semantic tokens with the cloned
# voice: swapping in the reference global tokens changes the speaker while
# keeping the linguistic content
text = "243: This is new content in the cloned voice."
_, semantic_ids = generate_speech(text)  # keep only the semantic tokens
audio = audio_tokenizer.detokenize(ref_global_ids, semantic_ids)
```
## Limitations
- Speaker consistency: Zero-shot voice cloning quality varies with reference audio quality and length
- Language mixing: Code-switching between languages may produce inconsistent prosody
- Out-of-distribution speakers: Performance degrades for voices significantly different from training speakers
- Audio length: Limited to ~8 seconds of audio context during training; longer utterances may be truncated
- Hardware: Requires CUDA for inference in float32 precision (4-bit/8-bit quantization has not been verified)
## Model Details
- Developed by: Sunbird AI
- Model type: Causal language model for discrete audio token prediction
- Parameters: 0.5B, fully fine-tuned (no adapters)
- License: MIT
- Repository: https://github.com/SparkAudio/Spark-TTS