
Spark-TTS for Ugandan Languages (SALT)

Text-to-speech model fine-tuned from Spark-TTS-0.5B for seven languages widely spoken in Uganda: Acholi, Ateso, English (Ugandan accent), Luganda, Lugbara, Runyankore, and Swahili.

Architecture

  • Base Model: Spark-TTS-0.5B (based on Qwen2.5-Instruct)
  • Audio Codec: BiCodec tokenizer with dual-token architecture
    • Global tokens: Speaker characteristics and prosody (controllable)
    • Semantic tokens: Linguistic content and phonetic structure
  • Input: Text with speaker ID prefix + reference audio for voice cloning
  • Output: Discrete audio tokens → 24kHz waveform via BiCodec detokenizer

Training

Fine-tuned on the SALT studio dataset across 7 languages with speaker-aware conditioning. Training used full fine-tuning (not LoRA) in float32 precision with the following data mixture:

Speaker ID   Language     Gender
241          Acholi       Female
242          Ateso        Female
243          Runyankore   Female
245          Lugbara      Female
246          Swahili      Male
248          Luganda      Female
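For building prompts, the table above can be expressed as a small lookup. This helper is a convenience sketch, not part of the released code:

```python
# Speaker ID -> (language, gender), taken from the training data table.
SPEAKERS = {
    241: ("Acholi", "Female"),
    242: ("Ateso", "Female"),
    243: ("Runyankore", "Female"),
    245: ("Lugbara", "Female"),
    246: ("Swahili", "Male"),
    248: ("Luganda", "Female"),
}

def make_prompt(speaker_id: int, text: str) -> str:
    """Prefix text with a known speaker ID, e.g. '248: Oli otya?'."""
    assert speaker_id in SPEAKERS, f"unknown speaker {speaker_id}"
    return f"{speaker_id}: {text}"
```

For example, `make_prompt(248, "Oli otya?")` yields the `"248: Oli otya?"` format expected by the model.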

Training Configuration:

  • Max sequence length: 2048 tokens
  • Learning rate: 2e-4
  • Optimizer: AdamW 8-bit
  • Epochs: 1 (~12k samples)
  • Batch size: 4 (gradient accumulation: 2)
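Collected in one place, the hyperparameters above look like this (the key names follow `transformers.TrainingArguments` conventions as an assumption; the actual trainer wiring is not shown in the source):

```python
# Training hyperparameters from the list above.
TRAIN_CONFIG = {
    "max_seq_length": 2048,
    "learning_rate": 2e-4,
    "optim": "adamw_8bit",             # AdamW 8-bit
    "num_train_epochs": 1,             # ~12k samples
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 2,
}

# Effective batch size = per-device batch * accumulation steps
effective_batch = (TRAIN_CONFIG["per_device_train_batch_size"]
                   * TRAIN_CONFIG["gradient_accumulation_steps"])  # 8
```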

Audio preprocessing included volume normalization, resampling to 24kHz, and Wav2Vec2 feature extraction (layers 11, 14, 16 averaged) for semantic tokenization.
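The layer-averaging step can be sketched as follows. The hidden states here are simulated with random arrays; in the real pipeline they would come from a Wav2Vec2 forward pass run with `output_hidden_states=True` (an assumption about the implementation):

```python
import numpy as np

def average_feature_layers(hidden_states, layers=(11, 14, 16)):
    """Average selected Wav2Vec2 hidden-state layers into one feature map.

    hidden_states: sequence of arrays, one per layer, each (frames, dim).
    """
    return np.mean([hidden_states[i] for i in layers], axis=0)

# Simulated stand-in for a Wav2Vec2 forward pass (25 layers of features)
rng = np.random.default_rng(0)
states = [rng.standard_normal((50, 768)) for _ in range(25)]
features = average_feature_layers(states)
# features has the same (frames, dim) shape as a single layer
```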

Usage

Requirements

pip install unsloth transformers torch datasets soundfile librosa
pip install omegaconf einx einops torchaudio

# Clone Spark-TTS repository for BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
cd Spark-TTS && pip install -e .

The BiCodec tokenizer requires the original Spark-TTS repository and base model weights (unsloth/Spark-TTS-0.5B) for audio encoding/decoding; this repository contains only the fine-tuned LLM weights.

import torch
import re
import numpy as np
from unsloth import FastModel
from sparktts.models.audio_tokenizer import BiCodecTokenizer

# Load model and tokenizer
model, tokenizer = FastModel.from_pretrained(
    "Sunbird/spark-tts-salt",
    max_seq_length=2048,
    dtype=torch.float32
)
FastModel.for_inference(model)

# Initialize audio tokenizer (requires Spark-TTS repo)
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

@torch.inference_mode()
def generate_speech(text: str, temperature: float = 0.8, 
                   top_k: int = 50, top_p: float = 1.0) -> np.ndarray:
    """
    Generate speech from text with speaker control.
    Format: "{speaker_id}: {text}" (e.g., "248: Oli otya?")
    """
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>"
    ])
    
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    
    generated = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id
    )
    
    # Extract generated tokens
    output = tokenizer.batch_decode(
        generated[:, inputs.input_ids.shape[1]:], 
        skip_special_tokens=False
    )[0]
    
    # Parse semantic tokens
    semantic_ids = torch.tensor([
        int(m) for m in re.findall(r"<\|bicodec_semantic_(\d+)\|>", output)
    ]).long().unsqueeze(0)
    
    # Parse global tokens (speaker characteristics)
    global_ids = torch.tensor([
        int(m) for m in re.findall(r"<\|bicodec_global_(\d+)\|>", output)
    ]).long().unsqueeze(0).unsqueeze(0)
    
    # Decode to audio
    return audio_tokenizer.detokenize(global_ids, semantic_ids)

# Generate speech
audio = generate_speech("248: Oli otya? Nno nomuwoomera.")  # Luganda greeting
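The returned waveform is a float array sampled at 24 kHz. With `soundfile` (listed in the requirements) it could be saved via `sf.write("output.wav", audio, 24000)`; the sketch below uses only the standard-library `wave` module so it is self-contained:

```python
import wave
import numpy as np

def save_wav(path: str, audio: np.ndarray, sample_rate: int = 24000):
    """Write a float waveform in [-1, 1] to a 16-bit mono WAV file."""
    pcm = np.clip(audio, -1.0, 1.0)
    pcm = (pcm * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())

# save_wav("output.wav", audio)
```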

Voice Cloning

To clone a voice from reference audio:

# Extract speaker characteristics (global tokens) from reference audio
ref_global_ids, _ = audio_tokenizer.tokenize('reference.wav')

# Generate new speech with the cloned voice. Note: generate_speech() above
# returns a finished waveform, so cloning needs a variant that returns the
# parsed (global_ids, semantic_ids) instead of decoding them.
text = "243: This is new content in the cloned voice."
_, semantic_ids = generate_speech_tokens(text)  # hypothetical token-returning variant
audio = audio_tokenizer.detokenize(ref_global_ids, semantic_ids)

Limitations

  • Speaker consistency: Zero-shot voice cloning quality varies with reference audio quality and length
  • Language mixing: Code-switching between languages may produce inconsistent prosody
  • Out-of-distribution speakers: Performance degrades for voices significantly different from training speakers
  • Audio length: Limited to ~8 seconds of audio context during training; longer utterances may truncate
  • Hardware: Requires CUDA for inference; float32 precision needed (no 4-bit/8-bit support verified)
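To work around the short audio context, long inputs can be split at sentence boundaries and synthesized chunk by chunk. This regex-based splitter is a workaround sketch, not part of the released code, and the character budget is a rough heuristic:

```python
import re

def split_sentences(text: str, max_chars: int = 120):
    """Split text on sentence-ending punctuation so each chunk stays
    within the model's short audio context."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately (with the same speaker prefix) and the resulting waveforms concatenated, e.g. with `np.concatenate`.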

Model Details

  • Developed by: Sunbird AI
  • Model type: Causal language model for discrete audio token prediction
  • Parameters: 0.5B (active), full fine-tuning
  • License: MIT
  • Repository: https://github.com/SparkAudio/Spark-TTS