Spark-TTS 0.5B — Hindi Fine-Tuned (v4)

Fine-tuned SparkAudio/Spark-TTS-0.5B for Hindi text-to-speech with zero-shot voice cloning.

Model Description

  • Architecture: Qwen2.5-0.5B backbone + BiCodec audio tokenizer
  • Fine-tuning: Full-parameter fine-tuning (not LoRA) using Unsloth on Hindi speech data
  • Training data: 37,052 samples from IndicVoices_R (34,247 single-text + 2,805 voice-cloning pairs)
  • Languages: Hindi (primary), English, Chinese (inherited from base model)
  • Audio: 16kHz, mono
  • Parameters: ~500M (LLM) + 597M (BiCodec)

Training Details

| Parameter | Value |
|---|---|
| Base model | SparkAudio/Spark-TTS-0.5B |
| Framework | Unsloth + HuggingFace Transformers |
| Precision | float32 (full fine-tuning) |
| Learning rate | 5e-5 |
| LR schedule | Cosine |
| Weight decay | 0.05 |
| Batch size | 4 (per device) |
| Gradient accumulation | 2 |
| Epochs | 4 (early stopped at epoch 1) |
| Best eval loss | 9.143 |
| Training samples | 37,052 |
| Validation samples | 1,012 |
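
These hyperparameters imply the following effective batch size and step count per epoch (a quick sanity check; the device count is an assumption, since the card does not state how many GPUs were used):

```python
per_device_batch = 4
grad_accum_steps = 2
num_devices = 1  # assumption: GPU count is not stated in the card

# Effective batch = per-device batch x gradient accumulation x devices
effective_batch = per_device_batch * grad_accum_steps * num_devices  # 8
steps_per_epoch = 37_052 // effective_batch  # 4631 optimizer steps per epoch
print(effective_batch, steps_per_epoch)
```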

How It Works

Spark-TTS uses a single-stream approach:

  1. Audio Tokenizer (BiCodec): Encodes reference audio into 32 global tokens (speaker identity) + variable-length semantic tokens (content)
  2. LLM (Qwen2.5-0.5B): Takes text + global tokens, generates semantic tokens
  3. Vocoder (BiCodec decoder): Converts semantic + global tokens back to audio waveform

For voice cloning, only the global tokens from the reference audio are used (the model generates all semantic tokens from scratch).
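
The three-stage flow above can be sketched as a toy data-flow script. All function bodies are illustrative stand-ins, not the real BiCodec or Qwen code; only the shapes mirror the description (a fixed 32-token global vector, variable-length semantic tokens, and the reference clip's semantic tokens being discarded for cloning):

```python
# Hypothetical stand-ins for the three Spark-TTS stages (shapes only, no real models).

def encode_reference(ref_wav):
    """BiCodec encoder: reference audio -> (global tokens, semantic tokens)."""
    global_tokens = [0] * 32                # 32 speaker-identity tokens
    semantic_tokens = [0] * (len(ref_wav) // 320)  # variable-length content tokens
    return global_tokens, semantic_tokens

def generate_semantic(text, global_tokens):
    """LLM (Qwen2.5-0.5B): text + global tokens -> semantic tokens."""
    return [hash((ch, len(global_tokens))) % 100 for ch in text]

def decode(global_tokens, semantic_tokens):
    """BiCodec decoder: tokens -> waveform samples (16 kHz in the real model)."""
    return [0.0] * (len(semantic_tokens) * 320)

# Voice cloning: keep only the global tokens from the reference.
global_toks, _discarded = encode_reference([0.0] * 16000)
wav = decode(global_toks, generate_semantic("नमस्ते", global_toks))
```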

Usage

With the REST API Server

# Clone the repo
git clone https://github.com/user/tts
cd tts

# Download the fine-tuned model from this HuggingFace repo
huggingface-cli download kapilkarda/tts-60db --local-dir pretrained_models/Spark-TTS-0.5B

# Install dependencies
pip install torch torchaudio fastapi uvicorn soundfile requests pydantic numpy

# Start API server
python api_server.py --backend pytorch --model_dir pretrained_models/Spark-TTS-0.5B --port 9090

# Synthesize
curl -X POST http://localhost:9090/v1/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "नमस्ते, यह एक टेस्ट है।", "voice_id": "YOUR_VOICE_ID"}' \
  -o output.wav
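
The same request can be issued from Python using only the standard library. The endpoint and JSON field names below are taken from the curl example above; nothing else about the server's API is assumed:

```python
import json
import urllib.request

API_URL = "http://localhost:9090/v1/synthesize"

def build_request(text, voice_id, url=API_URL):
    """Build the POST request the /v1/synthesize endpoint expects."""
    body = json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

def synthesize(text, voice_id, out_path="output.wav"):
    """POST the text and save the returned WAV bytes (requires a running server)."""
    req = build_request(text, voice_id)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

if __name__ == "__main__":
    synthesize("नमस्ते, यह एक टेस्ट है।", "YOUR_VOICE_ID")
```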

With Triton TRT-LLM (Production)

For substantially faster inference with TensorRT-LLM (RTF drops from 1.70 to 0.22 single-stream; see Performance Benchmarks):

# Start Triton server
cd runtime/triton_trtllm
docker compose -f docker-compose.prod.yml up -d

# Start API with Triton backend
python api_server.py --backend triton --triton-url localhost:8000 --port 9090

Direct Inference with Python

import torch
import soundfile as sf
from cli.SparkTTS import SparkTTS

model = SparkTTS(model_dir="pretrained_models/Spark-TTS-0.5B", device=torch.device("cuda:0"))

wav = model.inference(
    text="नमस्ते, यह एक टेस्ट है।",
    prompt_speech_path="reference.wav",
    prompt_text=None,  # Important: must be None for fine-tuned models
    temperature=0.8,
    top_k=50,
    top_p=0.95,
)

sf.write("output.wav", wav.cpu().numpy().squeeze(), 16000)

Important: Set prompt_text=None when using this fine-tuned model. The prompt_text continuation path is not compatible with fine-tuned weights.

Performance Benchmarks

Tested on NVIDIA A100 80GB:

| Backend | Mode | RTF | Notes |
|---|---|---|---|
| PyTorch | Single | 1.70 | Direct GPU inference |
| Triton TRT-LLM | Single | 0.22 | TensorRT optimized |
| Triton TRT-LLM | Concurrent ×4 | 0.074 | Production recommended |
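
RTF here is the real-time factor: synthesis wall-clock time divided by the duration of the generated audio, so values below 1.0 are faster than real time. A small helper makes the comparison concrete:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio."""
    return synthesis_seconds / audio_seconds

# From the table: PyTorch spends 1.70 s per 1 s of audio, Triton single-stream 0.22 s.
speedup = real_time_factor(1.70, 1.0) / real_time_factor(0.22, 1.0)  # ~7.7x
print(f"{speedup:.1f}x")
```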

Model Files

.
├── config.yaml              # Audio config (16kHz, segment params)
├── BiCodec/
│   ├── config.yaml
│   └── model.safetensors    # Audio tokenizer + vocoder (597MB)
├── LLM/
│   ├── config.json          # Qwen2.5-0.5B config
│   ├── model.safetensors    # Fine-tuned LLM weights (1.9GB)
│   ├── tokenizer.json       # Tokenizer with BiCodec special tokens
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── added_tokens.json
│   ├── merges.txt
│   ├── vocab.json
│   └── generation_config.json
├── wav2vec2-large-xlsr-53/  # Feature extractor for audio tokenizer
│   └── ...
└── src/                     # Model source code

Limitations

  • Voice cloning quality depends on reference audio quality (5-15s clean speech recommended)
  • Gender/pitch/speed controlled generation uses the base model path (not fine-tuned)
  • Romanized Hindi input may produce better results than Devanagari for some phrases
  • The prompt_text continuation inference path is not supported — always use prompt_text=None
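
A small pre-flight check for the recommended 5–15 s reference clip, using only the standard-library wave module (this helper is not part of the repo, and it only validates duration, not recording quality):

```python
import wave

def reference_duration(path: str) -> float:
    """Duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def is_usable_reference(path: str, min_s: float = 5.0, max_s: float = 15.0) -> bool:
    """True if the clip falls in the recommended 5-15 s window."""
    return min_s <= reference_duration(path) <= max_s
```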

Citation

Based on Spark-TTS:

@article{spark-tts-2025,
  title={Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens},
  year={2025}
}

License

Apache 2.0 (same as base model)
