Spark-TTS 0.5B — Hindi Fine-Tuned (v4)

Fine-tuned SparkAudio/Spark-TTS-0.5B for Hindi text-to-speech with zero-shot voice cloning.

Model Description

  • Architecture: Qwen2.5-0.5B backbone + BiCodec audio tokenizer
  • Fine-tuning: Full-parameter fine-tuning (not LoRA) using Unsloth on Hindi speech data
  • Training data: 37,052 samples from IndicVoices_R (34,247 single-text + 2,805 voice-cloning pairs)
  • Languages: Hindi (primary), English, Chinese (inherited from base model)
  • Audio: 16kHz, mono
  • Parameters: ~500M (LLM) + 597M (BiCodec)

Training Details

| Parameter | Value |
|---|---|
| Base model | SparkAudio/Spark-TTS-0.5B |
| Framework | Unsloth + HuggingFace Transformers |
| Precision | float32 (full fine-tuning) |
| Learning rate | 5e-5 |
| LR schedule | Cosine |
| Weight decay | 0.05 |
| Batch size | 4 (per device) |
| Gradient accumulation | 2 |
| Epochs | 4 (early stopped at epoch 1) |
| Best eval loss | 9.143 |
| Training samples | 37,052 |
| Validation samples | 1,012 |
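
These hyperparameters imply the following effective batch size and step count per epoch (a quick sanity check; the device count is an assumption, since the card does not state how many GPUs were used):

```python
per_device_batch = 4
grad_accum_steps = 2
num_devices = 1  # assumption: GPU count is not stated in the card

# Effective batch = per-device batch x gradient accumulation x devices
effective_batch = per_device_batch * grad_accum_steps * num_devices  # 8
steps_per_epoch = 37_052 // effective_batch  # 4631 optimizer steps per epoch
print(effective_batch, steps_per_epoch)
```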

How It Works

Spark-TTS uses a single-stream approach:

  1. Audio Tokenizer (BiCodec): Encodes reference audio into 32 global tokens (speaker identity) + variable-length semantic tokens (content)
  2. LLM (Qwen2.5-0.5B): Takes text + global tokens, generates semantic tokens
  3. Vocoder (BiCodec decoder): Converts semantic + global tokens back to audio waveform

For voice cloning, only the global tokens from the reference audio are used (the model generates all semantic tokens from scratch).
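
The three-stage flow above can be sketched as a toy data-flow script. All function bodies are illustrative stand-ins, not the real BiCodec or Qwen code; only the shapes mirror the description (a fixed 32-token global vector, variable-length semantic tokens, and the reference clip's semantic tokens being discarded for cloning):

```python
# Hypothetical stand-ins for the three Spark-TTS stages (shapes only, no real models).

def encode_reference(ref_wav):
    """BiCodec encoder: reference audio -> (global tokens, semantic tokens)."""
    global_tokens = [0] * 32                # 32 speaker-identity tokens
    semantic_tokens = [0] * (len(ref_wav) // 320)  # variable-length content tokens
    return global_tokens, semantic_tokens

def generate_semantic(text, global_tokens):
    """LLM (Qwen2.5-0.5B): text + global tokens -> semantic tokens."""
    return [hash((ch, len(global_tokens))) % 100 for ch in text]

def decode(global_tokens, semantic_tokens):
    """BiCodec decoder: tokens -> waveform samples (16 kHz in the real model)."""
    return [0.0] * (len(semantic_tokens) * 320)

# Voice cloning: keep only the global tokens from the reference.
global_toks, _discarded = encode_reference([0.0] * 16000)
wav = decode(global_toks, generate_semantic("नमस्ते", global_toks))
```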

Usage

With the REST API Server

# Clone the repo
git clone https://github.com/user/tts
cd tts

# Download the fine-tuned model from this HuggingFace repo
huggingface-cli download kapilkarda/tts-60db --local-dir pretrained_models/Spark-TTS-0.5B

# Install dependencies
pip install torch torchaudio fastapi uvicorn soundfile requests pydantic numpy

# Start API server
python api_server.py --backend pytorch --model_dir pretrained_models/Spark-TTS-0.5B --port 9090

# Synthesize
curl -X POST http://localhost:9090/v1/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "नमस्ते, यह एक टेस्ट है।", "voice_id": "YOUR_VOICE_ID"}' \
  -o output.wav
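
The same request can be issued from Python using only the standard library. The endpoint and JSON field names below are taken from the curl example above; nothing else about the server's API is assumed:

```python
import json
import urllib.request

API_URL = "http://localhost:9090/v1/synthesize"

def build_request(text, voice_id, url=API_URL):
    """Build the POST request the /v1/synthesize endpoint expects."""
    body = json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

def synthesize(text, voice_id, out_path="output.wav"):
    """POST the text and save the returned WAV bytes (requires a running server)."""
    req = build_request(text, voice_id)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

if __name__ == "__main__":
    synthesize("नमस्ते, यह एक टेस्ट है।", "YOUR_VOICE_ID")
```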

With Triton TRT-LLM (Production)

For substantially faster inference with TensorRT-LLM (RTF drops from 1.70 to 0.22 single-stream; see Performance Benchmarks):

# Start Triton server
cd runtime/triton_trtllm
docker compose -f docker-compose.prod.yml up -d

# Start API with Triton backend
python api_server.py --backend triton --triton-url localhost:8000 --port 9090

Direct Inference with Python

import torch
import soundfile as sf
from cli.SparkTTS import SparkTTS

model = SparkTTS(model_dir="pretrained_models/Spark-TTS-0.5B", device=torch.device("cuda:0"))

wav = model.inference(
    text="नमस्ते, यह एक टेस्ट है।",
    prompt_speech_path="reference.wav",
    prompt_text=None,  # Important: must be None for fine-tuned models
    temperature=0.8,
    top_k=50,
    top_p=0.95,
)

sf.write("output.wav", wav.cpu().numpy().squeeze(), 16000)

Important: Set prompt_text=None when using this fine-tuned model. The prompt_text continuation path is not compatible with fine-tuned weights.

Performance Benchmarks

Tested on NVIDIA A100 80GB:

| Backend | Mode | RTF | Notes |
|---|---|---|---|
| PyTorch | Single | 1.70 | Direct GPU inference |
| Triton TRT-LLM | Single | 0.22 | TensorRT optimized |
| Triton TRT-LLM | Concurrent ×4 | 0.074 | Production recommended |
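
RTF here is the real-time factor: synthesis wall-clock time divided by the duration of the generated audio, so values below 1.0 are faster than real time. A small helper makes the comparison concrete:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio."""
    return synthesis_seconds / audio_seconds

# From the table: PyTorch spends 1.70 s per 1 s of audio, Triton single-stream 0.22 s.
speedup = real_time_factor(1.70, 1.0) / real_time_factor(0.22, 1.0)  # ~7.7x
print(f"{speedup:.1f}x")
```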

Model Files

.
├── config.yaml              # Audio config (16kHz, segment params)
├── BiCodec/
│   ├── config.yaml
│   └── model.safetensors    # Audio tokenizer + vocoder (597MB)
├── LLM/
│   ├── config.json          # Qwen2.5-0.5B config
│   ├── model.safetensors    # Fine-tuned LLM weights (1.9GB)
│   ├── tokenizer.json       # Tokenizer with BiCodec special tokens
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── added_tokens.json
│   ├── merges.txt
│   ├── vocab.json
│   └── generation_config.json
├── wav2vec2-large-xlsr-53/  # Feature extractor for audio tokenizer
│   └── ...
└── src/                     # Model source code

Limitations

  • Voice cloning quality depends on reference audio quality (5-15s clean speech recommended)
  • Gender/pitch/speed controlled generation uses the base model path (not fine-tuned)
  • Romanized Hindi input may produce better results than Devanagari for some phrases
  • The prompt_text continuation inference path is not supported — always use prompt_text=None
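
A small pre-flight check for the recommended 5–15 s reference clip, using only the standard-library wave module (this helper is not part of the repo, and it only validates duration, not recording quality):

```python
import wave

def reference_duration(path: str) -> float:
    """Duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def is_usable_reference(path: str, min_s: float = 5.0, max_s: float = 15.0) -> bool:
    """True if the clip falls in the recommended 5-15 s window."""
    return min_s <= reference_duration(path) <= max_s
```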

Citation

Based on Spark-TTS:

@article{spark-tts-2025,
  title={Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens},
  year={2025}
}

License

Apache 2.0 (same as base model)
