# Spark-TTS 0.5B — Hindi Fine-Tuned (v4)

Fine-tuned SparkAudio/Spark-TTS-0.5B for Hindi text-to-speech with zero-shot voice cloning.
## Model Description
- Architecture: Qwen2.5-0.5B backbone + BiCodec audio tokenizer
- Fine-tuning: Full-parameter fine-tuning (not LoRA) using Unsloth on Hindi speech data
- Training data: 37,052 samples from IndicVoices_R (34,247 single-text + 2,805 voice-cloning pairs)
- Languages: Hindi (primary), English, Chinese (inherited from base model)
- Audio: 16kHz, mono
- Parameters: ~500M (LLM) + 597M (BiCodec)
## Training Details
| Parameter | Value |
|---|---|
| Base model | SparkAudio/Spark-TTS-0.5B |
| Framework | Unsloth + HuggingFace Transformers |
| Precision | float32 (full fine-tuning) |
| Learning rate | 5e-5 |
| LR schedule | Cosine |
| Weight decay | 0.05 |
| Batch size | 4 (per device) |
| Gradient accumulation | 2 |
| Epochs | 4 (early stopped at epoch 1) |
| Best eval loss | 9.143 |
| Training samples | 37,052 |
| Validation samples | 1,012 |
## How It Works
Spark-TTS uses a single-stream approach:
- Audio Tokenizer (BiCodec): Encodes reference audio into 32 global tokens (speaker identity) + variable-length semantic tokens (content)
- LLM (Qwen2.5-0.5B): Takes text + global tokens, generates semantic tokens
- Vocoder (BiCodec decoder): Converts semantic + global tokens back to audio waveform
For voice cloning, only the global tokens from the reference audio are used (the model generates all semantic tokens from scratch).
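To make the single-stream layout concrete, the sketch below assembles the kind of text-plus-global-token prompt the LLM consumes: the text content and the 32 speaker-identity tokens go in, and the model autoregressively emits semantic tokens. The special-token names (`<|task_tts|>`, `<|bicodec_global_*|>`, etc.) are illustrative assumptions, not the verified vocabulary — check the shipped `tokenizer.json` for the real token names.

```python
# Illustrative sketch of the single-stream prompt layout.
# Token names here are assumptions for illustration only.
def build_tts_prompt(text: str, global_token_ids: list[int]) -> str:
    """Assemble text content + speaker (global) tokens into one LLM prompt string."""
    globals_str = "".join(f"<|bicodec_global_{i}|>" for i in global_token_ids)
    return (
        "<|task_tts|>"
        "<|start_content|>" + text + "<|end_content|>"
        "<|start_global_token|>" + globals_str + "<|end_global_token|>"
    )

# 32 global tokens carry speaker identity; the LLM generates all semantic
# (content) tokens itself, which is why cloning needs no prompt transcript.
prompt = build_tts_prompt("नमस्ते", list(range(32)))
```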
## Usage

### With the REST API Server
```bash
# Clone the repo
git clone https://github.com/user/tts
cd tts

# Download the model from this HuggingFace repo
huggingface-cli download kapilkarda/tts-60db --local-dir pretrained_models/Spark-TTS-0.5B

# Install dependencies
pip install torch torchaudio fastapi uvicorn soundfile requests pydantic numpy

# Start the API server
python api_server.py --backend pytorch --model_dir pretrained_models/Spark-TTS-0.5B --port 9090

# Synthesize
curl -X POST http://localhost:9090/v1/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "नमस्ते, यह एक टेस्ट है।", "voice_id": "YOUR_VOICE_ID"}' \
  -o output.wav
```
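The same request can be made from Python. The sketch below mirrors the `curl` call against the `/v1/synthesize` endpoint using only the standard library; the server URL and `voice_id` are placeholders, and the response is assumed (as above) to be raw WAV bytes.

```python
import json
import urllib.request

def build_payload(text: str, voice_id: str) -> bytes:
    """JSON body matching the /v1/synthesize request shown above."""
    return json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")

def synthesize(text: str, voice_id: str,
               url: str = "http://localhost:9090/v1/synthesize") -> bytes:
    """POST the request and return the raw WAV bytes from the response."""
    req = urllib.request.Request(
        url,
        data=build_payload(text, voice_id),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Example (requires a running server):
# wav_bytes = synthesize("नमस्ते, यह एक टेस्ट है।", "YOUR_VOICE_ID")
# open("output.wav", "wb").write(wav_bytes)
```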
### With Triton TRT-LLM (Production)

For 10x faster inference using TensorRT-LLM:

```bash
# Start the Triton server
cd runtime/triton_trtllm
docker compose -f docker-compose.prod.yml up -d

# Start the API with the Triton backend
python api_server.py --backend triton --triton-url localhost:8000 --port 9090
```
### Direct Inference with Python

```python
import torch
import soundfile as sf

from cli.SparkTTS import SparkTTS

model = SparkTTS(model_dir="pretrained_models/Spark-TTS-0.5B", device=torch.device("cuda:0"))

wav = model.inference(
    text="नमस्ते, यह एक टेस्ट है।",
    prompt_speech_path="reference.wav",
    prompt_text=None,  # Important: must be None for fine-tuned models
    temperature=0.8,
    top_k=50,
    top_p=0.95,
)
sf.write("output.wav", wav.cpu().numpy().squeeze(), 16000)
```
**Important:** Set `prompt_text=None` when using this fine-tuned model; the `prompt_text` continuation path is not compatible with the fine-tuned weights.
## Performance Benchmarks
Tested on NVIDIA A100 80GB:
| Backend | Mode | RTF (lower is better) | Notes |
|---|---|---|---|
| PyTorch | Single | 1.70 | Direct GPU inference |
| Triton TRT-LLM | Single | 0.22 | TensorRT optimized |
| Triton TRT-LLM | Concurrent x4 | 0.074 | Production recommended |
## Model Files

```
.
├── config.yaml                  # Audio config (16kHz, segment params)
├── BiCodec/
│   ├── config.yaml
│   └── model.safetensors        # Audio tokenizer + vocoder (597MB)
├── LLM/
│   ├── config.json              # Qwen2.5-0.5B config
│   ├── model.safetensors        # Fine-tuned LLM weights (1.9GB)
│   ├── tokenizer.json           # Tokenizer with BiCodec special tokens
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── added_tokens.json
│   ├── merges.txt
│   ├── vocab.json
│   └── generation_config.json
├── wav2vec2-large-xlsr-53/      # Feature extractor for audio tokenizer
│   └── ...
└── src/                         # Model source code
```
## Limitations
- Voice cloning quality depends on reference audio quality (5-15s clean speech recommended)
- Gender/pitch/speed controlled generation uses the base model path (not fine-tuned)
- Romanized Hindi input may produce better results than Devanagari for some phrases
- The `prompt_text` continuation inference path is not supported; always pass `prompt_text=None`
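Since cloning quality depends heavily on the reference clip, a quick duration pre-check can catch bad inputs before synthesis. The sketch below uses only the stdlib `wave` module; the 5-15 s window simply mirrors the recommendation above and is not enforced by the model.

```python
import wave

def reference_ok(path: str, min_s: float = 5.0, max_s: float = 15.0) -> bool:
    """Return True if the WAV file's duration is in the recommended window."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    return min_s <= duration <= max_s
```

This checks duration only — it does not assess noise level or channel count, which also matter for cloning quality.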
## Citation

Based on Spark-TTS:

```bibtex
@article{spark-tts-2025,
  title={Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens},
  year={2025}
}
```
## License
Apache 2.0 (same as base model)