# Spark-Somya-TTS

Zero-shot voice cloning TTS model for Indic languages, fine-tuned from Spark-TTS-0.5B.
## Supported Languages
- Hindi (hi)
- Kannada (kn)
- Bengali (bn)
- Gujarati (gu)
- Telugu (te)
- Marathi (mr)
- Bhojpuri (bh)
- Maithili (mai)
- Magahi (mag)
- Chhattisgarhi (hne)
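For programmatic use, the languages above can be kept in a small lookup table for input validation. This is purely illustrative (the dict and helper are mine, not part of the repo; the inference prompt shown in Quick Start does not take a language tag):

```python
# Supported languages and codes, as listed in this model card.
SUPPORTED_LANGUAGES = {
    "hi": "Hindi",
    "kn": "Kannada",
    "bn": "Bengali",
    "gu": "Gujarati",
    "te": "Telugu",
    "mr": "Marathi",
    "bh": "Bhojpuri",
    "mai": "Maithili",
    "mag": "Magahi",
    "hne": "Chhattisgarhi",
}


def is_supported(code: str) -> bool:
    """Check whether a language code appears in this model card's list."""
    return code.lower() in SUPPORTED_LANGUAGES
```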
## Quick Start

### Installation

```bash
pip install torch transformers huggingface_hub unsloth soundfile librosa numpy
```
### Download Model

```python
from huggingface_hub import snapshot_download

model_dir = snapshot_download("somyalab/Spark_somya_TTS")
```
### Inference

```python
import re
import sys

import librosa
import soundfile as sf
import torch
from unsloth import FastLanguageModel

# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_dir,
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(model)

# Load the audio tokenizer (BiCodec), shipped inside the model repo
sys.path.insert(0, model_dir)
from sparktts.models.audio_tokenizer import BiCodecTokenizer

audio_tokenizer = BiCodecTokenizer(model_dir, "cuda")

# Reference audio for voice cloning
ref_audio, ref_sr = librosa.load("reference_voice.wav", sr=None)
ref_global_tokens, _ = audio_tokenizer.tokenize_audio(ref_audio, ref_sr)

# Build the TTS prompt
text = "नमस्ते, यह एक परीक्षण है।"  # "Hello, this is a test."
prompt = "".join([
    "<|task_tts|>",
    "<|start_content|>",
    text,
    "<|end_content|>",
    "<|start_global_token|>",
    ref_global_tokens,
    "<|end_global_token|>",
    "<|start_semantic_token|>",
])

# Generate semantic tokens
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
)

# Keep only the newly generated tokens
generated_ids = outputs[:, inputs.input_ids.shape[1]:]
generated_tokens = tokenizer.convert_ids_to_tokens(generated_ids[0].tolist())

# Extract semantic token IDs from tokens like <|bicodec_semantic_123|>
prefix = "<|bicodec_semantic_"
semantic_ids = [
    int(t[len(prefix):-2])
    for t in generated_tokens
    if t.startswith(prefix) and t.endswith("|>")
]

# Recover global token IDs from the reference token string
global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", ref_global_tokens)
global_ids = torch.tensor([int(t) for t in global_matches]).unsqueeze(0).unsqueeze(0)
semantic_ids = torch.tensor(semantic_ids).unsqueeze(0)

# Detokenize to a 16 kHz waveform
wav = audio_tokenizer.detokenize(
    global_ids.to("cuda").squeeze(0),
    semantic_ids.to("cuda"),
)
sf.write("output.wav", wav, 16000)
```
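The prompt and token conventions used above can be exercised without downloading the model. A minimal sketch of assembling the control-token prompt and parsing BiCodec token strings back into integer IDs (the helper names are mine, not part of the repo):

```python
import re


def build_tts_prompt(text: str, global_token_str: str) -> str:
    """Assemble the control-token prompt in the format used by Spark-TTS."""
    return "".join([
        "<|task_tts|>",
        "<|start_content|>", text, "<|end_content|>",
        "<|start_global_token|>", global_token_str, "<|end_global_token|>",
        "<|start_semantic_token|>",
    ])


def parse_token_ids(token_str: str, kind: str) -> list[int]:
    """Extract integer IDs from tokens like <|bicodec_global_42|>."""
    return [int(m) for m in re.findall(rf"<\|bicodec_{kind}_(\d+)\|>", token_str)]


# Round-trip check: IDs embedded in the prompt can be recovered by the parser.
globals_str = "<|bicodec_global_5|><|bicodec_global_17|>"
prompt = build_tts_prompt("नमस्ते", globals_str)
assert parse_token_ids(prompt, "global") == [5, 17]
```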
## Model Architecture
- Base: Qwen2ForCausalLM (0.5B parameters)
- Fine-tuned for Indic languages with extended tokenizer
- Uses BiCodec for audio tokenization/detokenization
## Citation

If you use this model, please cite:

```bibtex
@misc{spark-somya-tts,
  title={Spark-Somya-TTS},
  author={Somya Lab},
  year={2025},
  url={https://huggingface.co/somyalab/Spark_somya_TTS}
}
```
## License

Apache 2.0