Spark-Somya-TTS

A zero-shot voice-cloning text-to-speech (TTS) model for Indic languages, fine-tuned from Spark-TTS-0.5B.

Supported Languages

  • Hindi (hi)
  • Kannada (kn)
  • Bengali (bn)
  • Gujarati (gu)
  • Telugu (te)
  • Marathi (mr)
  • Bhojpuri (bh)
  • Maithili (mai)
  • Magahi (mag)
  • Chhattisgarhi (hne)
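For programmatic language selection, the list above can be expressed as a mapping. This is an illustrative helper, not part of the released package; the names and codes mirror the list exactly.

```python
# Supported languages and their codes, as listed in this model card.
SUPPORTED_LANGUAGES = {
    "Hindi": "hi",
    "Kannada": "kn",
    "Bengali": "bn",
    "Gujarati": "gu",
    "Telugu": "te",
    "Marathi": "mr",
    "Bhojpuri": "bh",
    "Maithili": "mai",
    "Magahi": "mag",
    "Chhattisgarhi": "hne",
}

def is_supported(code: str) -> bool:
    """Return True if the given language code is in the supported set."""
    return code in SUPPORTED_LANGUAGES.values()
```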

Quick Start

Installation

pip install torch transformers huggingface_hub unsloth soundfile librosa numpy
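Before running the examples, it can help to confirm that the packages from the command above are importable. This is an optional sanity-check sketch, not part of the model's tooling:

```python
import importlib.util

# Packages required by the Quick Start examples (from the pip command above).
REQUIRED = ["torch", "transformers", "huggingface_hub", "unsloth",
            "soundfile", "librosa", "numpy"]

# find_spec returns None for packages that are not installed.
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
```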

Download Model

from huggingface_hub import snapshot_download

model_dir = snapshot_download("somyalab/Spark_somya_TTS")

Inference

import torch
import numpy as np
import soundfile as sf
from unsloth import FastLanguageModel

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_dir,
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(model)

# Load audio tokenizer (BiCodec)
import sys
sys.path.insert(0, model_dir)
from sparktts.models.audio_tokenizer import BiCodecTokenizer

audio_tokenizer = BiCodecTokenizer(model_dir, "cuda")

# Reference audio for voice cloning
import librosa
ref_audio, ref_sr = librosa.load("reference_voice.wav", sr=None)
ref_global_tokens, _ = audio_tokenizer.tokenize_audio(ref_audio, ref_sr)

# Generate speech
text = "नमस्ते, यह एक परीक्षण है।"  # "Hello, this is a test."

prompt = "".join([
    "<|task_tts|>",
    "<|start_content|>",
    text,
    "<|end_content|>",
    "<|start_global_token|>",
    ref_global_tokens,
    "<|end_global_token|>",
    "<|start_semantic_token|>",
])

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
)

# Decode to audio
generated_ids = outputs[:, inputs.input_ids.shape[1]:]
generated_tokens = tokenizer.convert_ids_to_tokens(generated_ids[0].tolist())

# Extract semantic token IDs
semantic_ids = []
for t in generated_tokens:
    if t.startswith("<|bicodec_semantic_") and t.endswith("|>"):
        semantic_ids.append(int(t[len("<|bicodec_semantic_"):-2]))

# Detokenize to waveform
import re
global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", ref_global_tokens)
global_ids = torch.tensor([int(t) for t in global_matches]).unsqueeze(0).unsqueeze(0)
semantic_ids = torch.tensor(semantic_ids).unsqueeze(0)

wav = audio_tokenizer.detokenize(
    global_ids.to("cuda").squeeze(0),
    semantic_ids.to("cuda"),
)

sf.write("output.wav", wav, 16000)

Model Architecture

  • Base: Qwen2ForCausalLM (0.5B parameters)
  • Fine-tuned for Indic languages with an extended tokenizer
  • Uses BiCodec for audio tokenization/detokenization
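The extended tokenizer's special tokens are combined into the fixed prompt layout used in the Inference example above. A minimal sketch of that layout, using only the token names that appear in the Quick Start code:

```python
def build_tts_prompt(text: str, global_tokens: str) -> str:
    """Assemble the TTS prompt: task tag, text content, the reference
    voice's global tokens, then an open semantic-token section that the
    model completes with <|bicodec_semantic_N|> tokens."""
    return (
        "<|task_tts|>"
        "<|start_content|>" + text + "<|end_content|>"
        "<|start_global_token|>" + global_tokens + "<|end_global_token|>"
        "<|start_semantic_token|>"
    )
```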

Citation

If you use this model, please cite:

@misc{spark-somya-tts,
  title={Spark-Somya-TTS},
  author={Somya Lab},
  year={2025},
  url={https://huggingface.co/somyalab/Spark_somya_TTS}
}

License

Apache 2.0
