# Somya-IndicTTS Multilingual TTS Model
A multilingual text-to-speech model supporting 10 languages with high-quality speech synthesis capabilities.
## Model Description
Somya-IndicTTS is an Orpheus-style multilingual TTS model based on the Llama architecture, fine-tuned for text-to-speech generation across multiple Indic languages. It generates high-quality speech audio at 24 kHz using the SNAC audio codec.
## Supported Languages
- HI - Hindi
- KN - Kannada
- MR - Marathi
- TE - Telugu
- BN - Bengali
- GU - Gujarati
- MA - Maithili
- MG - Magahi
- BH - Bhojpuri
- CH - Chhattisgarhi
## Supported Speakers

- M - Male speaker (`[spk_M]`)
- F - Female speaker (`[spk_F]`)
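The language and speaker tags above combine into a plain-text prefix that conditions the model. A minimal sketch of the prompt format (the `build_prompt` helper name is illustrative; the full pipeline below does the same thing inside `generate_tts`):

```python
def build_prompt(text: str, language: str = "HI", speaker: str = "F") -> str:
    """Compose the tagged prompt string the model is conditioned on."""
    return f"[{language}] [spk_{speaker}] {text}"

print(build_prompt("Namaste", "HI", "F"))  # [HI] [spk_F] Namaste
```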
## Installation

```bash
pip install torch transformers soundfile librosa snac
```
## Usage

### Basic TTS Inference
```python
import torch
import soundfile as sf
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

# Special token IDs used by the model
START_OF_TEXT = 128000
END_OF_TEXT = 128009
START_OF_SPEECH = 128257
END_OF_SPEECH = 128258
START_OF_HUMAN = 128259
END_OF_HUMAN = 128260
START_OF_AI = 128261
AUDIO_TOKENS_START = 128266
PAD_TOKEN = 128004

# Load model and tokenizer
model_path = "somyalab/Somya-IndicTTS"
device = "cuda" if torch.cuda.is_available() else "cpu"

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_path)

print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
model.eval()

print("Loading SNAC decoder...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to(device)
snac_model.eval()


def decode_snac_tokens(audio_tokens):
    """Decode audio tokens to a waveform via the SNAC codec."""
    codes = [t - AUDIO_TOKENS_START for t in audio_tokens]
    if len(codes) < 7:
        return None
    # Each SNAC frame spans 7 tokens; drop any incomplete trailing frame
    remainder = len(codes) % 7
    if remainder != 0:
        codes = codes[:-remainder]
    num_frames = len(codes) // 7
    # De-interleave into the three SNAC codebook layers
    l0, l1, l2 = [], [], []
    for i in range(num_frames):
        idx = i * 7
        l0.append(codes[idx])
        l1.extend([codes[idx + 1] - 4096, codes[idx + 4] - 16384])
        l2.extend([
            codes[idx + 2] - 8192,
            codes[idx + 3] - 12288,
            codes[idx + 5] - 20480,
            codes[idx + 6] - 24576
        ])
    t0 = torch.tensor(l0, dtype=torch.long).unsqueeze(0).to(device)
    t1 = torch.tensor(l1, dtype=torch.long).unsqueeze(0).to(device)
    t2 = torch.tensor(l2, dtype=torch.long).unsqueeze(0).to(device)
    with torch.no_grad():
        audio = snac_model.decode([t0, t1, t2])
    return audio.cpu().squeeze().numpy()


def generate_tts(text, language="HI", speaker="F", output_path="output.wav"):
    """Generate TTS audio and save it as a WAV file."""
    # Format prompt with language and speaker tags
    prompt = f"[{language}] [spk_{speaker}] {text}"

    # Tokenize
    text_ids = tokenizer.encode(prompt, add_special_tokens=True)
    text_ids.append(END_OF_TEXT)

    # Build input sequence
    input_ids = [START_OF_HUMAN] + text_ids + [END_OF_HUMAN] + [START_OF_AI] + [START_OF_SPEECH]
    input_tensor = torch.tensor([input_ids]).to(device)

    # Generate
    print(f"Generating audio for: {text[:50]}...")
    with torch.no_grad():
        output = model.generate(
            input_tensor,
            max_new_tokens=2048,
            temperature=0.7,
            repetition_penalty=1.1,
            top_p=0.95,
            top_k=100,
            do_sample=True,
            pad_token_id=PAD_TOKEN,
            eos_token_id=END_OF_SPEECH
        )

    # Extract audio tokens
    generated = output[0].cpu().tolist()
    try:
        speech_start = generated.index(START_OF_SPEECH) + 1
        if END_OF_SPEECH in generated[speech_start:]:
            speech_end = generated.index(END_OF_SPEECH, speech_start)
        else:
            speech_end = len(generated)
        audio_tokens = [t for t in generated[speech_start:speech_end] if t >= AUDIO_TOKENS_START]
    except ValueError:
        print("Failed to find speech tokens in output.")
        return False

    # Decode audio
    waveform = decode_snac_tokens(audio_tokens)
    if waveform is None:
        return False

    # Save audio at the model's 24 kHz sample rate
    sf.write(output_path, waveform, 24000)
    print(f"✓ Saved to: {output_path}")
    return True


# Example usage
generate_tts(
    text="सोम्या लैब में आपका ढेर सारा स्वागत है!",
    language="HI",
    speaker="F",
    output_path="output_hindi.wav"
)
```
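The 7-token frame layout that `decode_snac_tokens` undoes can be exercised in isolation. This standalone sketch (no model download needed; `FRAME_LAYOUT` and `deinterleave` are illustrative names) applies the same per-position offsets to a synthetic frame:

```python
AUDIO_TOKENS_START = 128266

# Per-position layout of one 7-token SNAC frame, matching the offsets
# subtracted in decode_snac_tokens: position -> (target layer, offset)
FRAME_LAYOUT = [
    (0, 0), (1, 4096), (2, 8192), (2, 12288),
    (1, 16384), (2, 20480), (2, 24576),
]

def deinterleave(audio_tokens):
    """Split a flat audio-token stream into the three SNAC codebook layers."""
    codes = [t - AUDIO_TOKENS_START for t in audio_tokens]
    codes = codes[: len(codes) - len(codes) % 7]  # drop incomplete frame
    layers = ([], [], [])
    for i in range(0, len(codes), 7):
        for pos, (layer, offset) in enumerate(FRAME_LAYOUT):
            layers[layer].append(codes[i + pos] - offset)
    return layers

# One synthetic frame carrying raw code value 5 at every position
frame = [AUDIO_TOKENS_START + off + 5 for _, off in FRAME_LAYOUT]
l0, l1, l2 = deinterleave(frame)
print(l0, l1, l2)  # [5] [5, 5] [5, 5, 5, 5]
```

One frame thus yields 1 coarse code, 2 mid codes, and 4 fine codes, which is why the decoder truncates the stream to a multiple of 7 before building the layer tensors.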
### Example with Different Languages

```python
# Hindi (Female)
generate_tts("नमस्ते, यह एक परीक्षण है।", language="HI", speaker="F", output_path="hi_f.wav")

# Kannada (Male)
generate_tts("ನಮಸ್ಕಾರ, ಇದು ಒಂದು ಪರೀಕ್ಷೆಯಾಗಿದೆ।", language="KN", speaker="M", output_path="kn_m.wav")

# Bengali (Male)
generate_tts("নমস্কার, এটি একটি পরীক্ষা।", language="BN", speaker="M", output_path="bn_m.wav")
```
## Note

Language tags such as `[EN]` or `[HI]` can also be paired with text written in a different language. The model supports cross-language synthesis: combining any language tag with any text yields the speaker characteristics and accent associated with that tag.
## Generation Parameters

Recommended settings for TTS generation:

- `temperature`: 0.6-0.7 (controls randomness)
- `top_p`: 0.9-0.95 (nucleus sampling)
- `top_k`: 50-100 (top-k sampling)
- `repetition_penalty`: 1.1-1.2 (reduces repetition)
- `max_new_tokens`: 2048 (maximum audio tokens)
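The recommendations above can be collected into a kwargs dict and unpacked into `model.generate`. A sketch with values picked from within the suggested ranges (the `GEN_KWARGS` name is illustrative; tune per language and voice):

```python
# Sampling settings drawn from the recommended ranges above
GEN_KWARGS = dict(
    do_sample=True,
    temperature=0.7,         # 0.6-0.7
    top_p=0.95,              # 0.9-0.95
    top_k=100,               # 50-100
    repetition_penalty=1.1,  # 1.1-1.2
    max_new_tokens=2048,     # maximum audio tokens
)

# Usage: model.generate(input_tensor, **GEN_KWARGS,
#                       pad_token_id=PAD_TOKEN, eos_token_id=END_OF_SPEECH)
print(sorted(GEN_KWARGS))
```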
## Citation

If you use this model, please cite:

```bibtex
@misc{somya-indictts,
  title={Somya-IndicTTS Multilingual TTS Model},
  author={Vedu023},
  year={2025},
  publisher={Hugging Face}
}
```