Somya-IndicTTS Multilingual TTS Model

A multilingual text-to-speech model supporting 10 languages with high-quality speech synthesis capabilities.

Model Description

Orpheus is a multilingual TTS model based on the Llama architecture, fine-tuned here for text-to-speech generation across multiple Indic languages. The model generates high-quality speech audio at 24 kHz using the SNAC neural audio codec.

Supported Languages

  • HI - Hindi
  • KN - Kannada
  • MR - Marathi
  • TE - Telugu
  • BN - Bengali
  • GU - Gujarati
  • MA - Maithili
  • MG - Magahi
  • BH - Bhojpuri
  • CH - Chhattisgarhi

Supported Speakers

  • M - Male speaker ([spk_M])
  • F - Female speaker ([spk_F])
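A language tag and a speaker tag are simply prepended to the input text; a minimal sketch of the prompt format used by the inference code below (the helper name is illustrative):

```python
def build_prompt(text: str, language: str = "HI", speaker: str = "F") -> str:
    """Format the model's input prompt: language tag, speaker tag, then text."""
    return f"[{language}] [spk_{speaker}] {text}"

print(build_prompt("नमस्ते", language="HI", speaker="F"))
# [HI] [spk_F] नमस्ते
```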

Installation

pip install torch transformers soundfile librosa snac

Usage

Basic TTS Inference

import torch
import soundfile as sf
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

# Constants
START_OF_TEXT = 128000
END_OF_TEXT = 128009
START_OF_SPEECH = 128257
END_OF_SPEECH = 128258
START_OF_HUMAN = 128259
END_OF_HUMAN = 128260
START_OF_AI = 128261
AUDIO_TOKENS_START = 128266
PAD_TOKEN = 128004

# Load model and tokenizer
model_path = "somyalab/Somya-IndicTTS"
device = "cuda" if torch.cuda.is_available() else "cpu"

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_path)

print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
model.eval()

print("Loading SNAC decoder...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to(device)
snac_model.eval()

def decode_snac_tokens(audio_tokens):
    """Decode audio tokens to waveform"""
    codes = [t - AUDIO_TOKENS_START for t in audio_tokens]
    
    if len(codes) < 7:
        return None
    
    remainder = len(codes) % 7
    if remainder != 0:
        codes = codes[:-remainder]
    
    num_frames = len(codes) // 7
    l0, l1, l2 = [], [], []
    
    for i in range(num_frames):
        idx = i * 7
        l0.append(codes[idx])
        l1.extend([codes[idx+1] - 4096, codes[idx+4] - 16384])
        l2.extend([
            codes[idx+2] - 8192,
            codes[idx+3] - 12288,
            codes[idx+5] - 20480,
            codes[idx+6] - 24576
        ])
    
    t0 = torch.tensor(l0, dtype=torch.long).unsqueeze(0).to(device)
    t1 = torch.tensor(l1, dtype=torch.long).unsqueeze(0).to(device)
    t2 = torch.tensor(l2, dtype=torch.long).unsqueeze(0).to(device)
    
    with torch.no_grad():
        audio = snac_model.decode([t0, t1, t2])
    
    return audio.cpu().squeeze().numpy()

def generate_tts(text, language="HI", speaker="F", output_path="output.wav"):
    """Generate TTS audio"""
    # Format prompt with language and speaker tags
    prompt = f"[{language}] [spk_{speaker}] {text}"
    
    # Tokenize
    text_ids = tokenizer.encode(prompt, add_special_tokens=True)
    text_ids.append(END_OF_TEXT)
    
    # Build input sequence
    input_ids = [START_OF_HUMAN] + text_ids + [END_OF_HUMAN] + [START_OF_AI] + [START_OF_SPEECH]
    input_tensor = torch.tensor([input_ids]).to(device)
    
    # Generate
    print(f"Generating audio for: {text[:50]}...")
    with torch.no_grad():
        output = model.generate(
            input_tensor,
            attention_mask=torch.ones_like(input_tensor),  # avoids pad/attention warning
            max_new_tokens=2048,
            temperature=0.7,
            repetition_penalty=1.1,
            top_p=0.95,
            top_k=100,
            do_sample=True,
            pad_token_id=PAD_TOKEN,
            eos_token_id=END_OF_SPEECH
        )
    
    # Extract audio tokens
    generated = output[0].cpu().tolist()
    
    try:
        speech_start = generated.index(START_OF_SPEECH) + 1
        if END_OF_SPEECH in generated[speech_start:]:
            speech_end = generated.index(END_OF_SPEECH, speech_start)
        else:
            speech_end = len(generated)
        
        audio_tokens = [t for t in generated[speech_start:speech_end] if t >= AUDIO_TOKENS_START]
    except ValueError:
        print("Failed to find speech tokens in output.")
        return False
    
    # Decode audio
    waveform = decode_snac_tokens(audio_tokens)
    if waveform is None:
        return False
    
    # Save audio
    sf.write(output_path, waveform, 24000)
    print(f"✓ Saved to: {output_path}")
    return True

# Example usage
generate_tts(
    text="सोम्या लैब में आपका ढेर सारा स्वागत है!",
    language="HI",
    speaker="F",
    output_path="output_hindi.wav"
)
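The decode_snac_tokens function above de-interleaves each 7-token frame into SNAC's three code levels (1 coarse, 2 medium, and 4 fine codes per frame), where each of the 7 slots carries a fixed offset that is a multiple of 4096. A standalone sketch of that layout, useful for sanity-checking a token stream (helper names are illustrative):

```python
# Per-slot offsets within a 7-token frame, matching decode_snac_tokens above:
# slot 0 -> level 0; slots 1 and 4 -> level 1; slots 2, 3, 5, 6 -> level 2.
SLOT_OFFSETS = [0, 4096, 8192, 12288, 16384, 20480, 24576]

def deinterleave_frame(frame):
    """Split one 7-code frame into (level0, level1, level2) code lists."""
    assert len(frame) == 7
    c = [code - off for code, off in zip(frame, SLOT_OFFSETS)]
    return [c[0]], [c[1], c[4]], [c[2], c[3], c[5], c[6]]

# Each recovered code should land in the codec's 0..4095 codebook range.
l0, l1, l2 = deinterleave_frame([10, 4096 + 20, 8192 + 30, 12288 + 40,
                                 16384 + 50, 20480 + 60, 24576 + 70])
print(l0, l1, l2)  # [10] [20, 50] [30, 40, 60, 70]
```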

Example with Different Languages

# Hindi (Female)
generate_tts("नमस्ते, यह एक परीक्षण है।", language="HI", speaker="F", output_path="hi_f.wav")

# Kannada (Male)
generate_tts("ನಮಸ್ಕಾರ, ಇದು ಒಂದು ಪರೀಕ್ಷೆಯಾಗಿದೆ।", language="KN", speaker="M", output_path="kn_m.wav")

# Bengali (Male)
generate_tts("নমস্কার, এটি একটি পরীক্ষা।", language="BN", speaker="M", output_path="bn_m.wav")
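The per-language calls above can also be driven from a single list. A sketch that derives one output filename per sample, assuming the generate_tts function defined in the Usage section (the make_jobs helper is illustrative):

```python
samples = [
    ("नमस्ते, यह एक परीक्षण है।", "HI", "F"),
    ("ನಮಸ್ಕಾರ, ಇದು ಒಂದು ಪರೀಕ್ಷೆಯಾಗಿದೆ।", "KN", "M"),
    ("নমস্কার, এটি একটি পরীক্ষা।", "BN", "M"),
]

def make_jobs(samples):
    """Derive an output filename per (text, language, speaker) sample."""
    return [(text, lang, spk, f"{lang.lower()}_{spk.lower()}.wav")
            for text, lang, spk in samples]

for text, lang, spk, path in make_jobs(samples):
    # generate_tts is defined in the Usage section above; uncomment to run:
    # generate_tts(text, language=lang, speaker=spk, output_path=path)
    print(path)
```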

Note

Language tags such as [HI] or [EN] can also be paired with text in other languages. The model supports cross-lingual synthesis: combining any language tag with any input text yields different speaker characteristics and accents.

Generation Parameters

Recommended settings for TTS generation:

  • temperature: 0.6-0.7 (controls randomness)
  • top_p: 0.9-0.95 (nucleus sampling)
  • top_k: 50-100 (top-k sampling)
  • repetition_penalty: 1.1-1.2 (reduces repetition)
  • max_new_tokens: 2048 (maximum audio tokens)

Citation

If you use this model, please cite:

@misc{somya-indictts-2025,
  title={Somya-IndicTTS Multilingual TTS Model},
  author={Vedu023},
  year={2025},
  publisher={Hugging Face}
}