---
base_model: snorbyte/snorTTS-Indic-v0
tags:
- text-to-speech
- tts
- transformers
- unsloth
- llama
- audio
- speech-synthesis
license: apache-2.0
language:
- hi
- gu
- mr
- pa
- bn
- te
- kn
- ml
- ta
---
# snorTTS-Indic-v0
snorTTS-Indic-v0 is a multilingual Indic Text-to-Speech (TTS) model capable of generating speech in nine Indic languages: Hindi, Tamil, Telugu, Marathi, Kannada, Malayalam, Punjabi, Gujarati, and Bengali.
👉 [Read the full blog: *Train a SoTA Multilingual Indic Text-to-Speech (TTS)*](https://snorbyte.com/blog/train-sota-multilingual-indic-tts) to learn how we built it.
👉 [Try out the model in our playground](https://snorbyte.com/snortts-indic-v0).
All code, datasets, and models used in this work (both base and fine-tuned) are available below for anyone to use and build upon.
## Capabilities
- Text-to-Speech
- Voice Cloning
- Code-Switching
- Cross-lingual Voice Cloning (multilingual voice transfer)
## Model Overview
| Item | Details |
|------------------------|----------------------------------------------------------------------------------------------------------------------------|
| **Architecture** | LLaMA-3.2-3B |
| **Base model** | `canopylabs/3b-hi-pretrain-research_release` |
| **Audio codec** | SNAC @ 24 kHz, 3 codebooks (12,288 new tokens) |
| **Languages** | Hindi, Gujarati, Marathi, Punjabi, Bengali, Telugu, Kannada, Malayalam, Tamil |
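
The inference script below relies on a handful of special token IDs appended after the base LLaMA-3.2 text vocabulary (128,256 tokens). As a quick sanity check, the offsets used throughout this card can be computed directly; the constant names here are ours, not part of the repository:

```python
# Special-token layout used by the inference script below (names are illustrative).
TOKENISER_LENGTH = 128256  # base LLaMA-3.2 text vocabulary size

END_OF_SPEECH_ID = TOKENISER_LENGTH + 2   # stop token for generation
PAD_TOKEN_ID = TOKENISER_LENGTH + 7       # padding token
AUDIO_START_ID = TOKENISER_LENGTH + 10    # first audio-code token ID

print(END_OF_SPEECH_ID, PAD_TOKEN_ID, AUDIO_START_ID)
# → 128258 128263 128266
```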
## Training
For details about the training and dataset, please refer to [*Train a SoTA Multilingual Indic Text-to-Speech (TTS)*](https://snorbyte.com/blog/train-sota-multilingual-indic-tts).
You can find the training script (`train_orepheus.py`) in this repository. It is a single, self-contained script for fine-tuning the base model.
👉 Dataset used for training: [snorbyte/indic-tts-sample-snac-encoded](https://huggingface.co/datasets/snorbyte/indic-tts-sample-snac-encoded)
## Inference
👉 To host on Modal, see the `modal` folder.
- Install the necessary system libraries (Linux)
```bash
sudo apt update
sudo apt install -y sox libsox-dev
```
- Use Python 3.10
- If you already have `torch` installed, uninstall it first and let Unsloth install a compatible version
```bash
pip uninstall -y torch torchaudio
```
- Install the necessary Python packages
```bash
pip install unsloth loguru snac deepfilternet pydub soundfile librosa torchaudio
```
```python
from unsloth import FastLanguageModel
from snac import SNAC
import soundfile as sf
import numpy as np
from loguru import logger
from df.enhance import init_df, enhance, save_audio
import torch
import librosa
import torchaudio
import os
# Name of the model.
MODEL_NAME = 'snorbyte/snorTTS-Indic-v0'
MAX_SEQ_LENGTH = 4096
HUGGINGFACE_TOKEN = ""  # ! Add your Hugging Face token
# Load the model and tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=MODEL_NAME,
# load_in_4bit=True,
max_seq_length=MAX_SEQ_LENGTH,
token=HUGGINGFACE_TOKEN,
)
logger.success(f"Loaded model: {MODEL_NAME}")
# Define the special token IDs used by the model.
tokeniser_length = 128256
end_of_speech_id = tokeniser_length + 2
pad_token_id = tokeniser_length + 7
audio_start_id = tokeniser_length + 10
pad_token = tokenizer.decode([pad_token_id])
logger.success("Loaded special tokens for the tokenizer.")
# Wrap Model for Inference
FastLanguageModel.for_inference(model)
logger.success(f"{MODEL_NAME} is ready for inference.")
# Set the padding token and padding side.
tokenizer.pad_token = pad_token
tokenizer.padding_side = "left"
logger.success("Set padding token and padding side for the tokenizer.")
# Load the SNAC model for audio decoding.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
logger.success("Loaded SNAC model for audio decoding.")
# Load DeepFilterNet for optional post-processing.
df_model, df_state, _ = init_df()
# Function to generate audio.
def generate_audio(
    row, model, tokenizer, user=False, temperature=0.4, top_p=0.9, repetition_penalty=1.05
):
    try:
        if user:
            prompt = row["eval_text_user"]
        else:
            prompt = row["eval_text_no_user"]
        inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt")
        max_tokens = MAX_SEQ_LENGTH - inputs.input_ids.shape[1]
        output = model.generate(
            input_ids=inputs.input_ids.to("cuda"),
            attention_mask=inputs.attention_mask.to("cuda"),
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            eos_token_id=end_of_speech_id,
        )
        # Keep only audio tokens and truncate to whole 7-token frames.
        audio_ids = []
        for token_id in output[0]:
            if token_id >= audio_start_id:
                audio_ids.append(token_id.item())
        clean_audio_ids = []
        for i in range(len(audio_ids) // 7):
            for j in range(7):
                clean_audio_ids += [audio_ids[7 * i + j] - audio_start_id]
        # De-interleave each 7-token frame into the three SNAC codebooks.
        codes = [[], [], []]
        for i in range(len(clean_audio_ids) // 7):
            codes[0].append(clean_audio_ids[7 * i])
            codes[1].append(clean_audio_ids[7 * i + 1] - 4096)
            codes[2].append(clean_audio_ids[7 * i + 2] - (2 * 4096))
            codes[2].append(clean_audio_ids[7 * i + 3] - (3 * 4096))
            codes[1].append(clean_audio_ids[7 * i + 4] - (4 * 4096))
            codes[2].append(clean_audio_ids[7 * i + 5] - (5 * 4096))
            codes[2].append(clean_audio_ids[7 * i + 6] - (6 * 4096))
        codes = [
            torch.tensor(codes[0]).unsqueeze(0),
            torch.tensor(codes[1]).unsqueeze(0),
            torch.tensor(codes[2]).unsqueeze(0),
        ]
        try:
            audio = snac_model.decode(codes)
        except Exception as e:
            logger.error(f"Error decoding audio: {e}")
            return None
        return audio.detach().squeeze().to("cpu").numpy()
    except Exception as e:
        logger.error(f"Error generating audio: {e}")
        return None
# Run inference.
# * Refer to the training script to see how prompts are built from SNAC tokens.
row = {
"eval_text_user": f"<|begin_of_text|>kannada142: ಅಯ್ಯಯ್ಯೋ... Whitefield ಗೆ reach ಆಗೋಕೆ almost 10 hours ಆಯ್ತು you know... traffic was so terrible today <|eot_id|>"
}
eval_sample = generate_audio(row, model, tokenizer, True)
if eval_sample is None:
    logger.error("Failed to generate audio for evaluation sample.")
else:
    logger.success("Audio generated. Post-processing started.")
    # Post-processing settings.
    filename = "eval.wav"
    speed = 1.05  # Per-speaker speedup (see the Speaker IDs table below).
    denoise = False  # Optional DeepFilterNet denoising.
    output = eval_sample.astype(np.float32)
    # Speed up.
    if abs(speed - 1.0) > 1e-4:
        output_t = torch.from_numpy(output).unsqueeze(0)
        output_speed, _ = torchaudio.sox_effects.apply_effects_tensor(
            output_t, 24_000, effects=[["tempo", f"{speed}"]]
        )
        output = output_speed.squeeze(0).cpu().numpy()
    # Denoise.
    if denoise:
        resampled_48k = librosa.resample(output, orig_sr=24_000, target_sr=48_000)
        resampled_48k = torch.from_numpy(resampled_48k).unsqueeze(0)
        output_48k = enhance(df_model, df_state, resampled_48k)
        output_48k = output_48k.squeeze(0).cpu().numpy()
        output = librosa.resample(output_48k, orig_sr=48_000, target_sr=24_000)
    logger.success("Saving final output...")
    # Save.
    sf.write(filename, output, 24_000)
    logger.success(f"Generated and saved evaluation sample audio as {filename}.")
```
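The decode loop above de-interleaves each 7-token frame into SNAC's three codebooks (1, 2, and 4 codes per frame, with each slot offset by a multiple of 4096). Building training or voice-cloning prompts requires the inverse packing; here is a minimal sketch that simply mirrors the unpack logic above (the helper names are ours, not from the repository):

```python
def pack_frames(codes):
    """Interleave three SNAC codebook lists into flat 7-token frames.

    codes[0] holds 1 code per frame, codes[1] holds 2, codes[2] holds 4.
    Slot j within a frame is offset by j * 4096, mirroring the decode loop.
    """
    flat = []
    for i in range(len(codes[0])):
        flat += [
            codes[0][i],
            codes[1][2 * i] + 4096,
            codes[2][4 * i] + 2 * 4096,
            codes[2][4 * i + 1] + 3 * 4096,
            codes[1][2 * i + 1] + 4 * 4096,
            codes[2][4 * i + 2] + 5 * 4096,
            codes[2][4 * i + 3] + 6 * 4096,
        ]
    return flat


def unpack_frames(flat):
    """Inverse of pack_frames: recover the three codebook lists."""
    codes = [[], [], []]
    for i in range(len(flat) // 7):
        codes[0].append(flat[7 * i])
        codes[1].append(flat[7 * i + 1] - 4096)
        codes[2].append(flat[7 * i + 2] - 2 * 4096)
        codes[2].append(flat[7 * i + 3] - 3 * 4096)
        codes[1].append(flat[7 * i + 4] - 4 * 4096)
        codes[2].append(flat[7 * i + 5] - 5 * 4096)
        codes[2].append(flat[7 * i + 6] - 6 * 4096)
    return codes
```

Adding `audio_start_id` to each packed value yields the vocabulary IDs the model actually generates.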
## Prompts
- **Standard**
```python
{
"eval_text_no_user": f"<|begin_of_text|>{utterance}<|eot_id|>"
}
```
```python
{
"eval_text_no_user": f"<|begin_of_text|>நிச்சயமா. ரோம் ல் இரவு நேரம் ரொம்ப அழகா இருக்கு—piazzaகள் சுத்துறதுக்கு நல்ல நேரம்.<|eot_id|>"
},
```
- **Speaker-Specific** (recommended)
```python
{
"eval_text_user": f"<|begin_of_text|>{language}{speaker_id}: {utterance}<|eot_id|>"
}
```
> 📝 `utterance` can be in the speaker's native language, multilingual, or code-switched.
```python
{
"eval_text_user": f"<|begin_of_text|>hindi159: चलते रहो इस सफर में बिना रुके, क्योंकि मंज़िलें खुद राह दिखाने लगती हैं <|eot_id|>"
}
```
```python
{
"eval_text_user": f"<|begin_of_text|>bengali125: मुझे तो लगा वो आएगा, ஆனா அவன் வந்து full drama பண்ணிட்டான், আর শেষে আবার আমাকে দোষ দিচ্ছে <|eot_id|>"
}
```
### Speaker IDs
| Language | Speakers | Recommended Speedup |
|-----------|------------------|----------------------|
| Hindi | [159,49,43] | [1.05,1.1,1.1] |
| Tamil | [188,128,176] | [1.1,1.15,1.1] |
| Bengali | [125] | [1.1] |
| Malayalam | [189,124] | [1.1,1.1] |
| Kannada | [142,138,131,59] | [1.05,1.1,1.1,1.1] |
| Telugu | [69,133] | [1.1,1.1] |
| Punjabi | [191,67,201] | [1.08,1.06,1.1] |
| Gujarati | [62,190] | [1.15,1.25] |
| Marathi | [205,82,199,203] | [1.05,1.05,1.1,1.15] |
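
Putting the prompt format and the table above together, a small helper (hypothetical, not part of the repository) can build the speaker-specific prompt and look up the recommended speedup in one step:

```python
# Recommended speakers and speedups, transcribed from the table above.
SPEAKERS = {
    "hindi": {159: 1.05, 49: 1.1, 43: 1.1},
    "tamil": {188: 1.1, 128: 1.15, 176: 1.1},
    "bengali": {125: 1.1},
    "malayalam": {189: 1.1, 124: 1.1},
    "kannada": {142: 1.05, 138: 1.1, 131: 1.1, 59: 1.1},
    "telugu": {69: 1.1, 133: 1.1},
    "punjabi": {191: 1.08, 67: 1.06, 201: 1.1},
    "gujarati": {62: 1.15, 190: 1.25},
    "marathi": {205: 1.05, 82: 1.05, 199: 1.1, 203: 1.15},
}


def build_prompt(language, speaker_id, utterance):
    """Return the speaker-specific prompt string and its recommended speedup."""
    speed = SPEAKERS[language][speaker_id]
    prompt = f"<|begin_of_text|>{language}{speaker_id}: {utterance}<|eot_id|>"
    return prompt, speed
```

For example, `build_prompt("kannada", 142, "...")` yields a prompt in the same `kannada142: ...` format used in the inference example, along with a speedup of 1.05.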
## Contact Us
👉 Mail: [founders@snorbyte.com](mailto:founders@snorbyte.com)
👉 Website: [https://snorbyte.com](https://snorbyte.com)
## Citation
BibTeX:
```bibtex
@misc{indictextaudio2025,
title={snorTTS-Indic-v0: Multilingual Indic TTS},
author={snorbyte},
year={2025},
howpublished={\url{https://huggingface.co/snorbyte/snorTTS-Indic-v0}},
note={Apache-2.0}
}
```