---
base_model: unsloth/orpheus-3b-0.1-ft
model_type: llama
library_name: transformers
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- sanskrit
- audio-generation
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- fine-tuned
- devanagari
language:
- en
- sa
datasets:
- IIT-Madras-IndicTTS
metrics: null
widget:
- text: यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।
example_title: Bhagavad Gita 4.7
- text: कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।
example_title: Bhagavad Gita 2.47
- text: विद्या ददाति विनयं
example_title: Subhashita
- text: तमसो मा ज्योतिर्गमय।
example_title: Brihadaranyaka Upanishad
model-index:
- name: Sanskrit TTS v2
results:
- task:
type: text-to-speech
name: Text-to-Speech
dataset:
type: IIT-Madras-IndicTTS
name: IIT Madras IndicTTS Sanskrit (Mono Female)
metrics:
- type: audio_duration
name: Training Audio Duration
value: 10.93 hrs
---
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Rstar-910/SamskritaBharati/blob/main/Sanskrit_TTS_v2.ipynb)
# Sanskrit Text-to-Speech Model
## Model Overview
**Model ID:** R910/Sanskrit_TTS_v2
**Base Model:** unsloth/orpheus-3b-0.1-ft
**Language:** Sanskrit (Devanagari script)
**Primary Dataset:** [IIT Madras IndicTTS Sanskrit Database](https://www.iitm.ac.in/donlab/indictts/database)
**Voice:** Mono Female
**Training Audio Duration:** 10.93 hours
This LLaMA-based model is fine-tuned for Sanskrit text-to-speech synthesis and was trained with Unsloth and Hugging Face's TRL library for improved training efficiency.
## Training Data
The model was trained on the **Sanskrit speech corpus** from the [IIT Madras IndicTTS Database](https://www.iitm.ac.in/donlab/indictts/database), using a **mono female voice** recording with a total audio duration of **10.93 hours**. The IndicTTS project, developed by the Speech and Language Technology Group at IIT Madras, provides high-quality speech corpora for Indic languages.
## Installation Requirements
### Environment Detection and Base Setup
```bash
# Environment detection
python3 -c "
import os
print('colab' if any(k.startswith('COLAB_') for k in os.environ) else 'local')
"
# Install core dependencies
pip install snac
```
### Google Colab Installation
For Google Colab environments, execute the following installation sequence:
```bash
# Install Colab-specific dependencies
pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
pip install sentencepiece protobuf 'datasets>=3.4.1,<4.0.0' huggingface_hub hf_transfer
pip install --no-deps unsloth
# Environment cleanup (recommended for clean installation)
pip uninstall torch torchvision torchaudio unsloth unsloth_zoo transformers -y
pip cache purge
# Install PyTorch with CUDA 12.1 support
pip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121
# Install latest Unsloth from source
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# Additional dependencies
pip install librosa
pip install -U datasets
```
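After installation, a quick sanity check can confirm that the key packages resolve before loading the model. This is a minimal sketch (the `check_packages` helper is illustrative, not part of any listed library):

```python
import importlib.util

def check_packages(names):
    """Map each package name to whether it can be imported in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Packages the inference code below imports at runtime
for name, ok in check_packages(["torch", "snac", "unsloth", "librosa", "datasets"]).items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```

If any package reports `MISSING`, re-run the corresponding install step above before proceeding.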
## Implementation Guide
### Complete Implementation Code
```python
import gradio as gr
import torch
from unsloth import FastLanguageModel
from IPython.display import display, Audio
import numpy as np
# Global model variables
model = None
tokenizer = None
snac_model = None
device = None
def load_models():
"""Initialize and load all required models for Sanskrit TTS inference."""
global model, tokenizer, snac_model, device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading models on: {device}")
# Load the fine-tuned Sanskrit TTS model
model, tokenizer = FastLanguageModel.from_pretrained(
"R910/Sanskrit_TTS_v2",
max_seq_length=2048,
dtype=None,
load_in_4bit=False,
)
model = model.to(device)
FastLanguageModel.for_inference(model)
    # Load SNAC codec for waveform decoding (kept on CPU)
    try:
        from snac import SNAC
        snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
        snac_model = snac_model.to("cpu")
    except ImportError:
        print("Warning: SNAC is not installed. Run `pip install snac` first.")
        return
print("Models loaded successfully!")
def redistribute_codes(code_list):
"""Redistribute generated codes into hierarchical layers for audio synthesis."""
layer_1 = []
layer_2 = []
layer_3 = []
    for i in range(len(code_list) // 7):  # codes arrive in frames of 7 tokens
layer_1.append(code_list[7*i])
layer_2.append(code_list[7*i+1]-4096)
layer_3.append(code_list[7*i+2]-(2*4096))
layer_3.append(code_list[7*i+3]-(3*4096))
layer_2.append(code_list[7*i+4]-(4*4096))
layer_3.append(code_list[7*i+5]-(5*4096))
layer_3.append(code_list[7*i+6]-(6*4096))
codes = [torch.tensor(layer_1).unsqueeze(0),
torch.tensor(layer_2).unsqueeze(0),
torch.tensor(layer_3).unsqueeze(0)]
audio_hat = snac_model.decode(codes)
return audio_hat
def sanskrit_tts_inference(sanskrit_text, chosen_voice=""):
"""
Generate Sanskrit speech from input text using the fine-tuned model.
Args:
sanskrit_text (str): Input Sanskrit text in Devanagari script
chosen_voice (str): Voice selection parameter (optional)
Returns:
tuple: (audio_data, status_message)
"""
if not sanskrit_text.strip():
return None, "Please enter some Sanskrit text."
    try:
        prompts = [sanskrit_text]
        # Speaker ID 1070 is the voice the model was fine-tuned on
        chosen_voice = chosen_voice or "1070"
        # Prepend the speaker ID to each prompt
        prompts_ = [f"{chosen_voice}: {p}" for p in prompts]
# Tokenize input prompts
all_input_ids = []
for prompt in prompts_:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
all_input_ids.append(input_ids)
        # Orpheus control tokens marking sequence start and text end
        start_token = torch.tensor([[128259]], dtype=torch.int64)
        end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)
# Construct modified input sequences
all_modified_input_ids = []
for input_ids in all_input_ids:
modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
all_modified_input_ids.append(modified_input_ids)
# Apply padding and create attention masks
all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
padding = max_length - modified_input_ids.shape[1]
padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
all_padded_tensors.append(padded_tensor)
all_attention_masks.append(attention_mask)
# Batch tensors for inference
all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)
input_ids = all_padded_tensors.to(device)
attention_mask = all_attention_masks.to(device)
# Generate audio codes using the model
generated_ids = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=1200,
do_sample=True,
temperature=0.6,
top_p=0.95,
repetition_penalty=1.1,
num_return_sequences=1,
eos_token_id=128258,
use_cache=True
)
# Post-process generated tokens
token_to_find = 128257
token_to_remove = 128258
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
if len(token_indices[1]) > 0:
last_occurrence_idx = token_indices[1][-1].item()
cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
cropped_tensor = generated_ids
        # Strip the end-of-speech token from each row
        processed_rows = []
        for row in cropped_tensor:
            processed_rows.append(row[row != token_to_remove])
# Convert tokens to audio codes
code_lists = []
for row in processed_rows:
row_length = row.size(0)
new_length = (row_length // 7) * 7
trimmed_row = row[:new_length]
            trimmed_row = (trimmed_row - 128266).tolist()  # subtract the audio-token offset
code_lists.append(trimmed_row)
# Generate audio samples
my_samples = []
for code_list in code_lists:
samples = redistribute_codes(code_list)
my_samples.append(samples)
if len(my_samples) > 0:
audio_sample = my_samples[0].detach().squeeze().to("cpu").numpy()
return (24000, audio_sample), f"✅ Generated audio for: {sanskrit_text}"
else:
return None, "❌ Failed to generate audio - no valid codes produced."
except Exception as e:
return None, f"❌ Error during inference: {str(e)}"
# Initialize models
print("Loading models... This may take a moment.")
load_models()
# Create Gradio interface
with gr.Blocks(title="Sanskrit Text-to-Speech") as demo:
gr.Markdown("""
# 🕉️ Sanskrit Text-to-Speech
Enter Sanskrit text in Devanagari script and generate speech using your fine-tuned model.
""")
with gr.Row():
with gr.Column():
sanskrit_input = gr.Textbox(
label="Sanskrit Text",
placeholder="Enter Sanskrit text in Devanagari script...",
lines=3,
value="नमस्ते"
)
generate_btn = gr.Button("🎵 Generate Speech", variant="primary")
with gr.Column():
audio_output = gr.Audio(
label="Generated Sanskrit Speech",
type="numpy"
)
status_output = gr.Textbox(
label="Status",
lines=2,
interactive=False
)
# Example inputs for demonstration
gr.Examples(
examples=[
["नमस्ते"],
["संस्कृत एक प्राचीन भाषा है"],
["ॐ शान्ति शान्ति शान्तिः"],
["सर्वे भवन्तु सुखिनः"],
],
inputs=[sanskrit_input],
outputs=[audio_output, status_output],
fn=sanskrit_tts_inference,
cache_examples=False
)
# Connect interface components
generate_btn.click(
fn=sanskrit_tts_inference,
inputs=[sanskrit_input],
outputs=[audio_output, status_output]
)
# Launch the application
if __name__ == "__main__":
demo.launch(
share=True,
server_name="0.0.0.0",
server_port=7860,
show_error=True
)
```
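For readers tracing `redistribute_codes` above: every group of 7 generated codes forms one frame, split 1-2-4 across SNAC's three coarse-to-fine layers after removing a position-dependent offset of 4096 per slot. A dependency-free sketch of the same indexing (`split_frames` is an illustrative name, not part of the model's API):

```python
def split_frames(code_list):
    """Split flat 7-token frames into SNAC's three layers,
    removing the 4096 * slot offset from each code."""
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(code_list) // 7):
        frame = code_list[7 * i : 7 * i + 7]
        layer_1.append(frame[0])             # 1 code per frame (coarsest)
        layer_2.append(frame[1] - 4096)      # 2 codes per frame
        layer_3.append(frame[2] - 2 * 4096)  # 4 codes per frame (finest)
        layer_3.append(frame[3] - 3 * 4096)
        layer_2.append(frame[4] - 4 * 4096)
        layer_3.append(frame[5] - 5 * 4096)
        layer_3.append(frame[6] - 6 * 4096)
    return layer_1, layer_2, layer_3

# One synthetic frame: base code 10 stored with the 4096 * slot offset
print(split_frames([10 + 4096 * p for p in range(7)]))
```

The three returned lists correspond to the tensors passed to `snac_model.decode` in the full implementation.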
## 🔊 Demo Outputs
<table>
<tr>
<td><strong>यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।</strong><br/><em>Bhagavad Gita 4.7</em></td>
<td>
<audio controls>
<source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
</tr>
<tr>
<td><strong>🕉️ कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।</strong><br/><em>Bhagavad Gita 2.47</em></td>
<td>
<audio controls>
<source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
</tr>
<tr>
<td><strong>📚 विद्या ददाति विनयं</strong><br/><em>Subhashita</em></td>
<td>
<audio controls>
<source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/विद्या ददाति विनयं.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
</tr>
<tr>
<td><strong>🌟 तमसो मा ज्योतिर्गमय।</strong><br/><em>Brihadaranyaka Upanishad 1.3.28</em></td>
<td>
<audio controls>
<source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/तमसो मा ज्योतिर्गमय।.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
</tr>
</table>
## Model Information
**Developer:** R910
**License:** Apache 2.0
**Base Architecture:** Fine-tuned from unsloth/orpheus-3b-0.1-ft
This model was trained with Unsloth's efficient fine-tuning framework, which Unsloth reports as roughly 2x faster than standard implementations, in conjunction with Hugging Face's TRL (Transformer Reinforcement Learning) library.
## Citation
If you use this model or the training data, please cite:
```bibtex
@inproceedings{indictts,
title = {Building Open Sourced and Industry Grade Low-Resource {TTS} for {I}ndian Languages},
author = {ID Prakashraj and Abhayjeet Singh and Anusha Prakash and AV Anand Kumar and Shambavi Bhaskar
and Varun Srinivas and Vishal Sunder and Hema A Murthy and S Umesh},
booktitle = {Proc. Interspeech 2023},
year = {2023},
pages = {1009--1013},
doi = {10.21437/Interspeech.2023-1339}
}
```
Dataset source: [IIT Madras IndicTTS Database](https://www.iitm.ac.in/donlab/indictts/database)
## Acknowledgments
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
Special thanks to the [Speech and Language Technology Group, IIT Madras](https://www.iitm.ac.in/donlab/indictts/) for providing the Sanskrit TTS dataset.
## Technical Specifications
- **Model Type:** Fine-tuned Language Model for Text-to-Speech
- **Architecture:** LLaMA-based with LoRA adaptation
- **Audio Output:** 24kHz sampling rate
- **Maximum Sequence Length:** 2048 tokens
- **Supported Script:** Devanagari (Sanskrit)
- **Training Framework:** Unsloth + Hugging Face TRL
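The model returns audio as a float array at a 24 kHz sampling rate; outside a notebook it can be written to disk with the standard-library `wave` module. A minimal sketch, assuming mono float samples in [-1, 1] (`write_wav` is an illustrative helper, not part of the model's API):

```python
import math
import struct
import wave

def write_wav(path, samples, sample_rate=24000):
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono, matching the dataset's single speaker
        wf.setsampwidth(2)           # 16-bit PCM
        wf.setframerate(sample_rate)
        pcm = struct.pack(f"<{len(samples)}h",
                          *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
        wf.writeframes(pcm)

# Example: a 0.5 s, 440 Hz test tone at the model's 24 kHz output rate
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(12000)]
write_wav("test_tone.wav", tone)
```

In the Gradio app above, the same array is handed to `gr.Audio` directly as `(24000, audio_sample)`.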
## Usage Requirements
- **Hardware:** CUDA-compatible GPU
- **Dependencies:** PyTorch 2.4.1+, Transformers, SNAC audio codec
- **Python Version:** 3.8+ (required by PyTorch 2.4)