---
base_model: unsloth/orpheus-3b-0.1-ft
model_type: llama
library_name: transformers
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- sanskrit
- audio-generation
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- fine-tuned
- devanagari
language:
- en
- sa
datasets:
- IIT-Madras-IndicTTS
metrics: null
widget:
- text: यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।
  example_title: Bhagavad Gita 4.7
- text: कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।
  example_title: Bhagavad Gita 2.47
- text: विद्या ददाति विनयं
  example_title: Subhashita
- text: तमसो मा ज्योतिर्गमय।
  example_title: Brihadaranyaka Upanishad
model-index:
- name: Sanskrit TTS v2
  results:
  - task:
      type: text-to-speech
      name: Text-to-Speech
    dataset:
      type: IIT-Madras-IndicTTS
      name: IIT Madras IndicTTS Sanskrit (Mono Female)
    metrics:
    - type: audio_duration
      name: Training Audio Duration
      value: 10.93 hrs
---
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Rstar-910/SamskritaBharati/blob/main/Sanskrit_TTS_v2.ipynb)

# Sanskrit Text-to-Speech Model
## Model Overview

**Model ID:** R910/Sanskrit_TTS_v2  
**Base Model:** unsloth/orpheus-3b-0.1-ft  
**Language:** Sanskrit (Devanagari script)  
**Primary Dataset:** [IIT Madras IndicTTS Sanskrit Database](https://www.iitm.ac.in/donlab/indictts/database)  
**Voice:** Mono Female  
**Training Audio Duration:** 10.93 hours  

This is a LLaMA-based language model fine-tuned for Sanskrit text-to-speech synthesis. Fine-tuning was done with Unsloth and Hugging Face's TRL library to improve training efficiency.

## Training Data
The model was trained on the **Sanskrit speech corpus** from the [IIT Madras IndicTTS Database](https://www.iitm.ac.in/donlab/indictts/database), using a **mono female voice** recording with a total audio duration of **10.93 hours**. The IndicTTS project, developed by the Speech and Language Technology Group at IIT Madras, provides high-quality speech corpora for Indic languages.
## Installation Requirements
### Environment Detection and Base Setup
```bash
# Environment detection
python3 -c "
import os
print('colab' if 'COLAB_' in ''.join(os.environ.keys()) else 'local')
"

# Install core dependencies
pip install snac
```
### Google Colab Installation
For Google Colab environments, execute the following installation sequence:
```bash
# Install Colab-specific dependencies
pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
pip install sentencepiece protobuf 'datasets>=3.4.1,<4.0.0' huggingface_hub hf_transfer
pip install --no-deps unsloth
# Environment cleanup (recommended for clean installation)
pip uninstall torch torchvision torchaudio unsloth unsloth_zoo transformers -y
pip cache purge

# Install PyTorch with CUDA 12.1 support
pip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121

# Install latest Unsloth from source
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Additional dependencies
pip install librosa
pip install -U datasets
```
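
After running the commands above, it can help to confirm that the packages are visible to the interpreter before loading the model. The following is a minimal sketch, not part of the original setup: `missing_packages` is a hypothetical helper, and the import names in `REQUIRED` are assumed from the `pip install` lines above.

```python
from importlib.util import find_spec

# Import names assumed from the pip commands in this section.
REQUIRED = ["torch", "unsloth", "snac", "transformers", "trl", "peft", "librosa", "datasets"]


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be located.

    find_spec() only looks the package up; it does not import it,
    so this check is fast and does not initialize CUDA.
    """
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All required packages were found.")
```

If anything is reported missing, rerun the matching `pip install` line before continuing.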
## Implementation Guide
### Complete Implementation Code
```python
import gradio as gr
import torch
from unsloth import FastLanguageModel

# Global model variables
model = None
tokenizer = None
snac_model = None
device = None
def load_models():
    """Initialize and load all required models for Sanskrit TTS inference."""
    global model, tokenizer, snac_model, device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Loading models on: {device}")
    
    # Load the fine-tuned Sanskrit TTS model
    model, tokenizer = FastLanguageModel.from_pretrained(
        "R910/Sanskrit_TTS_v2",  
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=False,
    )
    
    model = model.to(device)
    FastLanguageModel.for_inference(model)
    
    # Load the SNAC codec, which decodes generated audio codes into a waveform
    try:
        from snac import SNAC
    except ImportError as exc:
        # Fail early: inference cannot run without the codec
        raise ImportError("SNAC is not installed. Run `pip install snac` first.") from exc
    snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
    snac_model.to("cpu")
    
    print("Models loaded successfully!")
def redistribute_codes(code_list):
    """Redistribute generated codes into hierarchical layers for audio synthesis."""
    layer_1 = []
    layer_2 = []
    layer_3 = []
    
    for i in range((len(code_list)+1)//7):
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i+1]-4096)
        layer_3.append(code_list[7*i+2]-(2*4096))
        layer_3.append(code_list[7*i+3]-(3*4096))
        layer_2.append(code_list[7*i+4]-(4*4096))
        layer_3.append(code_list[7*i+5]-(5*4096))
        layer_3.append(code_list[7*i+6]-(6*4096))
    
    codes = [torch.tensor(layer_1).unsqueeze(0),
             torch.tensor(layer_2).unsqueeze(0),
             torch.tensor(layer_3).unsqueeze(0)]
    
    audio_hat = snac_model.decode(codes)
    return audio_hat
def sanskrit_tts_inference(sanskrit_text, chosen_voice=""):
    """
    Generate Sanskrit speech from input text using the fine-tuned model.
    
    Args:
        sanskrit_text (str): Input Sanskrit text in Devanagari script
        chosen_voice (str): Voice selection parameter (optional)
    
    Returns:
        tuple: (audio_data, status_message)
    """
    if not sanskrit_text.strip():
        return None, "Please enter some Sanskrit text."
    
    try:
        prompts = [sanskrit_text]
        # Fall back to the voice prefix hard-coded in this example when none is given
        chosen_voice = chosen_voice or 1070
        
        # Prepare prompts with voice selection
        prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
        
        # Tokenize input prompts
        all_input_ids = []
        for prompt in prompts_:
            input_ids = tokenizer(prompt, return_tensors="pt").input_ids
            all_input_ids.append(input_ids)
        
        # Define special tokens
        start_token = torch.tensor([[ 128259]], dtype=torch.int64)
        end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) 
        
        # Construct modified input sequences
        all_modified_input_ids = []
        for input_ids in all_input_ids:
            modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) 
            all_modified_input_ids.append(modified_input_ids)
        
        # Apply padding and create attention masks
        all_padded_tensors = []
        all_attention_masks = []
        max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
        
        for modified_input_ids in all_modified_input_ids:
            padding = max_length - modified_input_ids.shape[1]
            padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
            attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
            all_padded_tensors.append(padded_tensor)
            all_attention_masks.append(attention_mask)
        
        # Batch tensors for inference
        all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
        all_attention_masks = torch.cat(all_attention_masks, dim=0)
        
        input_ids = all_padded_tensors.to(device)
        attention_mask = all_attention_masks.to(device)
        
        # Generate audio codes using the model
        generated_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1200,
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=128258,
            use_cache=True
        )
        
        # Post-process generated tokens
        token_to_find = 128257
        token_to_remove = 128258
        
        token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
        
        if len(token_indices[1]) > 0:
            last_occurrence_idx = token_indices[1][-1].item()
            cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
        else:
            cropped_tensor = generated_ids
        
        # Remove the end-of-speech token from each row
        
        processed_rows = []
        for row in cropped_tensor:
            masked_row = row[row != token_to_remove]
            processed_rows.append(masked_row)
        
        # Convert tokens to audio codes
        code_lists = []
        for row in processed_rows:
            row_length = row.size(0)
            new_length = (row_length // 7) * 7
            trimmed_row = row[:new_length]
            # Shift token IDs into the audio-code range and convert to plain Python ints
            trimmed_row = [t - 128266 for t in trimmed_row.tolist()]
            code_lists.append(trimmed_row)
        
        # Generate audio samples
        my_samples = []
        for code_list in code_lists:
            samples = redistribute_codes(code_list)
            my_samples.append(samples)
        
        if len(my_samples) > 0:
            audio_sample = my_samples[0].detach().squeeze().to("cpu").numpy()
            return (24000, audio_sample), f"✅ Generated audio for: {sanskrit_text}"
        else:
            return None, "❌ Failed to generate audio - no valid codes produced."
            
    except Exception as e:
        return None, f"❌ Error during inference: {str(e)}"
# Initialize models
print("Loading models... This may take a moment.")
load_models()
# Create Gradio interface
with gr.Blocks(title="Sanskrit Text-to-Speech") as demo:
    gr.Markdown("""
    # 🕉️ Sanskrit Text-to-Speech
    
    Enter Sanskrit text in Devanagari script and generate speech using the fine-tuned model.
    """)
    
    with gr.Row():
        with gr.Column():
            sanskrit_input = gr.Textbox(
                label="Sanskrit Text",
                placeholder="Enter Sanskrit text in Devanagari script...",
                lines=3,
                value="नमस्ते" 
            )
            
            generate_btn = gr.Button("🎵 Generate Speech", variant="primary")
        
        with gr.Column():
            audio_output = gr.Audio(
                label="Generated Sanskrit Speech",
                type="numpy"
            )
            
            status_output = gr.Textbox(
                label="Status",
                lines=2,
                interactive=False
            )
    
    # Example inputs for demonstration
    gr.Examples(
        examples=[
            ["नमस्ते"],
            ["संस्कृत एक प्राचीन भाषा है"],
            ["ॐ शान्ति शान्ति शान्तिः"],
            ["सर्वे भवन्तु सुखिनः"],
        ],
        inputs=[sanskrit_input],
        outputs=[audio_output, status_output],
        fn=sanskrit_tts_inference,
        cache_examples=False
    )
    
    # Connect interface components
    generate_btn.click(
        fn=sanskrit_tts_inference,
        inputs=[sanskrit_input],
        outputs=[audio_output, status_output]
    )
# Launch the application
if __name__ == "__main__":
    demo.launch(
        share=True,  
        server_name="0.0.0.0",  
        server_port=7860,
        show_error=True
    )
```
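
The `redistribute_codes` function above unpacks each group of seven generated codes into three layers for the SNAC decoder. The sketch below reproduces only that index arithmetic with plain Python lists so the layout can be checked without a GPU or the codec. `split_frames` is a hypothetical name used for illustration, and dropping an incomplete trailing frame is an assumption that mirrors the trimming step in `sanskrit_tts_inference`.

```python
CODEBOOK_SIZE = 4096  # offset step used in redistribute_codes


def split_frames(code_list):
    """Split a flat list of offset codes into three layers of plain ints.

    Position p within a 7-code frame carries an offset of p * 4096.
    Positions 0 -> layer 1, positions 1 and 4 -> layer 2,
    positions 2, 3, 5 and 6 -> layer 3.
    """
    layer_1, layer_2, layer_3 = [], [], []
    usable = len(code_list) - len(code_list) % 7  # ignore an incomplete last frame
    for start in range(0, usable, 7):
        frame = code_list[start:start + 7]
        # Remove the per-position offset so every code falls in [0, 4096)
        c = [code - pos * CODEBOOK_SIZE for pos, code in enumerate(frame)]
        layer_1.append(c[0])
        layer_2.extend([c[1], c[4]])
        layer_3.extend([c[2], c[3], c[5], c[6]])
    return layer_1, layer_2, layer_3


# One frame whose codes, after removing offsets, are 10, 20, ..., 70
frame = [10 + 0 * 4096, 20 + 1 * 4096, 30 + 2 * 4096, 40 + 3 * 4096,
         50 + 4 * 4096, 60 + 5 * 4096, 70 + 6 * 4096]
print(split_frames(frame))  # ([10], [20, 50], [30, 40, 60, 70])
```

Each frame therefore contributes one code to the first layer, two to the second, and four to the third, which is why generated rows are trimmed to a multiple of seven before decoding.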
## 🔊 Demo Outputs
<table>
  <tr>
    <td><strong>यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।</strong><br/><em>Bhagavad Gita 4.7</em></td>
    <td>
      <audio controls>
        <source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
  </tr>
  <tr>
    <td><strong>🕉️ कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।</strong><br/><em>Bhagavad Gita 2.47</em></td>
    <td>
      <audio controls>
        <source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
  </tr>
  <tr>
    <td><strong>📚 विद्या ददाति विनयं</strong><br/><em>Subhashita</em></td>
    <td>
      <audio controls>
        <source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/विद्या ददाति विनयं.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
  </tr>
  <tr>
    <td><strong>🌟 तमसो मा ज्योतिर्गमय।</strong><br/><em>Brihadaranyaka Upanishad 1.3.28</em></td>
    <td>
      <audio controls>
        <source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/तमसो मा ज्योतिर्गमय।.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
  </tr>
</table>
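
Clips like the ones above can be saved to disk from the inference output. The sketch below uses only the Python standard library. `write_wav` is a hypothetical helper, not part of the model repository. It assumes `samples` is the one-dimensional float waveform that `sanskrit_tts_inference` returns alongside the 24000 Hz sample rate, and that its values lie roughly within [-1, 1]; anything outside that range is clipped.

```python
import struct
import wave


def write_wav(path, samples, sample_rate=24000):
    """Write mono float samples to a 16-bit PCM WAV file.

    Returns the number of audio frames written.
    """
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))     # clip to the valid range
        pcm += struct.pack("<h", int(s * 32767))  # little-endian signed 16-bit
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return len(pcm) // 2
```

A possible call, given `(audio, status) = sanskrit_tts_inference("नमस्ते")`, would be `write_wav("namaste.wav", audio[1], audio[0])` when `audio` is not `None`.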


## Model Information

**Developer:** R910  
**License:** Apache 2.0  
**Base Architecture:** Fine-tuned from unsloth/orpheus-3b-0.1-ft  

This model was trained with Unsloth's training framework, which Unsloth advertises as roughly 2x faster than standard implementations, together with Hugging Face's TRL (Transformer Reinforcement Learning) library.

## Citation

If you use this model or the training data, please cite:

```bibtex
@inproceedings{indictts,
  title     = {Building Open Sourced and Industry Grade Low-Resource {TTS} for {I}ndian Languages},
  author    = {ID Prakashraj and Abhayjeet Singh and Anusha Prakash and AV Anand Kumar and Shambavi Bhaskar
               and Varun Srinivas and Vishal Sunder and Hema A Murthy and S Umesh},
  booktitle = {Proc. Interspeech 2023},
  year      = {2023},
  pages     = {1009--1013},
  doi       = {10.21437/Interspeech.2023-1339}
}
```

Dataset source: [IIT Madras IndicTTS Database](https://www.iitm.ac.in/donlab/indictts/database)

## Acknowledgments

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

Special thanks to the [Speech and Language Technology Group, IIT Madras](https://www.iitm.ac.in/donlab/indictts/) for providing the Sanskrit TTS dataset.

## Technical Specifications

- **Model Type:** Fine-tuned Language Model for Text-to-Speech
- **Architecture:** LLaMA-based with LoRA adaptation
- **Audio Output:** 24kHz sampling rate
- **Maximum Sequence Length:** 2048 tokens
- **Supported Script:** Devanagari (Sanskrit)
- **Training Framework:** Unsloth + Hugging Face TRL
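
These figures give a rough upper bound on clip length per generation call. The sketch below is an estimate under stated assumptions, not a measured property of the model: it takes the 7 codes per frame and `max_new_tokens=1200` from the inference code, the 24 kHz sampling rate from the list above, and assumes that one frame of the 24 kHz SNAC codec spans 2048 output samples. `estimated_seconds` is a hypothetical helper.

```python
TOKENS_PER_FRAME = 7      # inference code trims rows to multiples of 7
SAMPLE_RATE = 24000       # Hz, from the specifications above
SAMPLES_PER_FRAME = 2048  # ASSUMPTION: samples covered by one 7-code frame


def estimated_seconds(num_audio_tokens):
    """Estimate audio duration for a given number of generated audio tokens."""
    frames = num_audio_tokens // TOKENS_PER_FRAME  # incomplete frames are discarded
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE


# With max_new_tokens=1200 as in the inference code
print(f"{estimated_seconds(1200):.1f} s")
```

Under these assumptions a single call yields at most about 14.6 seconds of audio, so longer passages would need to be split into shorter segments and synthesized separately.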

## Usage Requirements

- **Hardware:** CUDA-compatible GPU 
- **Dependencies:** PyTorch 2.4.1+, Transformers, SNAC audio codec
- **Python Version:** 3.8 or newer (PyTorch 2.4.1 does not support Python 3.7)