---
base_model: unsloth/orpheus-3b-0.1-ft
model_type: llama
library_name: transformers
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- sanskrit
- audio-generation
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- fine-tuned
- devanagari
language:
- en
- sa
datasets:
- IIT-Madras-IndicTTS
metrics: null
widget:
- text: यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।
  example_title: Bhagavad Gita 4.7
- text: कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।
  example_title: Bhagavad Gita 2.47
- text: विद्या ददाति विनयं
  example_title: Subhashita
- text: तमसो मा ज्योतिर्गमय।
  example_title: Brihadaranyaka Upanishad
model-index:
- name: Sanskrit TTS v2
  results:
  - task:
      type: text-to-speech
      name: Text-to-Speech
    dataset:
      type: IIT-Madras-IndicTTS
      name: IIT Madras IndicTTS Sanskrit (Mono Female)
    metrics:
    - type: audio_duration
      name: Training Audio Duration
      value: 10.93 hrs
---

[Open in Colab](https://colab.research.google.com/github/Rstar-910/SamskritaBharati/blob/main/Sanskrit_TTS_v2.ipynb)

# Sanskrit Text-to-Speech Model

## Model Overview

- **Model ID:** R910/Sanskrit_TTS_v2
- **Base Model:** unsloth/orpheus-3b-0.1-ft
- **Language:** Sanskrit (Devanagari script)
- **Primary Dataset:** [IIT Madras IndicTTS Sanskrit Database](https://www.iitm.ac.in/donlab/indictts/database)
- **Voice:** Mono Female
- **Training Audio Duration:** 10.93 hours

This LLaMA-based language model is fine-tuned for Sanskrit text-to-speech synthesis. Fine-tuning used Unsloth and Hugging Face's TRL library.

## Training Data

The model was trained on the **Sanskrit speech corpus** from the [IIT Madras IndicTTS Database](https://www.iitm.ac.in/donlab/indictts/database), using a **mono female voice** recording with a total audio duration of **10.93 hours**. The IndicTTS project, developed by the Speech and Language Technology Group at IIT Madras, provides high-quality speech corpora for Indic languages.

## Installation Requirements

### Environment Detection and Base Setup

```bash
# Environment detection
python3 -c "
import os
print('colab' if 'COLAB_' in ''.join(os.environ.keys()) else 'local')
"

# Install core dependencies
pip install snac
```

### Google Colab Installation

For Google Colab environments, run the following installation sequence:

```bash
# Install Colab-specific dependencies
pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
pip install sentencepiece protobuf 'datasets>=3.4.1,<4.0.0' huggingface_hub hf_transfer
pip install --no-deps unsloth

# Environment cleanup (recommended for clean installation)
pip uninstall torch torchvision torchaudio unsloth unsloth_zoo transformers -y
pip cache purge

# Install PyTorch with CUDA 12.1 support
pip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121

# Install latest Unsloth from source
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Additional dependencies
pip install librosa
pip install -U datasets
```

## Implementation Guide

### Complete Implementation Code

```python
import gradio as gr
import torch
from unsloth import FastLanguageModel

# Global model variables
model = None
tokenizer = None
snac_model = None
device = None

# Voice ID that the original inference code hard-codes into every prompt
DEFAULT_VOICE = "1070"


def load_models():
    """Initialize and load all required models for Sanskrit TTS inference."""
    global model, tokenizer, snac_model, device

    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Loading models on: {device}")

    # Load the fine-tuned Sanskrit TTS model
    model, tokenizer = FastLanguageModel.from_pretrained(
        "R910/Sanskrit_TTS_v2",
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=False,
    )
    model = model.to(device)
    FastLanguageModel.for_inference(model)

    # Load the SNAC codec for audio generation; fail early if it is missing
    try:
        from snac import SNAC
    except ImportError as exc:
        raise ImportError("SNAC is not installed. Run `pip install snac` first.") from exc
    snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
    snac_model = snac_model.to("cpu")  # decoding runs on the CPU

    print("Models loaded successfully!")


def redistribute_codes(code_list):
    """Redistribute generated codes into hierarchical layers for audio synthesis."""
    layer_1 = []
    layer_2 = []
    layer_3 = []
    # Each frame of 7 codes carries 1 + 2 + 4 codes for the three SNAC layers
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i+1] - 4096)
        layer_3.append(code_list[7*i+2] - (2*4096))
        layer_3.append(code_list[7*i+3] - (3*4096))
        layer_2.append(code_list[7*i+4] - (4*4096))
        layer_3.append(code_list[7*i+5] - (5*4096))
        layer_3.append(code_list[7*i+6] - (6*4096))

    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0),
    ]
    audio_hat = snac_model.decode(codes)
    return audio_hat


def sanskrit_tts_inference(sanskrit_text, chosen_voice=DEFAULT_VOICE):
    """
    Generate Sanskrit speech from input text using the fine-tuned model.

    Args:
        sanskrit_text (str): Input Sanskrit text in Devanagari script
        chosen_voice (str): Voice ID prepended to the prompt (optional)

    Returns:
        tuple: (audio_data, status_message)
    """
    if not sanskrit_text.strip():
        return None, "Please enter some Sanskrit text."

    try:
        prompts = [sanskrit_text]

        # Prepare prompts with voice selection
        prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

        # Tokenize input prompts
        all_input_ids = []
        for prompt in prompts_:
            input_ids = tokenizer(prompt, return_tensors="pt").input_ids
            all_input_ids.append(input_ids)

        # Define special tokens
        start_token = torch.tensor([[128259]], dtype=torch.int64)
        end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)

        # Construct modified input sequences
        all_modified_input_ids = []
        for input_ids in all_input_ids:
            modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
            all_modified_input_ids.append(modified_input_ids)

        # Apply left padding and create attention masks
        all_padded_tensors = []
        all_attention_masks = []
        max_length = max(ids.shape[1] for ids in all_modified_input_ids)

        for modified_input_ids in all_modified_input_ids:
            padding = max_length - modified_input_ids.shape[1]
            padded_tensor = torch.cat(
                [torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids],
                dim=1,
            )
            attention_mask = torch.cat(
                [
                    torch.zeros((1, padding), dtype=torch.int64),
                    torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64),
                ],
                dim=1,
            )
            all_padded_tensors.append(padded_tensor)
            all_attention_masks.append(attention_mask)

        # Batch tensors for inference
        all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
        all_attention_masks = torch.cat(all_attention_masks, dim=0)

        input_ids = all_padded_tensors.to(device)
        attention_mask = all_attention_masks.to(device)

        # Generate audio codes using the model
        generated_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1200,
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=128258,
            use_cache=True,
        )

        # Post-process generated tokens: keep what follows the last 128257 token
        token_to_find = 128257
        token_to_remove = 128258

        token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

        if len(token_indices[1]) > 0:
            last_occurrence_idx = token_indices[1][-1].item()
            cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
        else:
            cropped_tensor = generated_ids

        # Drop end-of-sequence tokens from every row
        processed_rows = []
        for row in cropped_tensor:
            masked_row = row[row != token_to_remove]
            processed_rows.append(masked_row)

        # Convert tokens to audio codes (plain Python ints, whole frames only)
        code_lists = []
        for row in processed_rows:
            row_length = row.size(0)
            new_length = (row_length // 7) * 7
            trimmed_row = (row[:new_length] - 128266).tolist()
            code_lists.append(trimmed_row)

        # Generate audio samples
        my_samples = []
        for code_list in code_lists:
            samples = redistribute_codes(code_list)
            my_samples.append(samples)

        if len(my_samples) > 0:
            audio_sample = my_samples[0].detach().squeeze().to("cpu").numpy()
            return (24000, audio_sample), f"✅ Generated audio for: {sanskrit_text}"
        else:
            return None, "❌ Failed to generate audio - no valid codes produced."

    except Exception as e:
        return None, f"❌ Error during inference: {str(e)}"


# Initialize models
print("Loading models... This may take a moment.")
load_models()

# Create Gradio interface
with gr.Blocks(title="Sanskrit Text-to-Speech") as demo:
    gr.Markdown("""
    # 🕉️ Sanskrit Text-to-Speech

    Enter Sanskrit text in Devanagari script and generate speech using your fine-tuned model.
    """)

    with gr.Row():
        with gr.Column():
            sanskrit_input = gr.Textbox(
                label="Sanskrit Text",
                placeholder="Enter Sanskrit text in Devanagari script...",
                lines=3,
                value="नमस्ते"
            )
            generate_btn = gr.Button("🎵 Generate Speech", variant="primary")

        with gr.Column():
            audio_output = gr.Audio(
                label="Generated Sanskrit Speech",
                type="numpy"
            )
            status_output = gr.Textbox(
                label="Status",
                lines=2,
                interactive=False
            )

    # Example inputs for demonstration
    gr.Examples(
        examples=[
            ["नमस्ते"],
            ["संस्कृत एक प्राचीन भाषा है"],
            ["ॐ शान्ति शान्ति शान्तिः"],
            ["सर्वे भवन्तु सुखिनः"],
        ],
        inputs=[sanskrit_input],
        outputs=[audio_output, status_output],
        fn=sanskrit_tts_inference,
        cache_examples=False
    )

    # Connect interface components
    generate_btn.click(
        fn=sanskrit_tts_inference,
        inputs=[sanskrit_input],
        outputs=[audio_output, status_output]
    )

# Launch the application
if __name__ == "__main__":
    demo.launch(
        share=True,
        server_name="0.0.0.0",
        server_port=7860,
        show_error=True
    )
```

## 🔊 Demo Outputs

| Sanskrit text | Source |
|---|---|
| यदा यदा हि धर्मस्य ग्लानिर्भवति भारत। | Bhagavad Gita 4.7 |
| 🕉️ कर्मण्येवाधिकारस्ते मा फलेषु कदाचन। | Bhagavad Gita 2.47 |
| 📚 विद्या ददाति विनयं | Subhashita |
| 🌟 तमसो मा ज्योतिर्गमय। | Brihadaranyaka Upanishad 1.3.28 |

Trained with [Unsloth](https://github.com/unslothai/unsloth).

Special thanks to the [Speech and Language Technology Group, IIT Madras](https://www.iitm.ac.in/donlab/indictts/) for providing the Sanskrit TTS dataset.
## Technical Specifications
- **Model Type:** Fine-tuned Language Model for Text-to-Speech
- **Architecture:** LLaMA-based with LoRA adaptation
- **Audio Output:** 24kHz sampling rate
- **Maximum Sequence Length:** 2048 tokens
- **Supported Script:** Devanagari (Sanskrit)
- **Training Framework:** Unsloth + Hugging Face TRL
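
Since the model produces 24 kHz mono audio, saving a result to disk comes down to writing PCM samples at that rate. The following is a minimal sketch using only the Python standard library. It assumes the samples are floats in the range [-1.0, 1.0]. The helper name `write_wav` and the output file name are hypothetical, and a synthetic tone stands in for real model output.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24000  # the 24 kHz rate listed in the specifications


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; returns the number of frames written.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # avoid int16 overflow
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2


# Stand-in signal: one second of a 440 Hz tone instead of generated speech.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
wav_path = os.path.join(tempfile.gettempdir(), "sanskrit_tts_demo.wav")
num_frames = write_wav(wav_path, tone)
print(f"Wrote {num_frames} frames to {wav_path}")
```

A one-dimensional NumPy array of float samples can be passed as `samples` directly, since the helper only iterates over it and converts each element with `float()`.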
## Usage Requirements
- **Hardware:** CUDA-compatible GPU
- **Dependencies:** PyTorch 2.4.1+, Transformers, SNAC audio codec
- **Python Version:** 3.8+ (PyTorch 2.4.1 does not support Python 3.7)
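
The requirements above can be checked before loading the model. The sketch below is one possible pre-flight check, not part of the original notebook. The function name `check_environment` and the package list are assumptions chosen for illustration. It reports the Python version, whether each package can be found, and whether CUDA is visible to PyTorch, without raising if something is missing.

```python
import importlib.util
import sys

# Assumed import names for the dependencies listed above.
REQUIRED_PACKAGES = ("torch", "transformers", "snac")


def check_environment(packages=REQUIRED_PACKAGES):
    """Return a report of the current environment without raising on gaps."""
    report = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "packages": {
            name: importlib.util.find_spec(name) is not None for name in packages
        },
        "cuda": False,
    }
    # Only query CUDA when torch was found, so the check works without it.
    if report["packages"].get("torch"):
        try:
            import torch
            report["cuda"] = torch.cuda.is_available()
        except Exception:
            report["cuda"] = False  # a broken torch install counts as no CUDA
    return report


env = check_environment()
print(f"Python {env['python']}")
for name, found in env["packages"].items():
    print(f"{name}: {'found' if found else 'missing'}")
print(f"CUDA available: {env['cuda']}")
```

If CUDA is reported as unavailable, the implementation code above still selects `"cpu"` as the device, though generation with a 3B-parameter model is likely to be slow there.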