---
base_model: unsloth/orpheus-3b-0.1-ft
model_type: llama
library_name: transformers
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- sanskrit
- audio-generation
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- fine-tuned
- devanagari
language:
- en
- sa
datasets:
- IIT-Madras-IndicTTS
metrics: null
widget:
- text: यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।
  example_title: Bhagavad Gita 4.7
- text: कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।
  example_title: Bhagavad Gita 2.47
- text: विद्या ददाति विनयं
  example_title: Subhashita
- text: तमसो मा ज्योतिर्गमय।
  example_title: Brihadaranyaka Upanishad
model-index:
- name: Sanskrit TTS v2
  results:
  - task:
      type: text-to-speech
      name: Text-to-Speech
    dataset:
      type: IIT-Madras-IndicTTS
      name: IIT Madras IndicTTS Sanskrit (Mono Female)
    metrics:
    - type: audio_duration
      name: Training Audio Duration
      value: 10.93 hrs
---
[Open in Colab](https://colab.research.google.com/github/Rstar-910/SamskritaBharati/blob/main/Sanskrit_TTS_v2.ipynb)

# Sanskrit Text-to-Speech Model

## Model Overview

- **Model ID:** R910/Sanskrit_TTS_v2
- **Base Model:** unsloth/orpheus-3b-0.1-ft
- **Language:** Sanskrit (Devanagari script)
- **Primary Dataset:** [IIT Madras IndicTTS Sanskrit Database](https://www.iitm.ac.in/donlab/indictts/database)
- **Voice:** Mono Female
- **Training Audio Duration:** 10.93 hours

This is a LLaMA-based language model fine-tuned for Sanskrit text-to-speech synthesis. It was trained with Unsloth and Hugging Face's TRL library to speed up fine-tuning.

## Training Data

The model was trained on the **Sanskrit speech corpus** from the [IIT Madras IndicTTS Database](https://www.iitm.ac.in/donlab/indictts/database), using the **mono female voice** recordings, with a total audio duration of **10.93 hours**. The IndicTTS project, developed by the Speech and Language Technology Group at IIT Madras, provides high-quality speech corpora for Indic languages.

## Installation Requirements

### Environment Detection and Base Setup

```bash
# Environment detection
python3 -c "
import os
print('colab' if 'COLAB_' in ''.join(os.environ.keys()) else 'local')
"

# Install core dependencies
pip install snac
```

### Google Colab Installation

For Google Colab environments, run the following installation sequence:

```bash
# Install Colab-specific dependencies
pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
pip install sentencepiece protobuf 'datasets>=3.4.1,<4.0.0' huggingface_hub hf_transfer
pip install --no-deps unsloth

# Environment cleanup (recommended for clean installation)
pip uninstall torch torchvision torchaudio unsloth unsloth_zoo transformers -y
pip cache purge

# Install PyTorch with CUDA 12.1 support
pip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121

# Install latest Unsloth from source
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Additional dependencies
pip install librosa
pip install -U datasets
```
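
After running the commands above, it can help to confirm which versions actually ended up installed, since the sequence uninstalls and reinstalls several packages. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the `PACKAGES` list are illustrative choices, not part of this model's tooling.

```python
from importlib import metadata

# Distribution names that appear in the pip commands above
PACKAGES = ["torch", "torchaudio", "unsloth", "transformers", "trl", "peft", "snac", "librosa"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print a small report; missing packages are flagged instead of raising
for name, version in installed_versions(PACKAGES).items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be installed before running the implementation code.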

## Implementation Guide

### Complete Implementation Code

```python
import gradio as gr
import torch
from unsloth import FastLanguageModel

# Global model variables
model = None
tokenizer = None
snac_model = None
device = None


def load_models():
    """Initialize and load all required models for Sanskrit TTS inference."""
    global model, tokenizer, snac_model, device

    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Loading models on: {device}")

    # Load the fine-tuned Sanskrit TTS model
    model, tokenizer = FastLanguageModel.from_pretrained(
        "R910/Sanskrit_TTS_v2",
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=False,
    )

    model = model.to(device)
    FastLanguageModel.for_inference(model)

    # Load the SNAC codec that decodes generated codes into audio
    try:
        from snac import SNAC
    except ImportError as exc:
        raise ImportError("SNAC is not installed. Run `pip install snac` first.") from exc

    # The codec is kept on the CPU
    snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
    snac_model.to("cpu")

    print("Models loaded successfully!")


def redistribute_codes(code_list):
    """Redistribute generated codes into hierarchical layers for audio synthesis."""
    layer_1 = []
    layer_2 = []
    layer_3 = []

    # Each frame is 7 codes; position k in a frame carries an offset of k * 4096
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i+1] - 4096)
        layer_3.append(code_list[7*i+2] - (2*4096))
        layer_3.append(code_list[7*i+3] - (3*4096))
        layer_2.append(code_list[7*i+4] - (4*4096))
        layer_3.append(code_list[7*i+5] - (5*4096))
        layer_3.append(code_list[7*i+6] - (6*4096))

    codes = [torch.tensor(layer_1).unsqueeze(0),
             torch.tensor(layer_2).unsqueeze(0),
             torch.tensor(layer_3).unsqueeze(0)]

    audio_hat = snac_model.decode(codes)
    return audio_hat


def sanskrit_tts_inference(sanskrit_text, chosen_voice="1070"):
    """
    Generate Sanskrit speech from input text using the fine-tuned model.

    Args:
        sanskrit_text (str): Input Sanskrit text in Devanagari script
        chosen_voice (str): Voice prefix prepended to the prompt (optional);
            the default "1070" is the value this card hardcodes

    Returns:
        tuple: (audio_data, status_message)
    """
    if not sanskrit_text.strip():
        return None, "Please enter some Sanskrit text."

    try:
        prompts = [sanskrit_text]

        # Prepare prompts with voice selection
        prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

        # Tokenize input prompts
        all_input_ids = []
        for prompt in prompts_:
            input_ids = tokenizer(prompt, return_tensors="pt").input_ids
            all_input_ids.append(input_ids)

        # Define special tokens
        start_token = torch.tensor([[128259]], dtype=torch.int64)  # start of human turn
        end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)  # end of text, end of human turn

        # Construct modified input sequences
        all_modified_input_ids = []
        for input_ids in all_input_ids:
            modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
            all_modified_input_ids.append(modified_input_ids)

        # Apply left padding (pad token 128263) and create attention masks
        all_padded_tensors = []
        all_attention_masks = []
        max_length = max(ids.shape[1] for ids in all_modified_input_ids)

        for modified_input_ids in all_modified_input_ids:
            padding = max_length - modified_input_ids.shape[1]
            padded_tensor = torch.cat(
                [torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1
            )
            attention_mask = torch.cat(
                [torch.zeros((1, padding), dtype=torch.int64),
                 torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1
            )
            all_padded_tensors.append(padded_tensor)
            all_attention_masks.append(attention_mask)

        # Batch tensors for inference
        all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
        all_attention_masks = torch.cat(all_attention_masks, dim=0)

        input_ids = all_padded_tensors.to(device)
        attention_mask = all_attention_masks.to(device)

        # Generate audio codes using the model
        generated_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1200,
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=128258,
            use_cache=True
        )

        # Post-process generated tokens
        token_to_find = 128257    # start of speech
        token_to_remove = 128258  # end of speech

        token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

        # Keep only what follows the last start-of-speech token
        if len(token_indices[1]) > 0:
            last_occurrence_idx = token_indices[1][-1].item()
            cropped_tensor = generated_ids[:, last_occurrence_idx + 1:]
        else:
            cropped_tensor = generated_ids

        processed_rows = []
        for row in cropped_tensor:
            masked_row = row[row != token_to_remove]
            processed_rows.append(masked_row)

        # Convert tokens to audio codes
        code_lists = []
        for row in processed_rows:
            row_length = row.size(0)
            new_length = (row_length // 7) * 7  # drop any trailing partial frame
            trimmed_row = row[:new_length]
            # Shift token ids down to code values and convert to plain Python ints
            code_lists.append((trimmed_row - 128266).tolist())

        # Generate audio samples, skipping rows that produced no complete frame
        my_samples = []
        for code_list in code_lists:
            if code_list:
                samples = redistribute_codes(code_list)
                my_samples.append(samples)

        if len(my_samples) > 0:
            audio_sample = my_samples[0].detach().squeeze().to("cpu").numpy()
            return (24000, audio_sample), f"✅ Generated audio for: {sanskrit_text}"
        else:
            return None, "❌ Failed to generate audio - no valid codes produced."

    except Exception as e:
        return None, f"❌ Error during inference: {str(e)}"


# Initialize models
print("Loading models... This may take a moment.")
load_models()

# Create Gradio interface
with gr.Blocks(title="Sanskrit Text-to-Speech") as demo:
    gr.Markdown("""
    # 🕉️ Sanskrit Text-to-Speech

    Enter Sanskrit text in Devanagari script and generate speech using your fine-tuned model.
    """)

    with gr.Row():
        with gr.Column():
            sanskrit_input = gr.Textbox(
                label="Sanskrit Text",
                placeholder="Enter Sanskrit text in Devanagari script...",
                lines=3,
                value="नमस्ते"
            )

            generate_btn = gr.Button("🎵 Generate Speech", variant="primary")

        with gr.Column():
            audio_output = gr.Audio(
                label="Generated Sanskrit Speech",
                type="numpy"
            )

            status_output = gr.Textbox(
                label="Status",
                lines=2,
                interactive=False
            )

    # Example inputs for demonstration
    gr.Examples(
        examples=[
            ["नमस्ते"],
            ["संस्कृत एक प्राचीन भाषा है"],
            ["ॐ शान्ति शान्ति शान्तिः"],
            ["सर्वे भवन्तु सुखिनः"],
        ],
        inputs=[sanskrit_input],
        outputs=[audio_output, status_output],
        fn=sanskrit_tts_inference,
        cache_examples=False
    )

    # Connect interface components
    generate_btn.click(
        fn=sanskrit_tts_inference,
        inputs=[sanskrit_input],
        outputs=[audio_output, status_output]
    )

# Launch the application
if __name__ == "__main__":
    demo.launch(
        share=True,
        server_name="0.0.0.0",
        server_port=7860,
        show_error=True
    )
```
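
The interleaving that `redistribute_codes` undoes can be illustrated without PyTorch or SNAC. The sketch below is a plain-Python illustration, not the model's own code: `split_frames`, `FRAME_SIZE`, and `CODEBOOK_SIZE` are names introduced here, and it assumes the same layout as the code above, namely seven codes per frame with the code at position k offset by k × 4096.

```python
FRAME_SIZE = 7        # codes per audio frame, as in redistribute_codes
CODEBOOK_SIZE = 4096  # offset step used in redistribute_codes


def split_frames(code_list):
    """Split a flat list of offset codes into three layers of codebook indices."""
    layer_1, layer_2, layer_3 = [], [], []
    # Walk over complete frames only; a trailing partial frame is ignored
    for start in range(0, len(code_list) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = code_list[start:start + FRAME_SIZE]
        # Remove the per-position offset k * CODEBOOK_SIZE
        c = [code - k * CODEBOOK_SIZE for k, code in enumerate(frame)]
        layer_1.append(c[0])                      # 1 code per frame
        layer_2.extend([c[1], c[4]])              # 2 codes per frame
        layer_3.extend([c[2], c[3], c[5], c[6]])  # 4 codes per frame
    return layer_1, layer_2, layer_3


# One synthetic frame whose de-offset codes are 10, 11, ..., 16
frame = [10 + k + k * CODEBOOK_SIZE for k in range(FRAME_SIZE)]
print(split_frames(frame + [123]))  # → ([10], [11, 14], [12, 13, 15, 16])
```

Each frame therefore contributes one code to the first layer, two to the second, and four to the third, which is why generated sequences are trimmed to a multiple of seven before decoding.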

## 🔊 Demo Outputs

<table>
  <tr>
    <td><strong>यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।</strong><br/><em>Bhagavad Gita 4.7</em></td>
    <td>
      <audio controls>
        <source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
  </tr>
  <tr>
    <td><strong>🕉️ कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।</strong><br/><em>Bhagavad Gita 2.47</em></td>
    <td>
      <audio controls>
        <source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
  </tr>
  <tr>
    <td><strong>📚 विद्या ददाति विनयं</strong><br/><em>Subhashita</em></td>
    <td>
      <audio controls>
        <source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/विद्या ददाति विनयं.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
  </tr>
  <tr>
    <td><strong>🌟 तमसो मा ज्योतिर्गमय।</strong><br/><em>Brihadaranyaka Upanishad 1.3.28</em></td>
    <td>
      <audio controls>
        <source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/तमसो मा ज्योतिर्गमय।.wav" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
  </tr>
</table>

## Model Information

- **Developer:** R910
- **License:** Apache 2.0
- **Base Architecture:** Fine-tuned from unsloth/orpheus-3b-0.1-ft

This model was fine-tuned with Unsloth, which advertises roughly 2x faster training than standard implementations, together with Hugging Face's TRL (Transformer Reinforcement Learning) library.

## Citation

If you use this model or the training data, please cite:

```bibtex
@inproceedings{indictts,
  title     = {Building Open Sourced and Industry Grade Low-Resource {TTS} for {I}ndian Languages},
  author    = {ID Prakashraj and Abhayjeet Singh and Anusha Prakash and AV Anand Kumar and Shambavi Bhaskar
               and Varun Srinivas and Vishal Sunder and Hema A Murthy and S Umesh},
  booktitle = {Proc. Interspeech 2023},
  year      = {2023},
  pages     = {1009--1013},
  doi       = {10.21437/Interspeech.2023-1339}
}
```

Dataset source: [IIT Madras IndicTTS Database](https://www.iitm.ac.in/donlab/indictts/database)

## Acknowledgments

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

Special thanks to the [Speech and Language Technology Group, IIT Madras](https://www.iitm.ac.in/donlab/indictts/) for providing the Sanskrit TTS dataset.

## Technical Specifications

- **Model Type:** Fine-tuned Language Model for Text-to-Speech
- **Architecture:** LLaMA-based with LoRA adaptation
- **Audio Output:** 24kHz sampling rate
- **Maximum Sequence Length:** 2048 tokens
- **Supported Script:** Devanagari (Sanskrit)
- **Training Framework:** Unsloth + Hugging Face TRL
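
To keep generated speech outside the Gradio interface, the samples can be written to a WAV file at the 24 kHz rate listed above. The following is a minimal standard-library sketch, not part of the model's code: `write_wav` is a hypothetical helper, and it assumes the audio is a mono sequence of floats in the range [-1.0, 1.0]. A synthetic tone stands in for real model output.

```python
import math
import struct
import wave

SAMPLE_RATE = 24000  # matches the 24 kHz output rate listed above


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    pcm = bytearray()
    for s in samples:
        # Clip out-of-range values, then scale to signed 16-bit integers
        clipped = max(-1.0, min(1.0, float(s)))
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))


# Stand-in signal: one second of a 440 Hz tone instead of real model output
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
write_wav("example.wav", tone)
```

With real output, the NumPy array returned by the inference function could be passed in place of `tone`, provided it is one-dimensional.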

## Usage Requirements

- **Hardware:** CUDA-compatible GPU
- **Dependencies:** PyTorch 2.4.1+, Transformers, SNAC audio codec
- **Python Version:** 3.8+ (PyTorch 2.4.1 does not support Python 3.7)