---
base_model: unsloth/orpheus-3b-0.1-ft
model_type: llama
library_name: transformers
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- sanskrit
- audio-generation
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- fine-tuned
- devanagari
language:
- en
- sa
datasets:
- IIT-Madras-IndicTTS
metrics: null
widget:
- text: यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।
  example_title: Bhagavad Gita 4.7
- text: कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।
  example_title: Bhagavad Gita 2.47
- text: विद्या ददाति विनयं
  example_title: Subhashita
- text: तमसो मा ज्योतिर्गमय।
  example_title: Brihadaranyaka Upanishad
model-index:
- name: Sanskrit TTS v2
  results:
  - task:
      type: text-to-speech
      name: Text-to-Speech
    dataset:
      type: IIT-Madras-IndicTTS
      name: IIT Madras IndicTTS Sanskrit (Mono Female)
    metrics:
    - type: audio_duration
      name: Training Audio Duration
      value: 10.93 hrs
---
	[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Rstar-910/SamskritaBharati/blob/main/Sanskrit_TTS_v2.ipynb)

	# Sanskrit Text-to-Speech Model
	## Model Overview

- Model ID: R910/Sanskrit_TTS_v2
- Base Model: unsloth/orpheus-3b-0.1-ft
- Language: Sanskrit (Devanagari script)
- Primary Dataset: [IIT Madras IndicTTS Sanskrit Database](https://www.iitm.ac.in/donlab/indictts/database)
- Voice: Mono Female
- Training Audio Duration: 10.93 hours

This LLaMA-based language model is fine-tuned for Sanskrit text-to-speech synthesis. Training used Unsloth and Hugging Face's TRL library for efficiency.

	## Training Data
	The model was trained on the Sanskrit speech corpus from the [IIT Madras IndicTTS Database](https://www.iitm.ac.in/donlab/indictts/database), using a mono female voice recording with a total audio duration of 10.93 hours. The IndicTTS project, developed by the Speech and Language Technology Group at IIT Madras, provides high-quality speech corpora for Indic languages.
	## Installation Requirements
	### Environment Detection and Base Setup
	```bash
	# Environment detection
	python3 -c "
	import os
	print('colab' if 'COLAB_' in ''.join(os.environ.keys()) else 'local')
	"

	# Install core dependencies
	pip install snac
	```
	### Google Colab Installation
	For Google Colab environments, execute the following installation sequence:
	```bash
	# Install Colab-specific dependencies
	pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
	pip install sentencepiece protobuf 'datasets>=3.4.1,<4.0.0' huggingface_hub hf_transfer
	pip install --no-deps unsloth
	# Environment cleanup (recommended for clean installation)
	pip uninstall torch torchvision torchaudio unsloth unsloth_zoo transformers -y
	pip cache purge

	# Install PyTorch with CUDA 12.1 support
	pip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121

	# Install latest Unsloth from source
	pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

	# Additional dependencies
	pip install librosa
	pip install -U datasets
	```
	## Implementation Guide
	### Complete Implementation Code
	```python
	import gradio as gr
	import torch
	from unsloth import FastLanguageModel
	from IPython.display import display, Audio
	import numpy as np

	# Global model variables
	model = None
	tokenizer = None
	snac_model = None
	device = None


def load_models():
    """Initialize and load all required models for Sanskrit TTS inference."""
    global model, tokenizer, snac_model, device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Loading models on: {device}")

    # Load the fine-tuned Sanskrit TTS model
    model, tokenizer = FastLanguageModel.from_pretrained(
        "R910/Sanskrit_TTS_v2",
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=False,
    )

    model = model.to(device)
    FastLanguageModel.for_inference(model)

    # Load SNAC model for audio generation
    try:
        from snac import SNAC
    except ImportError as exc:
        # Without SNAC there is no decoder, so fail early with a clear message
        raise ImportError(
            "SNAC import failed. Make sure SNAC is installed (pip install snac)."
        ) from exc
    snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
    # The SNAC decoder is kept on the CPU
    snac_model.to("cpu")

    print("Models loaded successfully!")


def redistribute_codes(code_list):
    """Redistribute generated codes into hierarchical layers for audio synthesis."""
    layer_1 = []
    layer_2 = []
    layer_3 = []

    # Each frame holds 7 codes; position k in a frame is offset by k * 4096
    for i in range((len(code_list) + 1) // 7):
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i+1] - 4096)
        layer_3.append(code_list[7*i+2] - (2*4096))
        layer_3.append(code_list[7*i+3] - (3*4096))
        layer_2.append(code_list[7*i+4] - (4*4096))
        layer_3.append(code_list[7*i+5] - (5*4096))
        layer_3.append(code_list[7*i+6] - (6*4096))

    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0),
    ]

    # Decode the three code layers into a waveform
    audio_hat = snac_model.decode(codes)
    return audio_hat


def sanskrit_tts_inference(sanskrit_text, chosen_voice=""):
    """
    Generate Sanskrit speech from input text using the fine-tuned model.

    Args:
        sanskrit_text (str): Input Sanskrit text in Devanagari script
        chosen_voice (str): Voice selection parameter (optional)

    Returns:
        tuple: (audio_data, status_message)
    """
    if not sanskrit_text.strip():
        return None, "Please enter some Sanskrit text."

    try:
        prompts = [sanskrit_text]
        # Note: this assignment overrides the chosen_voice argument
        chosen_voice = 1070

        # Prepare prompts with voice selection
        prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

        # Tokenize input prompts
        all_input_ids = []
        for prompt in prompts_:
            input_ids = tokenizer(prompt, return_tensors="pt").input_ids
            all_input_ids.append(input_ids)

        # Define special tokens
        start_token = torch.tensor([[128259]], dtype=torch.int64)
        end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)

        # Construct modified input sequences
        all_modified_input_ids = []
        for input_ids in all_input_ids:
            modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
            all_modified_input_ids.append(modified_input_ids)

        # Apply left padding and create attention masks
        all_padded_tensors = []
        all_attention_masks = []
        max_length = max(ids.shape[1] for ids in all_modified_input_ids)

        for modified_input_ids in all_modified_input_ids:
            padding = max_length - modified_input_ids.shape[1]
            padded_tensor = torch.cat(
                [torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids],
                dim=1,
            )
            attention_mask = torch.cat(
                [
                    torch.zeros((1, padding), dtype=torch.int64),
                    torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64),
                ],
                dim=1,
            )
            all_padded_tensors.append(padded_tensor)
            all_attention_masks.append(attention_mask)

        # Batch tensors for inference
        all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
        all_attention_masks = torch.cat(all_attention_masks, dim=0)

        input_ids = all_padded_tensors.to(device)
        attention_mask = all_attention_masks.to(device)

        # Generate audio codes using the model
        generated_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1200,
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=128258,
            use_cache=True,
        )

        # Post-process generated tokens
        token_to_find = 128257
        token_to_remove = 128258

        token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

        # Keep only the tokens after the last occurrence of token_to_find
        if len(token_indices[1]) > 0:
            last_occurrence_idx = token_indices[1][-1].item()
            cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
        else:
            cropped_tensor = generated_ids

        # Remove the end-of-sequence token from every row
        processed_rows = []
        for row in cropped_tensor:
            masked_row = row[row != token_to_remove]
            processed_rows.append(masked_row)

        # Convert tokens to audio codes
        code_lists = []
        for row in processed_rows:
            row_length = row.size(0)
            new_length = (row_length // 7) * 7  # keep whole 7-code frames only
            trimmed_row = row[:new_length]
            trimmed_row = [t - 128266 for t in trimmed_row]
            code_lists.append(trimmed_row)

        # Generate audio samples
        my_samples = []
        for code_list in code_lists:
            samples = redistribute_codes(code_list)
            my_samples.append(samples)

        if len(my_samples) > 0:
            audio_sample = my_samples[0].detach().squeeze().to("cpu").numpy()
            return (24000, audio_sample), f"✅ Generated audio for: {sanskrit_text}"
        else:
            return None, "❌ Failed to generate audio - no valid codes produced."

    except Exception as e:
        return None, f"❌ Error during inference: {str(e)}"


# Initialize models
print("Loading models... This may take a moment.")
load_models()

# Create Gradio interface
with gr.Blocks(title="Sanskrit Text-to-Speech") as demo:
    gr.Markdown("""
    # 🕉️ Sanskrit Text-to-Speech

    Enter Sanskrit text in Devanagari script and generate speech using your fine-tuned model.
    """)

    with gr.Row():
        with gr.Column():
            sanskrit_input = gr.Textbox(
                label="Sanskrit Text",
                placeholder="Enter Sanskrit text in Devanagari script...",
                lines=3,
                value="नमस्ते"
            )

            generate_btn = gr.Button("🎵 Generate Speech", variant="primary")

        with gr.Column():
            audio_output = gr.Audio(
                label="Generated Sanskrit Speech",
                type="numpy"
            )

            status_output = gr.Textbox(
                label="Status",
                lines=2,
                interactive=False
            )

    # Example inputs for demonstration
    gr.Examples(
        examples=[
            ["नमस्ते"],
            ["संस्कृत एक प्राचीन भाषा है"],
            ["ॐ शान्ति शान्ति शान्तिः"],
            ["सर्वे भवन्तु सुखिनः"],
        ],
        inputs=[sanskrit_input],
        outputs=[audio_output, status_output],
        fn=sanskrit_tts_inference,
        cache_examples=False
    )

    # Connect interface components
    generate_btn.click(
        fn=sanskrit_tts_inference,
        inputs=[sanskrit_input],
        outputs=[audio_output, status_output]
    )

# Launch the application
if __name__ == "__main__":
    demo.launch(
        share=True,
        server_name="0.0.0.0",
        server_port=7860,
        show_error=True
    )
	```
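
The post-processing step in the implementation above can be hard to follow in tensor form. The sketch below restates it with plain Python lists, with no model or SNAC involved. The token ids are the ones used in the implementation, while the constant names and the `tokens_to_layers` helper are hypothetical and exist only for illustration.

```python
# Token ids as used in the implementation above; the names are illustrative.
START_OF_AUDIO = 128257  # audio codes follow the last occurrence of this token
END_OF_AUDIO = 128258    # end-of-sequence token, removed before decoding
CODE_OFFSET = 128266     # subtracted to map token ids to code values
CODEBOOK_SIZE = 4096
FRAME = 7                # number of codes per audio frame


def tokens_to_layers(generated):
    """Split a flat list of generated token ids into three code layers."""
    # Keep only what follows the last start-of-audio marker, if present
    if START_OF_AUDIO in generated:
        last = len(generated) - 1 - generated[::-1].index(START_OF_AUDIO)
        generated = generated[last + 1:]
    # Drop end-of-sequence tokens and shift the ids into code space
    codes = [t - CODE_OFFSET for t in generated if t != END_OF_AUDIO]
    # Discard a trailing partial frame
    codes = codes[: (len(codes) // FRAME) * FRAME]

    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(codes) // FRAME):
        frame = codes[FRAME * i : FRAME * (i + 1)]
        # Position k within a frame carries an offset of k * 4096
        values = [c - k * CODEBOOK_SIZE for k, c in enumerate(frame)]
        layer_1.append(values[0])
        layer_2.extend([values[1], values[4]])
        layer_3.extend([values[2], values[3], values[5], values[6]])
    return layer_1, layer_2, layer_3


# One synthetic frame whose decoded values are 1..7
frame = [CODE_OFFSET + k * CODEBOOK_SIZE + (k + 1) for k in range(FRAME)]
print(tokens_to_layers([START_OF_AUDIO] + frame + [END_OF_AUDIO]))
# → ([1], [2, 5], [3, 4, 6, 7])
```

Each frame therefore contributes one code to the first layer, two to the second, and four to the third, which is the layout `snac_model.decode` receives in `redistribute_codes`.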
	## 🔊 Demo Outputs
	<table>
	<tr>
<td><strong>यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।</strong><br/><em>Bhagavad Gita 4.7</em></td>
	<td>
	<audio controls>
	<source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/यदा यदा हि धर्मस्य ग्लानिर्भवति भारत।.wav" type="audio/wav">
	Your browser does not support the audio element.
	</audio>
	</td>
	</tr>
	<tr>
	<td><strong>🕉️ कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।</strong><br/><em>Bhagavad Gita 2.47</em></td>
	<td>
	<audio controls>
	<source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।.wav" type="audio/wav">
	Your browser does not support the audio element.
	</audio>
	</td>
	</tr>
	<tr>
	<td><strong>📚 विद्या ददाति विनयं</strong><br/><em>Subhashita</em></td>
	<td>
	<audio controls>
	<source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/विद्या ददाति विनयं.wav" type="audio/wav">
	Your browser does not support the audio element.
	</audio>
	</td>
	</tr>
	<tr>
	<td><strong>🌟 तमसो मा ज्योतिर्गमय।</strong><br/><em>Brihadaranyaka Upanishad 1.3.28</em></td>
	<td>
	<audio controls>
	<source src="https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/तमसो मा ज्योतिर्गमय।.wav" type="audio/wav">
	Your browser does not support the audio element.
	</audio>
	</td>
	</tr>
	</table>
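
The demo file names contain Devanagari characters and spaces, so a script that downloads them may need to percent-encode the path first. A minimal sketch using only the standard library follows. The base URL is the one used in the table above, and `demo_audio_url` is a hypothetical helper.

```python
from urllib.parse import quote

BASE_URL = "https://huggingface.co/R910/Sanskrit_TTS_v2/resolve/main/"


def demo_audio_url(file_name):
    """Return a percent-encoded download URL for a demo file in the repository."""
    # quote() encodes the name as UTF-8 and escapes spaces and non-ASCII characters
    return BASE_URL + quote(file_name)


url = demo_audio_url("विद्या ददाति विनयं.wav")
print(url)
```

The resulting string is plain ASCII and can be passed to any HTTP client.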


	## Model Information

- Developer: R910
- License: Apache 2.0
- Base Architecture: Fine-tuned from unsloth/orpheus-3b-0.1-ft

The model was trained with Unsloth, which its developers report as roughly 2x faster than standard fine-tuning implementations, together with Hugging Face's TRL (Transformer Reinforcement Learning) library.

	## Citation

	If you use this model or the training data, please cite:

	```bibtex
	@inproceedings{indictts,
	title = {Building Open Sourced and Industry Grade Low-Resource {TTS} for {I}ndian Languages},
	author = {ID Prakashraj and Abhayjeet Singh and Anusha Prakash and AV Anand Kumar and Shambavi Bhaskar
	and Varun Srinivas and Vishal Sunder and Hema A Murthy and S Umesh},
	booktitle = {Proc. Interspeech 2023},
	year = {2023},
	pages = {1009--1013},
	doi = {10.21437/Interspeech.2023-1339}
	}
	```

	Dataset source: [IIT Madras IndicTTS Database](https://www.iitm.ac.in/donlab/indictts/database)

	## Acknowledgments

	[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

	Special thanks to the [Speech and Language Technology Group, IIT Madras](https://www.iitm.ac.in/donlab/indictts/) for providing the Sanskrit TTS dataset.

	## Technical Specifications

	- Model Type: Fine-tuned Language Model for Text-to-Speech
	- Architecture: LLaMA-based with LoRA adaptation
	- Audio Output: 24kHz sampling rate
	- Maximum Sequence Length: 2048 tokens
	- Supported Script: Devanagari (Sanskrit)
	- Training Framework: Unsloth + Hugging Face TRL
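
To keep the 24 kHz output outside a notebook or Gradio session, the samples can be written to a WAV file. The sketch below uses only the standard library and assumes the samples are mono floats in the range [-1.0, 1.0]. That range is an assumption about the decoder output, and `save_wav` is a hypothetical helper rather than part of the model's code.

```python
import struct
import wave


def save_wav(path, samples, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        # Clip to the assumed range, then scale to signed 16-bit integers
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Usage with the inference function from the implementation code:
#   result, status = sanskrit_tts_inference("नमस्ते")
#   if result is not None:
#       rate, audio = result
#       save_wav("output.wav", audio, rate)
```

The loop is written for clarity rather than speed. For long clips, a vectorized conversion with NumPy would be the more usual choice.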

	## Usage Requirements

	- Hardware: CUDA-compatible GPU
	- Dependencies: PyTorch 2.4.1+, Transformers, SNAC audio codec
- Python Version: 3.8+ (PyTorch 2.4.1 does not support Python 3.7)
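
Before loading the model, it can help to confirm that the listed dependencies are importable. The sketch below is one way to do that with the standard library. The import names `torch`, `transformers`, and `snac` are assumed to match the packages listed above, and `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names assumed for the dependencies listed above
required = ["torch", "transformers", "snac"]
missing = missing_modules(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks that the packages can be located. GPU availability is a separate check, which the implementation code performs with `torch.cuda.is_available()`.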