adityachhabra committed
Commit ea6f6fa · 1 Parent(s): 91c214f

Update README.md and create Gradio app for testing

Files changed (3)
  1. README.md +46 -5
  2. app.py +182 -0
  3. requirements.txt +5 -0
README.md CHANGED
@@ -1,8 +1,8 @@
 ---
-title: Svara Tts
-emoji: 👁
-colorFrom: pink
-colorTo: pink
+title: Svara TTS
+emoji: 🗣️
+colorFrom: blue
+colorTo: violet
 sdk: gradio
 sdk_version: 5.49.1
 app_file: app.py
@@ -10,4 +10,45 @@ pinned: false
 license: apache-2.0
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# 🗣️ Svara-TTS
+
+**An open multilingual text-to-speech (TTS) model bringing expressive, human-like speech to India’s languages.**
+
+Svara-TTS is trained on **1,900+ hours** of diverse Indian speech across **17 languages**, with balanced male and female voices.
+It captures **emotion**, **tone**, and **rhythm**, producing natural, human-like speech for real-world Indian use cases.
+
+> 💡 Supports `<happy>`, `<sad>`, `<fear>`, `<relief>`, and other emotion tags — plus multilingual and zero-shot voice cloning.
+
+---
+
+## 🌍 Supported Languages
+
+Hindi, Marathi, Tamil, Telugu, Bengali, Kannada, Gujarati, Malayalam, Punjabi, Assamese, Odia, Bhojpuri, Chhattisgarhi, Maithili, Magahi, Nepali, and Indian English.
+
+---
+
+## 🎧 Try These
+
+Hindi (Female): अरे वाह! आज तो मौसम बहुत ही सुहावना लग रहा है।
+Tamil (Male): இன்று அலுவலகத்தில் பெரிய கூட்டம் இருந்தது, ஆனா எல்லாம் நன்றாக முடிந்தது.
+Telugu (Female): నిజం చెప్పాలంటే, ఈ రోజు కొంచెం టెన్షన్ గా ఉంది… కానీ అన్ని బాగానే జరుగుతాయని నమ్మకం ఉంది.
+English (Female): Sometimes it’s not about being perfect… it’s about being real.
+
+---
+
+### ⚙️ Technical Overview
+
+- Model: **kenpath/svara-tts-v1**
+- Codec: [SNAC 24kHz](https://huggingface.co/hubertsiuzdak/snac_24khz)
+- Sampling rate: **24,000 Hz**
+- Framework: **PyTorch + Transformers**
+
+---
+
+### 🧩 About
+
+Built by [**Kenpath Technologies**](https://kenpath.ai)
+Trained on open datasets including SYSPIN, SPICOR, RASA, and IndicTTS
+Released under **Apache-2.0 License**
+
+> “Svara is how India sounds — open, expressive, and human.”
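Review note: the "Try These" lines in the README use the same `Language (Gender): text` shape that `process_prompt` in app.py assembles before tokenization. A minimal sketch of that string assembly follows, for illustration only. `build_prompt` is a hypothetical helper that is not part of this commit, and appending the emotion tag at the end of the text simply mirrors the example prompts in app.py; whether tag position matters to the model is not stated in the source.

```python
def build_prompt(language: str, gender: str, text: str, emotion_tag: str = "") -> str:
    """Assemble a prompt in the 'Language (Gender): text' shape.

    Hypothetical helper: it mirrors the f-string used in app.py's
    process_prompt, plus an optional trailing emotion tag.
    """
    body = text.strip()
    if emotion_tag:
        # Assumption: the tag goes after the sentence, as in app.py's examples.
        body = f"{body} {emotion_tag}"
    return f"{language} ({gender}): {body}"


if __name__ == "__main__":
    print(build_prompt("Hindi", "Female", "अरे वाह!", "<happy>"))
```

The language argument here is the plain label (for example `Hindi`), which is what app.py looks up from the dropdown value via its `LANGUAGES` mapping.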
app.py ADDED
@@ -0,0 +1,182 @@
+import spaces
+from snac import SNAC
+import torch
+import gradio as gr
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from huggingface_hub import snapshot_download
+from dotenv import load_dotenv
+load_dotenv()
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+print("Loading SNAC model...")
+snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to(device)
+
+model_name = "kenpath/svara-tts-v1"
+
+snapshot_download(
+    repo_id=model_name,
+    allow_patterns=["config.json", "*.safetensors", "model.safetensors.index.json"],
+    ignore_patterns=["optimizer.pt", "pytorch_model.bin", "training_args.bin", "scheduler.pt"]
+)
+
+model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+print(f"Svara model loaded to {device}")
+
+# --------------------------
+# Language and Gender Setup
+# --------------------------
+LANGUAGES = {
+    "Assamese (অসমীয়া)": "Assamese",
+    "Bengali (বাংলা)": "Bengali",
+    "Bhojpuri (भोजपुरी)": "Bhojpuri",
+    "Chhattisgarhi (छत्तीसगढ़ी)": "Chhattisgarhi",
+    "Gujarati (ગુજરાતી)": "Gujarati",
+    "Hindi (हिन्दी)": "Hindi",
+    "Kannada (ಕನ್ನಡ)": "Kannada",
+    "Maithili (मैथिली)": "Maithili",
+    "Magahi (मगही)": "Magahi",
+    "Malayalam (മലയാളം)": "Malayalam",
+    "Marathi (मराठी)": "Marathi",
+    "Nepali (नेपाली)": "Nepali",
+    "Odia (ଓଡ଼ିଆ)": "Odia",
+    "Punjabi (ਪੰਜਾਬੀ)": "Punjabi",
+    "Tamil (தமிழ்)": "Tamil",
+    "Telugu (తెలుగు)": "Telugu",
+    "English (Indian)": "English"
+}
+GENDERS = ["Male", "Female"]
+
+# Emotion tags for user help text
+EMOTIVE_TAGS = ["<happy>", "<sad>", "<fear>", "<surprise>", "<calm>", "<angry>", "<relief>", "<hopeful>", "<thoughtful>"]
+
+# --------------------------
+# Prompt Preparation
+# --------------------------
+def process_prompt(language, gender, text, tokenizer, device):
+    lang_label = LANGUAGES.get(language, "English")
+    prompt = f"{lang_label} ({gender}): {text}"
+    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+    start_token = torch.tensor([[128259]], dtype=torch.int64)
+    end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)
+    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
+    attention_mask = torch.ones_like(modified_input_ids)
+    return modified_input_ids.to(device), attention_mask.to(device)
+
+# --------------------------
+# Generation Functions
+# --------------------------
+def parse_output(generated_ids):
+    token_to_find, token_to_remove = 128257, 128258
+    token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
+    cropped_tensor = generated_ids[:, token_indices[1][-1] + 1:] if len(token_indices[1]) > 0 else generated_ids
+    processed_rows = [row[row != token_to_remove] for row in cropped_tensor]
+    row = processed_rows[0]
+    trimmed_row = row[: (row.size(0) // 7) * 7]
+    trimmed_row = [t - 128266 for t in trimmed_row]
+    return trimmed_row
+
+def redistribute_codes(code_list, snac_model):
+    layer_1, layer_2, layer_3 = [], [], []
+    for i in range((len(code_list) + 1) // 7):
+        layer_1.append(code_list[7*i])
+        layer_2.append(code_list[7*i+1]-4096)
+        layer_3.append(code_list[7*i+2]-(2*4096))
+        layer_3.append(code_list[7*i+3]-(3*4096))
+        layer_2.append(code_list[7*i+4]-(4*4096))
+        layer_3.append(code_list[7*i+5]-(5*4096))
+        layer_3.append(code_list[7*i+6]-(6*4096))
+    codes = [torch.tensor(x, device=device).unsqueeze(0) for x in [layer_1, layer_2, layer_3]]
+    return snac_model.decode(codes).detach().squeeze().cpu().numpy()
+
+@spaces.GPU()
+def generate_speech(language, gender, text, temperature, top_p, repetition_penalty, max_new_tokens, progress=gr.Progress()):
+    if not text.strip():
+        return None
+    try:
+        progress(0.1, "Processing text...")
+        input_ids, attention_mask = process_prompt(language, gender, text, tokenizer, device)
+        progress(0.3, "Generating speech tokens...")
+        with torch.no_grad():
+            generated_ids = model.generate(
+                input_ids=input_ids, attention_mask=attention_mask,
+                max_new_tokens=max_new_tokens, do_sample=True,
+                temperature=temperature, top_p=top_p,
+                repetition_penalty=repetition_penalty,
+                num_return_sequences=1, eos_token_id=128258
+            )
+        progress(0.6, "Parsing output...")
+        code_list = parse_output(generated_ids)
+        progress(0.8, "Converting to audio...")
+        audio_samples = redistribute_codes(code_list, snac_model)
+        return (24000, audio_samples)
+    except Exception as e:
+        print(f"Error generating speech: {e}")
+        return None
+
+# --------------------------
+# Example Prompts
+# --------------------------
+examples = [
+    ["Hindi (हिन्दी)", "Female", "अरे वाह! आज तो मौसम बहुत ही सुहावना लग रहा है। <happy>", 0.6, 0.95, 1.1, 1200],
+    ["Marathi (मराठी)", "Male", "खरंच सांगतो, आजचा दिवस खूप छान गेला! <happy>", 0.6, 0.95, 1.1, 1200],
+    ["Tamil (தமிழ்)", "Male", "இன்று அலுவலகத்தில் பெரிய கூட்டம் இருந்தது, ஆனா எல்லாம் நன்றாக முடிந்தது. <relief>", 0.65, 0.95, 1.1, 1200],
+    ["Bengali (বাংলা)", "Female", "আজ অফিসে এত কাজ ছিল যে মাথা ধরেছে! একটু বিশ্রাম নেব ভাবছি। <sad>", 0.7, 0.9, 1.1, 1200],
+    ["Telugu (తెలుగు)", "Female", "నిజం చెప్పాలంటే, ఈ రోజు కొంచెం టెన్షన్ గా ఉంది... కానీ అన్ని బాగానే జరుగుతాయని నమ్మకం ఉంది. <fear>", 0.7, 0.95, 1.1, 1200],
+    ["Gujarati (ગુજરાતી)", "Female", "અરે વાહ, આજે તો પૂરો દિવસ ફક્ત વરસાદ જ પડ્યો! ચા પી લઈએ ને? <joy>", 0.6, 0.95, 1.1, 1200],
+    ["Kannada (ಕನ್ನಡ)", "Male", "ಹೌದು, ನಿನ್ನೆ ರಾತ್ರಿ ತುಂಬಾ ಮಳೆ ಬಿತ್ತು. ಬೆಳಿಗ್ಗೆ ರಸ್ತೆಗಳಲ್ಲಿ ನೀರು ತುಂಬಿದೆ. <surprise>", 0.65, 0.95, 1.1, 1200],
+    ["English (Indian)", "Female", "Sometimes it's not about being perfect... it's about being real, and just a little bit messy. <reflective>", 0.6, 0.95, 1.1, 1200],
+]
+
+# --------------------------
+# Gradio UI
+# --------------------------
+with gr.Blocks(title="Svara Multilingual TTS") as demo:
+    gr.Markdown(f"""
+    # 🎵 Svara-TTS
+    *An open multilingual TTS model for expressive, human-like speech across India's languages.*
+
+    **Tips:**
+    - Add emotion tags like {", ".join(EMOTIVE_TAGS)} for expressive speech.
+    - Use longer, natural sentences for better prosody.
+    - Choose a language and gender before generating.
+    """)
+
+    with gr.Row():
+        with gr.Column(scale=3):
+            lang = gr.Dropdown(choices=list(LANGUAGES.keys()), value="Hindi (हिन्दी)", label="Language")
+            gender = gr.Dropdown(choices=GENDERS, value="Female", label="Gender")
+            text_input = gr.Textbox(label="Text to speak", placeholder="Type your text...", lines=5)
+
+            with gr.Accordion("Advanced Settings", open=False):
+                temperature = gr.Slider(0.1, 1.5, 0.6, 0.05, label="Temperature")
+                top_p = gr.Slider(0.1, 1.0, 0.95, 0.05, label="Top-p")
+                repetition_penalty = gr.Slider(1.0, 2.0, 1.1, 0.05, label="Repetition Penalty")
+                max_new_tokens = gr.Slider(100, 2000, 1200, 100, label="Max Tokens")
+
+            with gr.Row():
+                submit = gr.Button("Generate Speech", variant="primary")
+                clear = gr.Button("Clear")
+
+        with gr.Column(scale=2):
+            audio_output = gr.Audio(label="Generated Speech", type="numpy")
+
+    gr.Examples(
+        examples=examples,
+        inputs=[lang, gender, text_input, temperature, top_p, repetition_penalty, max_new_tokens],
+        outputs=audio_output,
+        fn=generate_speech,
+        cache_examples=True,
+    )
+
+    submit.click(
+        fn=generate_speech,
+        inputs=[lang, gender, text_input, temperature, top_p, repetition_penalty, max_new_tokens],
+        outputs=audio_output,
+    )
+
+    clear.click(fn=lambda: (None, None), inputs=[], outputs=[text_input, audio_output])
+
+if __name__ == "__main__":
+    demo.queue().launch(share=False, ssr_mode=False)
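Review note: the post-processing done by `parse_output` and `redistribute_codes` in app.py can be followed on plain Python lists. The sketch below is not the commit's code. It drops the tensors and the SNAC decoder and keeps only the index arithmetic, reusing the token IDs (128257, 128258, 128266) and the 4096-per-slot offsets that appear in app.py. The names `extract_codes`, `split_layers`, and the constants are made up for illustration.

```python
from typing import List, Tuple

CROP_AFTER_TOKEN = 128257  # keep only what follows the last occurrence
REMOVE_TOKEN = 128258      # stripped from the sequence (also the eos_token_id)
CODE_OFFSET = 128266       # subtracted from every remaining token
FRAME_SIZE = 7             # tokens per frame
SLOT_OFFSET = 4096         # slot k of a frame is shifted by k * 4096


def extract_codes(generated: List[int]) -> List[int]:
    """Crop, filter, and trim generated token IDs to whole 7-token frames."""
    if CROP_AFTER_TOKEN in generated:
        # Index of the last occurrence of the crop marker.
        last = len(generated) - 1 - generated[::-1].index(CROP_AFTER_TOKEN)
        generated = generated[last + 1:]
    kept = [t for t in generated if t != REMOVE_TOKEN]
    kept = kept[: (len(kept) // FRAME_SIZE) * FRAME_SIZE]  # drop a partial frame
    return [t - CODE_OFFSET for t in kept]


def split_layers(codes: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """De-interleave frames into three layers of 1, 2, and 4 codes per frame."""
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(codes) // FRAME_SIZE):
        frame = codes[FRAME_SIZE * i: FRAME_SIZE * (i + 1)]
        layer_1.append(frame[0])
        layer_2.append(frame[1] - 1 * SLOT_OFFSET)
        layer_3.append(frame[2] - 2 * SLOT_OFFSET)
        layer_3.append(frame[3] - 3 * SLOT_OFFSET)
        layer_2.append(frame[4] - 4 * SLOT_OFFSET)
        layer_3.append(frame[5] - 5 * SLOT_OFFSET)
        layer_3.append(frame[6] - 6 * SLOT_OFFSET)
    return layer_1, layer_2, layer_3


if __name__ == "__main__":
    # One synthetic frame whose slot values are 5..11, wrapped in prompt-like
    # tokens, the crop marker, and a trailing end token.
    slot_values = [5, 6, 7, 8, 9, 10, 11]
    frame = [CODE_OFFSET + k * SLOT_OFFSET + v for k, v in enumerate(slot_values)]
    generated = [1, 2, CROP_AFTER_TOKEN] + frame + [REMOVE_TOKEN]
    print(split_layers(extract_codes(generated)))
```

In app.py the three lists are then turned into tensors and passed to `snac_model.decode`; that step is left out here so the sketch runs without a GPU or model weights.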
requirements.txt ADDED
@@ -0,0 +1,5 @@
+snac
+python-dotenv
+transformers
+torch
+spaces