austinekurian committed on
Commit ea79f5f · verified · 1 Parent(s): 38e9761

Upload 5 files

Files changed (5)
  1. LICENSE +21 -0
  2. README.md +45 -12
  3. app.py +200 -0
  4. packages.txt +1 -0
  5. requirements.txt +6 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,12 +1,45 @@
- ---
- title: AIapps
- emoji: 📈
- colorFrom: purple
- colorTo: red
- sdk: gradio
- sdk_version: 6.1.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # മലയാളം Text → AI Voice (Free)
+
+ A free web app (Hugging Face Space, Gradio) that converts **Malayalam** text to speech using the **AI4Bharat VITS** model.
+
+ ## How it works
+ - Loads the multilingual Indian **VITS TTS** model `ai4bharat/vits_rasa_13`, which includes **Malayalam** voices and multiple **styles** (NEWS, BOOK, etc.).
+ - Renders a simple Gradio UI: paste Malayalam text → click **Generate** → download audio.
+
+ > Model reference: AI4Bharat VITS model with Malayalam support and style/speaker IDs.
+ > A Piper/Sherpa‑ONNX alternative for Malayalam also exists (`ml_IN-arjun`), if you prefer an ONNX path.
+
+ ## Deploy (Hugging Face Spaces)
+ 1. Create a new Space → **Gradio**.
+ 2. Upload these files: `app.py`, `requirements.txt`, `packages.txt`, `README.md`.
+ 3. The Space will build and start automatically.
+ 4. Share the public URL.
+
+ ## Usage
+ - Default speaker is **MAL_F (11)**.
+ - Try styles like **NEWS (10)** for crisp reading, **BOOK (3)** for long‑form, **ALEXA (0)** for neutral.
+
+ ## Local run (optional)
+ ```bash
+ python -m venv .venv && source .venv/bin/activate
+ pip install -r requirements.txt
+ python app.py
+ ```
+
+ ## Licensing
+ - App code: MIT (see below).
+ - **Model license**: please review the license on the model page before commercial use.
+
+ ### MIT License (app code)
+ Copyright (c) 2025
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction...
+ ```
+ (standard MIT terms)
+ ```
+
+
+ ## New features
+ - **Prosody sliders:** speaking rate (0.5–1.5) and pitch (−4…+4 semitones). Implemented via resampling (approximate).
+ - **Batch paragraphs:** split on blank lines → one file per paragraph × style.
+ - **MP3 alongside WAV:** via `pydub` + ffmpeg (present on Spaces). Falls back to WAV only if MP3 export fails.
+
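The batch rule in the README (split on blank lines, one file per paragraph × style) can be sketched in isolation. This is a minimal illustration rather than the app's code: `split_on_blank_lines` and `count_outputs` are hypothetical names, and treating whitespace-only lines as blank is an assumption that goes slightly beyond what the README states.

```python
import re


def split_on_blank_lines(text: str) -> list[str]:
    """Split text into paragraphs on runs of blank or whitespace-only lines."""
    # Normalize Windows and old-Mac line endings first.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    chunks = re.split(r"\n\s*\n", normalized)
    # Drop empty chunks and trim surrounding whitespace.
    return [c.strip() for c in chunks if c.strip()]


def count_outputs(text: str, styles: list[str]) -> int:
    """Number of audio renders implied by 'paragraph × style' (hypothetical helper)."""
    return len(split_on_blank_lines(text)) * len(styles)


if __name__ == "__main__":
    sample = "first paragraph\n\nsecond paragraph\n \nthird paragraph"
    print(split_on_blank_lines(sample))
    print(count_outputs(sample, ["NEWS", "BOOK"]))
```

Counting renders before synthesis makes it cheap to reject oversized requests up front, which is useful on a shared CPU Space.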
app.py ADDED
@@ -0,0 +1,200 @@
+ # app.py
+ # Malayalam TTS (Free) – Multi-style, Prosody (rate & pitch), Batch paragraphs, WAV+MP3
+ # Model: AI4Bharat VITS (supports Malayalam among 13 Indian languages)
+
+ import gradio as gr
+ import soundfile as sf
+ import tempfile
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+ import numpy as np
+ import os
+
+ # Optional MP3 conversion
+ try:
+     from pydub import AudioSegment
+     _HAS_PYDUB = True
+ except Exception:
+     _HAS_PYDUB = False
+
+ MODEL_ID = "ai4bharat/vits_rasa_13"
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).to(device)
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+
+ DEFAULT_SPEAKER = 11 # MAL_F
+ DEFAULT_TEXT = (
+     "മലയാളം ടെക്സ്റ്റ് ശബ്ദമായി മാറ്റാൻ ഇതുപയോഗിക്കുക. താഴെ ഒരു ഉദാഹരണം നൽകുന്നു.\n\n"
+     "ഇത് ഒരു രണ്ടാം പാരഗ്രാഫ് ആണ്."
+ )
+
+ STYLE_LABELS = {
+     0: "ALEXA",
+     1: "ANGER",
+     2: "BB",
+     3: "BOOK",
+     4: "CONV",
+     5: "DIGI",
+     6: "DISGUST",
+     7: "FEAR",
+     8: "HAPPY",
+     10: "NEWS",
+     12: "SAD",
+     14: "SURPRISE",
+     15: "UMANG",
+     16: "WIKI",
+ }
+
+
+ def split_paragraphs(text: str):
+     # Split on blank lines; ignore empty chunks
+     parts = [p.strip() for p in text.replace('\r','').split('\n\n')]
+     parts = [p for p in parts if p]
+     return parts if parts else ([text.strip()] if text.strip() else [])
+
+
+ def time_scale(wav: np.ndarray, rate: float) -> np.ndarray:
+     """Naive time scaling by linear interpolation. rate>1 -> faster (shorter)."""
+     if rate <= 0:
+         rate = 1.0
+     if abs(rate - 1.0) < 1e-6:
+         return wav
+     n = len(wav)
+     new_len = max(1, int(n / rate))
+     x_old = np.linspace(0.0, 1.0, n, endpoint=False)
+     x_new = np.linspace(0.0, 1.0, new_len, endpoint=False)
+     return np.interp(x_new, x_old, wav).astype(wav.dtype)
+
+
+ def apply_prosody(wav: np.ndarray, sr: int, rate: float, pitch_semitones: float):
+     """
+     Approximate prosody control without heavy DSP:
+     - We implement pitch by changing the output *sample rate* by factor pf = 2**(semitones/12).
+     - Changing sample rate also changes playback speed by pf, so we pre-scale time by rate/pf
+       to keep the final perceived speaking rate close to the requested rate.
+     """
+     pf = 2.0 ** (pitch_semitones / 12.0)
+     pre_rate = max(0.25, min(4.0, rate / max(pf, 1e-6)))
+     y = time_scale(wav, pre_rate)
+     out_sr = int(sr * pf)
+     return y, out_sr
+
+
+ def synthesize_once(text: str, speaker_id: int, style_id: int):
+     inputs = tokenizer(text=text, return_tensors="pt").to(device)
+     outputs = model(inputs['input_ids'], speaker_id=int(speaker_id), emotion_id=int(style_id))
+     wav = outputs.waveform.squeeze().detach().cpu().numpy()
+     sr = model.config.sampling_rate
+     return wav, sr
+
+
+ def save_audio_pair(wav: np.ndarray, sr: int, base_name: str, make_mp3: bool):
+     # Save WAV
+     wav_path = base_name + ".wav"
+     sf.write(wav_path, wav, sr)
+     out_files = [wav_path]
+     # Optionally save MP3 via pydub/ffmpeg
+     if make_mp3 and _HAS_PYDUB:
+         try:
+             mp3_path = base_name + ".mp3"
+             seg = AudioSegment.from_wav(wav_path)
+             seg.export(mp3_path, format="mp3")
+             out_files.append(mp3_path)
+         except Exception:
+             # Best effort: keep only the WAV if MP3 export fails
+             pass
+     return out_files
+
+
+ def parse_style(choice: str) -> int:
+     try:
+         return int(choice.split(":", 1)[0])
+     except Exception:
+         return 0
+
+
+ with gr.Blocks(theme=gr.themes.Soft()) as demo:
+     gr.Markdown(
+         """
+         # മലയാളം Text → AI Voice (Free)
+         Open‑source Malayalam TTS powered by **AI4Bharat VITS**.
+         Now supports **multiple voice styles**, **prosody (rate & pitch)**, **batch paragraphs**, and **WAV + MP3** output.
+         """
+     )
+
+     with gr.Row():
+         txt = gr.Textbox(label="Malayalam Text (single or multiple paragraphs)", value=DEFAULT_TEXT, lines=8, placeholder="ഒരു അല്ലെങ്കിൽ നിരവധി പാരഗ്രാഫുകൾ ഇവിടെ പേസ്റ്റ് ചെയ്യുക… രണ്ട് newline ഉപയോഗിച്ച് വേർതിരിക്കുക.")
+     with gr.Row():
+         speaker = gr.Slider(0, 19, value=DEFAULT_SPEAKER, step=1, label="Speaker ID (MAL_F = 11)")
+         styles = gr.CheckboxGroup(
+             choices=[f"{k}:{v}" for k, v in STYLE_LABELS.items()],
+             value=["0:ALEXA", "10:NEWS", "3:BOOK"],
+             label="Voice styles (select one or more)"
+         )
+     with gr.Row():
+         rate = gr.Slider(minimum=0.5, maximum=1.5, value=1.0, step=0.05, label="Speaking rate (0.5–1.5)")
+         pitch = gr.Slider(minimum=-4, maximum=+4, value=0, step=1, label="Pitch (semitones, -4 to +4)")
+         batch = gr.Checkbox(value=True, label="Batch: split by blank lines (paragraphs)")
+         make_mp3 = gr.Checkbox(value=True, label="Also export MP3 (needs ffmpeg)")
+     with gr.Row():
+         btn = gr.Button("Generate", variant="primary")
+     audio = gr.Audio(label="Preview (first file)", type="filepath")
+     files_out = gr.Files(label="All generated files")
+     note = gr.Markdown()
+
+     def run(text, speaker_id, style_choices, rate, pitch, batch, make_mp3):
+         text = (text or "").strip()
+         if not text:
+             # Message: "Please enter a sentence/paragraph in Malayalam."
+             raise gr.Error("ദയവായി മലയാളത്തിൽ ഒരു വാചകം/പാരഗ്രാഫ് നൽകുക.")
+         paras = split_paragraphs(text) if batch else [text]
+         if not style_choices:
+             style_choices = ["0:ALEXA"]
+         total = len(paras) * len(style_choices)
+         if total > 30:
+             # Message: "You are requesting too many outputs; please choose fewer paragraphs/styles."
+             raise gr.Error(f"താങ്കൾ വളരെ കൂടുതൽ ഔട്ട്‌പുട്ടുകൾ ആവശ്യപ്പെടുന്നു ({total}). ദയവായി കുറച്ച് പാരഗ്രാഫുകൾ/സ്റ്റൈലുകൾ തിരഞ്ഞെടുക്കുക (<= 30 files).")
+
+         all_files = []
+         preview = None
+         details = []
+         for pi, para in enumerate(paras, start=1):
+             for sc in style_choices:
+                 stid = parse_style(sc)
+                 # Synthesize once per paragraph × style so each file reflects its emotion_id
+                 wav, sr = synthesize_once(para, int(speaker_id), stid)
+                 # Apply prosody approximation
+                 wav2, sr2 = apply_prosody(wav, sr, float(rate), float(pitch))
+                 base = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name[:-4]
+                 base_named = f"{base}_p{pi:02d}_style-{stid}_{STYLE_LABELS.get(stid, 'STYLE')}"
+                 outs = save_audio_pair(wav2, sr2, base_named, bool(make_mp3))
+                 all_files.extend(outs)
+                 if preview is None:
+                     preview = outs[0]
+                 details.append(f"• P{pi} – {STYLE_LABELS.get(stid, sc)} → {os.path.basename(outs[0])}{' (+MP3)' if len(outs)>1 else ''}")
+
+         summary = (
+             f"Generated **{len(all_files)}** files for {len(paras)} paragraph(s) × {len(style_choices)} style(s).\n\n"
+             + "\n".join(details)
+             + "\n\n**Note:** Rate & pitch are approximations using resampling; for studio-grade SSML prosody use a managed TTS like Azure."
+         )
+         return preview, all_files, summary
+
+     btn.click(run, inputs=[txt, speaker, styles, rate, pitch, batch, make_mp3], outputs=[audio, files_out, note])
+
+     gr.Markdown(
+         """
+         **Prosody controls**
+         *Speaking rate* slows/speeds audio; *Pitch* raises/lowers tone (in semitones). These are **approximate** controls based on resampling. For high‑fidelity prosody, consider SSML in Azure TTS.
+
+         **Batch mode**
+         Split input into paragraphs using a blank line. The app creates one file per **paragraph × style**.
+
+         **MP3 output**
+         Requires `ffmpeg` (available on Hugging Face Spaces). If unavailable, only WAV will be produced.
+         """
+     )
+
+ if __name__ == "__main__":
+     demo.launch()
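The prosody docstring in app.py chains two resampling steps: a time scale by `rate/pf`, then playback at `sr*pf`. Working through the arithmetic helps show what a listener would hear. The sketch below assumes both steps behave like ideal resampling (the app uses linear interpolation) and ignores integer rounding; `net_effect` is a hypothetical helper, not part of the commit. With the slider ranges in the UI, `rate/pf` stays well inside the 0.25 to 4.0 clamp, so the clamp is left out.

```python
def pitch_factor(semitones: float) -> float:
    """Frequency ratio for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def net_effect(rate: float, semitones: float) -> tuple[float, float]:
    """Return (speed_factor, frequency_factor) relative to the unmodified audio.

    Hypothetical helper: models each resampling step as scaling speed and
    frequency by the same factor, which holds for plain resampling.
    """
    pf = pitch_factor(semitones)
    pre_rate = rate / pf  # step 1: the waveform is shortened by this factor
    # Step 1 alone, played at the original sample rate, scales both speed
    # and frequency by pre_rate. Step 2 declares the sample rate as sr * pf,
    # which scales both again by pf.
    speed = pre_rate * pf
    frequency = pre_rate * pf
    return speed, frequency


if __name__ == "__main__":
    for rate, semis in [(1.0, 4), (1.5, -4), (0.5, 0)]:
        print(rate, semis, net_effect(rate, semis))
```

Under these assumptions both factors come out equal to `rate`, so the semitone setting largely cancels and the rate slider moves pitch together with speed. Decoupling the two would need a pitch-preserving time stretch (for example a phase vocoder), which is outside what this commit implements.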
packages.txt ADDED
@@ -0,0 +1 @@
+ ffmpeg
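Since MP3 export depends on both `pydub` and an `ffmpeg` binary, a small pre-flight check can tell the two failure modes apart before any audio is generated. This is a sketch under the assumption that `ffmpeg` is discovered through `PATH`; `mp3_export_available` is a hypothetical helper name and does not appear in the commit.

```python
import importlib.util
import shutil


def mp3_export_available() -> bool:
    """Return True if pydub is importable and an ffmpeg binary is on PATH."""
    has_pydub = importlib.util.find_spec("pydub") is not None
    has_ffmpeg = shutil.which("ffmpeg") is not None
    return has_pydub and has_ffmpeg


if __name__ == "__main__":
    if mp3_export_available():
        print("MP3 export should work")
    else:
        print("MP3 export unavailable; only WAV will be produced")
```

Running the check once at startup would let the UI disable the MP3 checkbox instead of silently falling back to WAV.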
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ gradio==4.44.0
+ transformers
+ torch
+ soundfile
+
+ pydub