Complete Solution: Advanced TTS with Real Voices + Voice Cloning

#12
by masbudjj - opened
Files changed (1)
  1. README.md +275 -126
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: Multi-Voice TTS - 24 Unique Voices
3
  emoji: 🎙️
4
  colorFrom: indigo
5
  colorTo: purple
@@ -8,178 +8,317 @@ pinned: false
8
  license: apache-2.0
9
  ---
10
 
11
- # 🎙️ Multi-Voice Text-to-Speech
12
 
13
- **24 Unique Voices - 100% Browser-Based - No Server Required**
14
 
15
- ## ✨ Features
16
 
17
- ### 🎭 24 Unique Voice Characters
18
 
19
- #### 🇺🇸 American Female (6 voices)
20
- - **Default** - Neutral baseline
21
- - **Warm** - Friendly & caring
22
- - **Bright** - Energetic & happy
23
- - **Soft** - Gentle & calm
24
- - **Clear** - Professional
25
- - **Smooth** - Elegant
26
 
27
- #### 🇺🇸 American Male (6 voices)
28
- - **Default** - Neutral baseline
29
- - **Deep** - Authoritative
30
- - **Friendly** - Approachable
31
- - **Strong** - Confident
32
- - **Calm** - Relaxed
33
- - **Professional** - Business-oriented
34
 
35
- #### 🇬🇧 British Female (4 voices)
36
- - **Refined** - Elegant
37
- - **Bright** - Cheerful
38
- - **Soft** - Gentle
39
- - **Clear** - Articulate
40
 
41
- #### 🇬🇧 British Male (4 voices)
42
- - **Distinguished** - Formal
43
- - **Smooth** - Sophisticated
44
- - **Warm** - Friendly
45
- - **Strong** - Commanding
 
 
46
 
47
- #### 🌏 International (4 voices)
48
- - **Neutral** - Standard
49
- - **Soft** - Gentle
50
- - **Clear** - Professional
51
- - **Warm** - Friendly
52
 
53
  ---
54
 
55
- ## 🎨 Voice Customization
56
 
57
- Each voice can be further customized with:
 
 
 
 
58
 
59
- - **Pitch Control** (0.5x - 1.5x) - Adjust voice pitch
60
- - **Energy Control** (0.5x - 1.5x) - Modify speaking energy
61
- - **Speed Control** (0.5x - 2.0x) - Playback speed
62
 
63
- **Total Combinations:** 24 voices × unlimited pitch/energy variations = **Infinite possibilities!**
64
 
65
  ---
66
 
67
- ## 🏗️ Technology
68
 
69
- ### Base Model
70
- - **SpeechT5** from Microsoft
71
- - **ONNX Runtime** for browser execution
72
- - **WebAssembly** backend
 
73
 
74
- ### Voice Generation
75
- Each of the 24 voices is created by:
76
- 1. Taking base speaker embedding (512-dim)
77
- 2. Applying pitch transformation
78
- 3. Modulating energy levels
79
- 4. Spectral shaping for character
80
- 5. Prosody adjustment
81
- 6. Normalization
 
82
 
83
  ---
84
 
85
- ## 🚀 Features
86
 
87
- ✅ **24 Unique Voices** - Diverse characters
- ✅ **100% Browser-Based** - No server needed
- ✅ **Voice Customization** - Pitch & energy controls
- ✅ **Fast Generation** - 2-5 seconds
- ✅ **High Quality** - SpeechT5 architecture
- ✅ **Offline Capable** - Works after first load
- ✅ **Privacy Focused** - No data sent to servers
- ✅ **Free & Open Source** - Apache 2.0 license
95
 
96
  ---
97
 
98
- ## 💻 How It Works
99
 
100
- ### Voice Profile System
101
  ```javascript
102
- const VOICE_PROFILES = {
103
- af_warm: {
104
- pitch: 0.95, // Slightly lower
105
- energy: 1.1, // More energetic
106
- spectral: 0.2 // Brighter tone
107
- },
108
- am_deep: {
109
- pitch: 0.7, // Much lower
110
- energy: 1.1, // Strong
111
- spectral: -0.5 // Darker tone
112
- },
113
- // ... 24 total profiles
114
- };
115
  ```
116
 
117
- ### Generation Process
118
  ```
119
- User Input Text
120
- ↓
121
- Select Voice Profile
122
- ↓
123
- Load Base Speaker Embedding
124
- ↓
125
- Apply Transformations:
126
- - Pitch modification
127
- - Energy modulation
128
- - Spectral shaping
129
- - User adjustments (pitch/energy sliders)
130
- ↓
131
- Normalize Embedding
132
- ↓
133
- SpeechT5 Generation
134
- ↓
135
- WAV Output
 
 
136
  ```
137
 
138
  ---
139
 
140
- ## 🎯 Use Cases
141
 
142
- **Professional/Corporate:**
143
- - af_clear, am_professional, bf_clear, bm_distinguished
 
 
 
 
 
144
 
145
- **Friendly/Casual:**
146
- - af_warm, am_friendly, bf_bright, int_warm
147
 
148
- **Storytelling/Narration:**
149
- - af_smooth, am_calm, bf_refined, bm_smooth
150
 
151
- **Energetic/Marketing:**
152
- - af_bright, am_strong, bf_bright
 
 
 
153
 
154
  ---
155
 
156
- ## 📊 Comparison
157
 
158
- | Feature | This App | SpeechT5 Basic | Kokoro-82M |
- |---------|----------|----------------|------------|
- | **Voices** | 24 | 1 | 54 |
- | **Browser** | ✅ Yes | ✅ Yes | ❌ No |
- | **Customization** | ✅ Pitch/Energy | ❌ Limited | ✅ Yes |
- | **Server** | ❌ Not needed | ❌ Not needed | ✅ Required |
- | **Speed** | ⚡ Fast | ⚡ Fast | ⏱️ Medium |
 
 
165
 
166
  ---
167
 
168
- ## 🔧 Technical Details
169
 
170
- **Model:** Xenova/speecht5_tts
171
- **Size:** ~50MB (cached after first load)
172
- **Format:** ONNX (quantized)
173
- **Sample Rate:** 16kHz
174
- **Output:** WAV (16-bit PCM)
 
 
 
 
175
 
176
- **Voice Embedding:** 512-dimensional vector
177
- **Transformations:** Pitch, energy, spectral
178
- **Normalization:** Z-score (mean=0, std=1)
 
 
 
 
179
 
180
  ---
181
 
182
- ## 📝 License
183
 
184
  Apache 2.0 - Free for personal and commercial use
185
 
@@ -187,11 +326,21 @@ Apache 2.0 - Free for personal and commercial use
187
 
188
  ## 🙏 Credits
189
 
190
- - **Base Model:** Microsoft SpeechT5
191
  - **ONNX Conversion:** Xenova/transformers.js
192
- - **Voice Profiles:** Custom implementation
193
- - **UI:** Modern glassmorphism design
194
 
195
  ---
196
 
197
- **Built with ❤️ using Transformers.js**
 
1
  ---
2
+ title: Advanced TTS - Real Voices + Voice Cloning
3
  emoji: 🎙️
4
  colorFrom: indigo
5
  colorTo: purple
 
8
  license: apache-2.0
9
  ---
10
 
11
+ # 🎙️ Advanced Text-to-Speech System
12
 
13
+ **7 Authentic Voices + Voice Cloning + Unlimited Text - 100% Browser-Based**
14
 
15
+ ## ✨ Key Features
16
 
17
+ ### 🎭 Dual Voice Modes
18
 
19
+ #### 📚 Preset Voices (7 Authentic Speakers)
20
+ Real speaker embeddings from the CMU ARCTIC dataset:
 
 
 
 
 
21
 
22
+ **🇺🇸 American Voices:**
23
+ - **Sarah (slt)** - Female, Clear & Professional
24
+ - **Clara (clb)** - Female, Warm & Friendly
25
+ - **Ben (bdl)** - Male, Deep & Authoritative
26
+ - **Robert (rms)** - Male, Calm & Relaxed
 
 
27
 
28
+ **🌍 International Voices:**
29
+ - **Andrew (awb)** - Scottish Male, Distinguished
30
+ - **James (jmk)** - Canadian Male, Friendly
31
+ - **Kiran (ksp)** - Indian Male, Professional
 
32
 
33
+ #### 🎤 Voice Cloning Mode
34
+ Upload your own voice sample (up to 1 minute) and the system will:
35
+ - Extract voice characteristics
36
+ - Auto-compress large files
37
+ - Resample to optimal quality (16kHz)
38
+ - Convert stereo to mono
39
+ - Generate 512-dim voice embedding
40
 
41
+ **Supported formats:** WAV, MP3
42
+ **Max duration:** 60 seconds (auto-trim)
43
+ **Processing:** Automatic compression & resampling
 
 
44
 
45
  ---
46
 
47
+ ## 📝 Unlimited Text Processing
48
 
49
+ ### Smart Chunking System
50
+ - **Automatic splitting** - Intelligently splits by sentences
51
+ - **200 chars per chunk** - Optimal for quality & speed
52
+ - **Seamless concatenation** - Merges all chunks into single audio
53
+ - **Real-time progress** - Track each chunk being processed
54
 
55
+ **No character limits!** Type as much text as you want.
 
 
56
 
57
+ ---
58
+
59
+ ## 🎨 Advanced Features
60
+
61
+ ### ⚙️ Audio Controls
62
+ - **Speed Control** - 0.5x to 2.0x playback speed
63
+ - **Real-time adjustment** - Change speed during playback
64
+
65
+ ### 📊 Live Monitoring
66
+ - **Character counter** - Total text length
67
+ - **Word counter** - Word count
68
+ - **Chunk calculator** - Estimated processing chunks
69
+ - **Progress bar** - Visual generation progress
70
+ - **Activity log** - Detailed processing steps
71
+
72
+ ### 💾 Download & Playback
73
+ - **Browser audio player** - Built-in controls
74
+ - **WAV format** - High-quality 16-bit PCM
75
+ - **Download option** - Save generated audio
76
 
77
  ---
78
 
79
+ ## 🏗️ Technical Architecture
80
 
81
+ ### Model & Runtime
82
+ - **Base Model:** Microsoft SpeechT5 (Xenova/speecht5_tts)
83
+ - **Runtime:** ONNX Runtime (WebAssembly)
84
+ - **Framework:** Transformers.js 3.1.2
85
+ - **Execution:** 100% client-side (no server)
86
 
87
+ ### Voice System
88
+ - **Speaker Embeddings:** 512-dimensional x-vectors
89
+ - **Dataset:** CMU ARCTIC (7 speakers)
90
+ - **Cloning:** Web Audio API + spectral analysis
91
+ - **Format:** Float32Array, normalized
92
+
93
+ ### Audio Processing
94
+ ```
95
+ Input Audio
96
+ ↓
97
+ Duration Check (trim if > 60s)
98
+ ↓
99
+ Resample to 16kHz
100
+ ↓
101
+ Convert to Mono
102
+ ↓
103
+ Extract Features (mean, variance, spectral)
104
+ ↓
105
+ Generate 512-dim Embedding
106
+ ↓
107
+ Normalize (L2 norm)
108
+ ↓
109
+ Ready for TTS
110
+ ```
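
The trim, mono-mix, and resample stages of the pipeline above could look roughly like the sketch below. It is an illustration only: the helper names (`toMono`, `resampleLinear`, `prepareSample`) are hypothetical, and the linear-interpolation resampler is an assumption rather than what the app ships. It also applies no anti-aliasing filter, so in a browser an `OfflineAudioContext` render at 16 kHz is likely the better route.

```javascript
// Hypothetical helpers for illustration; not taken from the app's source.
const TARGET_RATE = 16000; // the README resamples voice samples to 16 kHz
const MAX_SECONDS = 60;    // the README trims samples longer than 60 s

// Average all channels into a single mono Float32Array.
function toMono(channels) {
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i] / channels.length;
  }
  return mono;
}

// Resample by linear interpolation between neighbouring input samples.
function resampleLinear(samples, fromRate, toRate = TARGET_RATE) {
  if (fromRate === toRate) return samples;
  const outLength = Math.round((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, samples.length - 1);
    const frac = pos - left;
    out[i] = samples[left] * (1 - frac) + samples[right] * frac;
  }
  return out;
}

// Trim to 60 s, mix down to mono, then resample to 16 kHz.
function prepareSample(channels, sampleRate) {
  const maxInput = MAX_SECONDS * sampleRate;
  const trimmed = channels.map((c) => c.subarray(0, maxInput));
  return resampleLinear(toMono(trimmed), sampleRate);
}
```

In the browser, `channels` would come from a decoded `AudioBuffer`, for example by calling `getChannelData(i)` for each channel.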
111
+
112
+ ### Text Processing Pipeline
113
+ ```
114
+ User Input Text
115
+ ↓
116
+ Split by Sentences
117
+ ↓
118
+ Group into 200-char Chunks
119
+ ↓
120
+ Process Each Chunk:
121
+ - Generate with TTS
122
+ - Use selected voice embedding
123
+ - Update progress
124
+ ↓
125
+ Concatenate All Audio
126
+ ↓
127
+ Encode to WAV
128
+ ↓
129
+ Present to User
130
+ ```
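
A minimal sketch of the per-chunk loop in the pipeline above. The function name `generateFromChunks` and the `onProgress` callback are hypothetical. The synthesizer is passed in as a parameter and is assumed to resolve to an object shaped like `{ audio: Float32Array, sampling_rate: number }`, which should be checked against the Transformers.js version in use.

```javascript
// Hypothetical driver loop; `synthesize` is injected so the loop can be
// exercised without downloading a model.
async function generateFromChunks(chunks, synthesize, speakerEmbeddings, onProgress = () => {}) {
  const parts = [];
  let samplingRate = 16000; // fallback if there are no chunks

  // Chunks are processed one at a time so progress can be reported in order.
  for (let i = 0; i < chunks.length; i++) {
    const { audio, sampling_rate } = await synthesize(chunks[i], {
      speaker_embeddings: speakerEmbeddings,
    });
    parts.push(audio);
    samplingRate = sampling_rate;
    onProgress(i + 1, chunks.length);
  }

  // Concatenate the per-chunk audio into one buffer.
  const total = parts.reduce((sum, part) => sum + part.length, 0);
  const merged = new Float32Array(total);
  let offset = 0;
  for (const part of parts) {
    merged.set(part, offset);
    offset += part.length;
  }
  return { audio: merged, sampling_rate: samplingRate };
}
```

With Transformers.js, `synthesize` would typically be the function returned by `pipeline('text-to-speech', 'Xenova/speecht5_tts')`, called with a `speaker_embeddings` option. Processing chunks sequentially rather than in parallel keeps memory use and progress reporting predictable.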
131
+
132
+ ---
133
+
134
+ ## How It Works
135
+
136
+ ### Preset Voice Generation
137
+ 1. Select voice from dropdown (e.g., "Sarah - Female")
138
+ 2. Enter text (unlimited length)
139
+ 3. Click "Generate Speech"
140
+ 4. System splits text into chunks
141
+ 5. Processes each chunk with selected voice
142
+ 6. Concatenates all audio
143
+ 7. Presents final WAV file
144
+
145
+ ### Voice Cloning Workflow
146
+ 1. Switch to "Voice Clone" mode
147
+ 2. Upload voice sample (WAV/MP3, max 60s)
148
+ 3. Click "Process Voice Sample"
149
+ 4. System extracts voice characteristics
150
+ 5. Enter text to generate
151
+ 6. Click "Generate Speech"
152
+ 7. Your voice clone reads the text!
153
+
154
+ ---
155
+
156
+ ## 💻 Browser Requirements
157
+
158
+ **Minimum Requirements:**
159
+ - Modern browser (Chrome 90+, Firefox 88+, Safari 14+)
160
+ - JavaScript enabled
161
+ - ~100MB RAM for model
162
+ - ~50MB storage for model cache
163
+
164
+ **Optimal Experience:**
165
+ - Chrome/Edge with WebGPU support
166
+ - 4GB+ RAM
167
+ - Fast internet (first load only)
168
 
169
  ---
170
 
171
+ ## 📊 Performance
172
 
173
+ | Metric | Value |
174
+ |--------|-------|
175
+ | **Model Size** | ~50MB (cached after first load) |
176
+ | **Voice Load Time** | ~5-10s (first time only) |
177
+ | **Generation Speed** | ~2-5s per 200 chars |
178
+ | **Sample Rate** | 16kHz |
179
+ | **Audio Format** | WAV (16-bit PCM) |
180
+ | **Max Text Length** | Unlimited (chunked) |
181
 
182
  ---
183
 
184
+ ## 🎯 Use Cases
185
 
186
+ ### Professional
187
+ - **Corporate videos** - Ben (authoritative), Robert (calm)
188
+ - **Training materials** - Sarah (clear), Kiran (professional)
189
+ - **Presentations** - Clara (warm), James (friendly)
190
+
191
+ ### Creative
192
+ - **Audiobooks** - Andrew (distinguished), Robert (relaxed)
193
+ - **Podcasts** - Use voice cloning for consistency
194
+ - **Voice-overs** - Multiple character voices
195
+
196
+ ### Accessibility
197
+ - **Screen readers** - Clear, natural voices
198
+ - **Language learning** - Different accents
199
+ - **Content accessibility** - Convert text to audio
200
+
201
+ ---
202
+
203
+ ## 🔧 Technical Details
204
+
205
+ ### Voice Embedding Extraction (Cloning)
206
  ```javascript
207
+ // Simplified process
+ // 1. Load audio file
+ // 2. Decode to AudioBuffer
+ // 3. Resample to 16kHz if needed
+ // 4. Convert stereo → mono
+ // 5. Split into 512 chunks
+ // 6. Calculate mean & variance per chunk
+ // 7. Combine to create embedding
+ // 8. Normalize (L2 norm = 1)
 
 
 
 
216
  ```
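
Steps 5 to 8 could be sketched as below. The README does not say how the per-chunk mean and variance are combined into 512 values, so this sketch assumes one value per chunk: its RMS level, which equals `sqrt(mean² + variance)`. The function names are hypothetical. A vector built this way only summarizes signal statistics, which fits the "simplified algorithm" caveat under Limitations.

```javascript
// Hypothetical sketch; the combination rule (RMS per segment) is an assumption.
const EMBEDDING_DIM = 512;

// Scale a vector to unit L2 norm; an all-zero vector is returned unchanged.
function l2Normalize(vec) {
  let sumSq = 0;
  for (const v of vec) sumSq += v * v;
  const norm = Math.sqrt(sumSq);
  if (norm === 0) return vec;
  return vec.map((v) => v / norm);
}

// Split mono samples into `dim` equal segments and keep one statistic each.
function extractEmbedding(samples, dim = EMBEDDING_DIM) {
  const segLength = Math.floor(samples.length / dim);
  if (segLength === 0) throw new RangeError(`Need at least ${dim} samples`);

  const embedding = new Float32Array(dim);
  for (let s = 0; s < dim; s++) {
    let sum = 0;
    let sumSq = 0;
    for (let i = s * segLength; i < (s + 1) * segLength; i++) {
      sum += samples[i];
      sumSq += samples[i] * samples[i];
    }
    const mean = sum / segLength;
    const variance = sumSq / segLength - mean * mean;
    // Fold mean and variance into a single value per segment (RMS level).
    embedding[s] = Math.sqrt(Math.max(0, mean * mean + variance));
  }
  return l2Normalize(embedding);
}
```

Samples left over after the last full segment are ignored, which keeps every segment the same length.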
217
 
218
+ ### Chunking Algorithm
219
+ ```javascript
220
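+ function chunkText(text, maxChars = 200) {
+   // Split on sentence boundaries. The second alternative keeps trailing text
+   // without closing punctuation; `?? []` covers match() returning null.
+   const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
+
+   // Group sentences into chunks ≤ maxChars
+   const chunks = [];
+   let currentChunk = "";
+
+   for (const sentence of sentences) {
+     if ((currentChunk + sentence).length <= maxChars) {
+       currentChunk += sentence;
+     } else {
+       // Skip empty chunks (the first sentence may already exceed maxChars)
+       if (currentChunk.trim()) chunks.push(currentChunk.trim());
+       currentChunk = sentence;
+     }
+   }
+
+   // Flush the final chunk, which the loop itself never pushes
+   if (currentChunk.trim()) chunks.push(currentChunk.trim());
+   return chunks;
+ }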
239
  ```
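
The grouping above keeps sentences whole, so a single sentence longer than `maxChars` still produces an oversized chunk. One possible fallback, sketched here with a hypothetical helper name, is to break such a sentence at word boundaries before grouping. This is an assumption about how the limit could be enforced, not something the README describes.

```javascript
// Hypothetical helper: split one over-long sentence into pieces of at most
// maxChars characters, breaking only at whitespace.
function splitLongSentence(sentence, maxChars = 200) {
  const pieces = [];
  let current = "";

  for (const word of sentence.trim().split(/\s+/)) {
    const candidate = current ? `${current} ${word}` : word;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      if (current) pieces.push(current);
      // A single word longer than maxChars is kept whole rather than cut.
      current = word;
    }
  }

  if (current) pieces.push(current);
  return pieces;
}
```

Breaking at whitespace avoids cutting a word in half, which would likely be audible in the synthesized speech.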
240
+
241
+ ### Audio Concatenation
242
+ ```javascript
243
+ function concatenateAudio(audioArrays, sampleRate) {
+   // Calculate total length
+   const totalLength = audioArrays.reduce((sum, arr) =>
+     sum + arr.length, 0);
+
+   // Merge all chunks
+   const result = new Float32Array(totalLength);
+   let offset = 0;
+
+   for (const arr of audioArrays) {
+     result.set(arr, offset);
+     offset += arr.length;
+   }
+
+   return result;
+ }
259
  ```
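
The merged `Float32Array` still has to be written out in the WAV (16-bit PCM) format listed under Performance. A minimal mono encoder might look like the sketch below. The function name `encodeWav` is hypothetical, and the 44-byte header follows the standard RIFF/WAVE PCM layout rather than the app's actual code.

```javascript
// Hypothetical encoder: mono Float32 samples in [-1, 1] to a 16-bit PCM WAV.
function encodeWav(samples, sampleRate = 16000) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                // file size minus 8
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                          // fmt chunk size
  view.setUint16(20, 1, true);                           // format 1 = PCM
  view.setUint16(22, 1, true);                           // channels = mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp each sample and scale it to the signed 16-bit range.
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(44 + i * bytesPerSample, scaled, true);
  }
  return buffer;
}
```

In the browser, the result can be wrapped as `new Blob([encodeWav(audio)], { type: "audio/wav" })` and passed to `URL.createObjectURL` for playback or download.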
260
 
261
  ---
262
 
263
+ ## 🌟 Advantages
264
 
265
+ ✅ **Privacy-Focused** - All processing in your browser
+ ✅ **No Server Costs** - No backend infrastructure needed
+ ✅ **Offline Capable** - Works after initial model download
+ ✅ **Unlimited Usage** - No API limits or quotas
+ ✅ **Fast Generation** - Optimized chunking for speed
+ ✅ **High Quality** - Microsoft SpeechT5 architecture
+ ✅ **Free & Open** - Apache 2.0 license
272
 
273
+ ---
 
274
 
275
+ ## 📝 Limitations
 
276
 
277
+ ⚠️ **Voice Cloning Accuracy** - Simplified algorithm (not production-grade)
278
+ ⚠️ **First Load Time** - ~50MB model download
279
+ ⚠️ **Browser Only** - Requires modern web browser
280
+ ⚠️ **English Optimized** - Best results with English text
281
+ ⚠️ **Memory Usage** - Large texts require more RAM
282
 
283
  ---
284
 
285
+ ## 🔍 Comparison
286
 
287
+ | Feature | This App | Standard SpeechT5 | Cloud TTS APIs |
+ |---------|----------|-------------------|----------------|
+ | **Voices** | 7 real + cloning | 1 default | 100+ |
+ | **Text Length** | Unlimited | Limited | Varies |
+ | **Voice Cloning** | ✅ Yes | ❌ No | ✅ Yes (paid) |
+ | **Privacy** | ✅ 100% local | ✅ 100% local | ❌ Cloud |
+ | **Cost** | Free | Free | Paid |
+ | **Internet** | First load only | First load only | Always |
+ | **Chunking** | ✅ Automatic | ❌ Manual | ✅ Handled |
296
 
297
  ---
298
 
299
+ ## 🛠️ Development
300
 
301
+ ### Project Structure
302
+ ```
303
+ .
+ ├── index.html          # Main application
+ ├── assets/
+ │   └── style.css       # Modern UI styling
+ ├── README.md           # This file
+ └── upload_script.py    # Hugging Face upload utility
309
+ ```
310
 
311
+ ### Technology Stack
312
+ - **Frontend:** Vanilla JavaScript (ES6+)
313
+ - **ML Framework:** Transformers.js
314
+ - **Runtime:** ONNX Runtime (WASM)
315
+ - **Audio Processing:** Web Audio API
316
+ - **Model:** Xenova/speecht5_tts
317
+ - **Embeddings:** CMU ARCTIC x-vectors
318
 
319
  ---
320
 
321
+ ## 📄 License
322
 
323
  Apache 2.0 - Free for personal and commercial use
324
 
 
326
 
327
  ## 🙏 Credits
328
 
329
+ - **SpeechT5 Model:** Microsoft Research
330
  - **ONNX Conversion:** Xenova/transformers.js
331
+ - **Speaker Dataset:** CMU ARCTIC
332
+ - **UI Design:** Modern glassmorphism
333
+ - **Voice Cloning:** Web Audio API
334
+
335
+ ---
336
+
337
+ ## 📚 Resources
338
+
339
+ - [Transformers.js Docs](https://huggingface.co/docs/transformers.js)
340
+ - [SpeechT5 Paper](https://arxiv.org/abs/2110.07205)
341
+ - [CMU ARCTIC Dataset](http://www.festvox.org/cmu_arctic/)
342
+ - [Web Audio API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API)
343
 
344
  ---
345
 
346
+ **Built with ❤️ using Transformers.js - Bringing AI to the Browser**