Solution: Multi-Voice TTS with Transformers.js (Browser-Only)

#9
by masbudjj - opened
Files changed (1)
  1. README.md +149 -88
README.md CHANGED
@@ -1,136 +1,197 @@
  ---
- title: Kokoro-82M TTS - 54 Premium Voices
  emoji: 🎙️
  colorFrom: indigo
  colorTo: purple
- sdk: gradio
- sdk_version: 4.44.0
- app_file: app.py
  pinned: false
  license: apache-2.0
  ---
 
- # 🎙️ Kokoro-82M Text-to-Speech
 
- **World-Class TTS with 54 Premium Voices**
 
  ## ✨ Features
 
- ### 🎭 54 Premium Voices
-
- #### 🇺🇸 American English (19 voices)
- **Female (11 voices):**
- - Heart - Warm & Friendly
- - Bella - Elegant & Smooth
- - Nicole - Professional
- - Aoede - Cheerful
- - Kore - Gentle
- - Sarah - Clear
- - Nova - Modern
- - Sky - Light
- - Alloy - Versatile
- - Jessica - Natural
- - River - Calm
-
- **Male (8 voices):**
- - Michael - Deep & Authoritative
- - Fenrir - Strong
- - Puck - Playful
- - Echo - Resonant
- - Eric - Professional
- - Liam - Friendly
- - Onyx - Rich
- - Adam - Natural
-
- #### 🇬🇧 British English (8 voices)
- **Female (4 voices):**
- - Emma - Refined
- - Isabella - Elegant
- - Alice - Clear
- - Lily - Soft
-
- **Male (4 voices):**
- - George - Distinguished
- - Fable - Storyteller
- - Lewis - Smooth
- - Daniel - Professional
 
  ---
 
- ## 🏗️ Model Architecture
 
- **Kokoro-82M** based on **StyleTTS 2**:
- - **Parameters**: 82 Million
- - **Decoder**: ISTFTNet
- - **Training**: Few hundred hours of permissive data
- - **License**: Apache 2.0
- - **Paper**: [StyleTTS 2 (arxiv.org/abs/2306.07691)](https://arxiv.org/abs/2306.07691)
 
  ---
 
- ## 🎯 Features
 
- ✅ **54 Unique Voices** - American & British accents
- ✅ **Natural Prosody** - Human-like intonation
- ✅ **Fast Generation** - 2-5 seconds per sentence
- ✅ **Speed Control** - 0.5x to 2x playback
- ✅ **High Quality** - StyleTTS 2 architecture
- ✅ **Open Source** - Apache 2.0 license
 
  ---
 
- ## 💻 Technology Stack
 
- - **Backend**: Gradio + Hugging Face Inference API
- - **Model**: Kokoro-82M (hexgrad/Kokoro-82M)
- - **Architecture**: StyleTTS 2 + ISTFTNet
- - **Deployment**: Hugging Face Spaces
 
  ---
 
- ## 🚀 Usage
 
- 1. **Choose Voice** - Select from 54 premium voices
- 2. **Enter Text** - Type or paste your content
- 3. **Adjust Speed** - Control playback rate (0.5x - 2x)
- 4. **Generate** - Click to synthesize speech
- 5. **Download** - Save audio as WAV file
 
  ---
 
- ## 📊 Comparison with Other Models
 
- | Feature | Kokoro-82M | SpeechT5 | VITS |
- |---------|-----------|----------|------|
- | **Voices** | 54 | 1 | Variable |
- | **Quality** | Excellent | Good | Good |
- | **Speed** | Fast | Medium | Fast |
- | **Accents** | US/UK | Generic | Variable |
- | **License** | Apache 2.0 | Apache 2.0 | MIT |
 
  ---
 
- ## 🎓 Credits
 
- - **Model**: [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M)
- - **Base Architecture**: StyleTTS 2 by Li et al.
- - **Decoder**: ISTFTNet
- - **Training**: Ethical permissive-licensed data only
 
  ---
 
  ## 📝 License
 
- Apache 2.0 - Free for commercial use
 
  ---
 
- ## 🔗 Links
 
- - 📄 [Model Card](https://huggingface.co/hexgrad/Kokoro-82M)
- - 📜 [StyleTTS 2 Paper](https://arxiv.org/abs/2306.07691)
- - 🐙 [GitHub (ONNX)](https://github.com/thewh1teagle/kokoro-onnx)
 
  ---
 
- **Built with ❤️ using Kokoro-82M & Gradio**
 
  ---
+ title: Multi-Voice TTS - 24 Unique Voices
  emoji: 🎙️
  colorFrom: indigo
  colorTo: purple
+ sdk: static
  pinned: false
  license: apache-2.0
  ---
 
+ # 🎙️ Multi-Voice Text-to-Speech
 
+ **24 Unique Voices - 100% Browser-Based - No Server Required**
 
  ## ✨ Features
 
+ ### 🎭 24 Unique Voice Characters
+
+ #### 🇺🇸 American Female (6 voices)
+ - **Default** - Neutral baseline
+ - **Warm** - Friendly & caring
+ - **Bright** - Energetic & happy
+ - **Soft** - Gentle & calm
+ - **Clear** - Professional
+ - **Smooth** - Elegant
+
+ #### 🇺🇸 American Male (6 voices)
+ - **Default** - Neutral baseline
+ - **Deep** - Authoritative
+ - **Friendly** - Approachable
+ - **Strong** - Confident
+ - **Calm** - Relaxed
+ - **Professional** - Business-oriented
+
+ #### 🇬🇧 British Female (4 voices)
+ - **Refined** - Elegant
+ - **Bright** - Cheerful
+ - **Soft** - Gentle
+ - **Clear** - Articulate
+
+ #### 🇬🇧 British Male (4 voices)
+ - **Distinguished** - Formal
+ - **Smooth** - Sophisticated
+ - **Warm** - Friendly
+ - **Strong** - Commanding
+
+ #### 🌏 International (4 voices)
+ - **Neutral** - Standard
+ - **Soft** - Gentle
+ - **Clear** - Professional
+ - **Warm** - Friendly
 
  ---
 
+ ## 🎨 Voice Customization
 
+ Each voice can be further customized with:
+
+ - **Pitch Control** (0.5x - 1.5x) - Adjust voice pitch
+ - **Energy Control** (0.5x - 1.5x) - Modify speaking energy
+ - **Speed Control** (0.5x - 2.0x) - Playback speed
+
+ **Total Combinations:** 24 voices × unlimited pitch/energy variations = **Infinite possibilities!**
 
  ---
 
+ ## 🏗️ Technology
+
+ ### Base Model
+ - **SpeechT5** from Microsoft
+ - **ONNX Runtime** for browser execution
+ - **WebAssembly** backend
 
+ ### Voice Generation
+ Each of the 24 voices is created by:
+ 1. Taking base speaker embedding (512-dim)
+ 2. Applying pitch transformation
+ 3. Modulating energy levels
+ 4. Spectral shaping for character
+ 5. Prosody adjustment
+ 6. Normalization
 
  ---
 
+ ## 🚀 Features
+
+ ✅ **24 Unique Voices** - Diverse characters
+ ✅ **100% Browser-Based** - No server needed
+ ✅ **Voice Customization** - Pitch & energy controls
+ ✅ **Fast Generation** - 2-5 seconds
+ ✅ **High Quality** - SpeechT5 architecture
+ ✅ **Offline Capable** - Works after first load
+ ✅ **Privacy Focused** - No data sent to servers
+ ✅ **Free & Open Source** - Apache 2.0 license
+
+ ---
 
+ ## 💻 How It Works
+
+ ### Voice Profile System
+ ```javascript
+ const VOICE_PROFILES = {
+   af_warm: {
+     pitch: 0.95, // Slightly lower
+     energy: 1.1, // More energetic
+     spectral: 0.2 // Brighter tone
+   },
+   am_deep: {
+     pitch: 0.7, // Much lower
+     energy: 1.1, // Strong
+     spectral: -0.5 // Darker tone
+   },
+   // ... 24 total profiles
+ };
+ ```
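One way a voice profile and the user's sliders might be combined is sketched below. This is an illustration under stated assumptions, not the app's actual code: the helper name `resolveVoiceSettings`, the multiplicative combination, and the clamping are all hypothetical. Only the profile fields and the 0.5x to 1.5x slider range for pitch and energy come from the README.

```javascript
// Hypothetical helper: merges a voice profile with user slider values.
// Slider range follows the README (pitch and energy: 0.5x to 1.5x).
const SLIDER_RANGE = { min: 0.5, max: 1.5 };

// Two sample profiles copied from the README's VOICE_PROFILES excerpt.
const VOICE_PROFILES = {
  af_warm: { pitch: 0.95, energy: 1.1, spectral: 0.2 },
  am_deep: { pitch: 0.7, energy: 1.1, spectral: -0.5 },
};

function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

function resolveVoiceSettings(voiceId, sliders = {}) {
  const profile = VOICE_PROFILES[voiceId];
  if (!profile) {
    throw new Error(`Unknown voice profile: ${voiceId}`);
  }
  // Sliders default to 1.0 (no change) and are clamped to the documented range.
  const pitchSlider = clamp(sliders.pitch ?? 1.0, SLIDER_RANGE.min, SLIDER_RANGE.max);
  const energySlider = clamp(sliders.energy ?? 1.0, SLIDER_RANGE.min, SLIDER_RANGE.max);
  return {
    // Assumption: profile and slider multipliers combine by multiplication.
    pitch: profile.pitch * pitchSlider,
    energy: profile.energy * energySlider,
    spectral: profile.spectral,
  };
}
```

Calling `resolveVoiceSettings('am_deep', { pitch: 2.0 })` clamps the out-of-range slider to 1.5 before applying it to the profile's pitch of 0.7.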
+
+ ### Generation Process
+ ```
+ User Input Text
+ ↓
+ Select Voice Profile
+ ↓
+ Load Base Speaker Embedding
+ ↓
+ Apply Transformations:
+ - Pitch modification
+ - Energy modulation
+ - Spectral shaping
+ - User adjustments (pitch/energy sliders)
+ ↓
+ Normalize Embedding
+ ↓
+ SpeechT5 Generation
+ ↓
+ WAV Output
+ ```
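The "Normalize Embedding" step in the flow above can be illustrated with a plain z-score, which is what the README's Technical Details name (mean=0, std=1). The sketch below is a minimal version under assumptions: the function name `zScoreNormalize`, the use of the population standard deviation, and the small `epsilon` guard are not taken from the PR.

```javascript
// Hypothetical helper: z-score normalizes a speaker embedding so that its
// values have mean 0 and standard deviation 1.
function zScoreNormalize(embedding, epsilon = 1e-8) {
  const n = embedding.length;
  if (n === 0) {
    throw new Error('Embedding must not be empty');
  }

  let sum = 0;
  for (const v of embedding) sum += v;
  const mean = sum / n;

  let squaredDiffs = 0;
  for (const v of embedding) squaredDiffs += (v - mean) ** 2;
  // Population standard deviation; epsilon guards against division by zero
  // when every value in the embedding is identical.
  const std = Math.sqrt(squaredDiffs / n);

  const out = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    out[i] = (embedding[i] - mean) / (std + epsilon);
  }
  return out;
}
```

For a 512-dimensional `Float32Array`, the result has the same length, with values centered on zero and unit spread up to floating-point rounding.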
 
  ---
 
+ ## 🎯 Use Cases
+
+ **Professional/Corporate:**
+ - af_clear, am_professional, bf_clear, bm_distinguished
+
+ **Friendly/Casual:**
+ - af_warm, am_friendly, bf_bright, int_warm
 
+ **Storytelling/Narration:**
+ - af_smooth, am_calm, bf_refined, bm_smooth
+
+ **Energetic/Marketing:**
+ - af_bright, am_strong, bf_bright
 
  ---
 
+ ## 📊 Comparison
 
+ | Feature | This App | SpeechT5 Basic | Kokoro-82M |
+ |---------|----------|----------------|------------|
+ | **Voices** | 24 | 1 | 54 |
+ | **Browser** | ✅ Yes | ✅ Yes | ❌ No |
+ | **Customization** | ✅ Pitch/Energy | ❌ Limited | ✅ Yes |
+ | **Server** | ❌ Not needed | ❌ Not needed | ✅ Required |
+ | **Speed** | ⚡ Fast | ⚡ Fast | ⏱️ Medium |
 
  ---
 
+ ## 🔧 Technical Details
+
+ **Model:** Xenova/speecht5_tts
+ **Size:** ~50MB (cached after first load)
+ **Format:** ONNX (quantized)
+ **Sample Rate:** 16kHz
+ **Output:** WAV (16-bit PCM)
 
+ **Voice Embedding:** 512-dimensional vector
+ **Transformations:** Pitch, energy, spectral
+ **Normalization:** Z-score (mean=0, std=1)
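The output format listed above (16 kHz, 16-bit PCM WAV) can be produced from raw float samples with a small encoder. The sketch below is one possible approach, not the PR's code: the function name `encodeWav` is hypothetical, and it assumes mono audio supplied as floats in the range [-1, 1].

```javascript
// Hypothetical helper: wraps mono float samples in a 16-bit PCM WAV container.
function encodeWav(samples, sampleRate = 16000) {
  const bytesPerSample = 2; // 16-bit PCM
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeString = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  // RIFF header
  writeString(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeString(8, 'WAVE');

  // "fmt " chunk
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true); // chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // one channel (mono)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample

  // "data" chunk
  writeString(36, 'data');
  view.setUint32(40, dataSize, true);

  // Clamp floats to [-1, 1] and scale to signed 16-bit integers.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

In a browser, the returned buffer could be wrapped as `new Blob([encodeWav(audio)], { type: 'audio/wav' })` to offer a download, where `audio` is assumed to be the `Float32Array` of samples returned by the synthesis step.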
 
 
  ---
 
  ## 📝 License
 
+ Apache 2.0 - Free for personal and commercial use
 
  ---
 
+ ## 🙏 Credits
 
+ - **Base Model:** Microsoft SpeechT5
+ - **ONNX Conversion:** Xenova/transformers.js
+ - **Voice Profiles:** Custom implementation
+ - **UI:** Modern glassmorphism design
 
  ---
 
+ **Built with ❤️ using Transformers.js**