Space: WSYBYT/ybtts

Solution: Multi-Voice TTS with Transformers.js (Browser-Only)

by masbudjj - opened Oct 22, 2025

base: refs/heads/main ← from: refs/pr/9

Files changed (1)

README.md +149 -88

@@ -1,136 +1,197 @@
 ---
-title: Kokoro-82M TTS - 54 Premium Voices
 emoji: 🎙️
 colorFrom: indigo
 colorTo: purple
-sdk: gradio
-sdk_version: 4.44.0
-app_file: app.py
 pinned: false
 license: apache-2.0
 ---
-# 🎙️ Kokoro-82M Text-to-Speech
-**World-Class TTS with 54 Premium Voices**
 ## ✨ Features
-### 🎭 54 Premium Voices
-#### 🇺🇸 American English (19 voices)
-**Female (11 voices):**
-- Heart - Warm & Friendly
-- Bella - Elegant & Smooth
-- Nicole - Professional
-- Aoede - Cheerful
-- Kore - Gentle
-- Sarah - Clear
-- Nova - Modern
-- Sky - Light
-- Alloy - Versatile
-- Jessica - Natural
-- River - Calm
-**Male (8 voices):**
-- Michael - Deep & Authoritative
-- Fenrir - Strong
-- Puck - Playful
-- Echo - Resonant
-- Eric - Professional
-- Liam - Friendly
-- Onyx - Rich
-- Adam - Natural
-#### 🇬🇧 British English (8 voices)
-**Female (4 voices):**
-- Emma - Refined
-- Isabella - Elegant
-- Alice - Clear
-- Lily - Soft
-**Male (4 voices):**
-- George - Distinguished
-- Fable - Storyteller
-- Lewis - Smooth
-- Daniel - Professional
 ---
-## 🏗️ Model Architecture
-**Kokoro-82M** based on **StyleTTS 2**:
-- **Parameters**: 82 Million
-- **Decoder**: ISTFTNet
-- **Training**: Few hundred hours of permissive data
-- **License**: Apache 2.0
-- **Paper**: [StyleTTS 2 (arxiv.org/abs/2306.07691)](https://arxiv.org/abs/2306.07691)
 ---
-## 🎯 Features
-✅ **54 Unique Voices** - American & British accents
-✅ **Natural Prosody** - Human-like intonation
-✅ **Fast Generation** - 2-5 seconds per sentence
-✅ **Speed Control** - 0.5x to 2x playback
-✅ **High Quality** - StyleTTS 2 architecture
-✅ **Open Source** - Apache 2.0 license
 ---
-## 💻 Technology Stack
-- **Backend**: Gradio + Hugging Face Inference API
-- **Model**: Kokoro-82M (hexgrad/Kokoro-82M)
-- **Architecture**: StyleTTS 2 + ISTFTNet
-- **Deployment**: Hugging Face Spaces
 ---
-## 🚀 Usage
-1. **Choose Voice** - Select from 54 premium voices
-2. **Enter Text** - Type or paste your content
-3. **Adjust Speed** - Control playback rate (0.5x - 2x)
-4. **Generate** - Click to synthesize speech
-5. **Download** - Save audio as WAV file
 ---
-## 📊 Comparison with Other Models
-| Feature | Kokoro-82M | SpeechT5 | VITS |
-|---------|-----------|----------|------|
-| **Voices** | 54 | 1 | Variable |
-| **Quality** | Excellent | Good | Good |
-| **Speed** | Fast | Medium | Fast |
-| **Accents** | US/UK | Generic | Variable |
-| **License** | Apache 2.0 | Apache 2.0 | MIT |
 ---
-## 🎓 Credits
-- **Model**: [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M)
-- **Base Architecture**: StyleTTS 2 by Li et al.
-- **Decoder**: ISTFTNet
-- **Training**: Ethical permissive-licensed data only
 ---
 ## 📝 License
-Apache 2.0 - Free for commercial use
 ---
-## 🔗 Links
-- 📄 [Model Card](https://huggingface.co/hexgrad/Kokoro-82M)
-- 📜 [StyleTTS 2 Paper](https://arxiv.org/abs/2306.07691)
-- 🐙 [GitHub (ONNX)](https://github.com/thewh1teagle/kokoro-onnx)
 ---
-**Built with ❤️ using Kokoro-82M & Gradio**

 ---
+title: Multi-Voice TTS - 24 Unique Voices
 emoji: 🎙️
 colorFrom: indigo
 colorTo: purple
+sdk: static
 pinned: false
 license: apache-2.0
 ---
+# 🎙️ Multi-Voice Text-to-Speech
+**24 Unique Voices - 100% Browser-Based - No Server Required**
 ## ✨ Features
+### 🎭 24 Unique Voice Characters
+#### 🇺🇸 American Female (6 voices)
+- **Default** - Neutral baseline
+- **Warm** - Friendly & caring
+- **Bright** - Energetic & happy
+- **Soft** - Gentle & calm
+- **Clear** - Professional
+- **Smooth** - Elegant
+#### 🇺🇸 American Male (6 voices)
+- **Default** - Neutral baseline
+- **Deep** - Authoritative
+- **Friendly** - Approachable
+- **Strong** - Confident
+- **Calm** - Relaxed
+- **Professional** - Business-oriented
+#### 🇬🇧 British Female (4 voices)
+- **Refined** - Elegant
+- **Bright** - Cheerful
+- **Soft** - Gentle
+- **Clear** - Articulate
+#### 🇬🇧 British Male (4 voices)
+- **Distinguished** - Formal
+- **Smooth** - Sophisticated
+- **Warm** - Friendly
+- **Strong** - Commanding
+#### 🌏 International (4 voices)
+- **Neutral** - Standard
+- **Soft** - Gentle
+- **Clear** - Professional
+- **Warm** - Friendly
 ---
+## 🎨 Voice Customization
+Each voice can be further customized with:
+- **Pitch Control** (0.5x - 1.5x) - Adjust voice pitch
+- **Energy Control** (0.5x - 1.5x) - Modify speaking energy
+- **Speed Control** (0.5x - 2.0x) - Playback speed
+**Combinations:** Each of the 24 voices can be adjusted continuously across the pitch, energy, and speed ranges above.
 ---
+## 🏗️ Technology
+### Base Model
+- **SpeechT5** from Microsoft
+- **ONNX Runtime** for browser execution
+- **WebAssembly** backend
+### Voice Generation
+Each of the 24 voices is created by:
+1. Taking the base speaker embedding (512-dim)
+2. Applying a pitch transformation
+3. Modulating energy levels
+4. Applying spectral shaping for character
+5. Adjusting prosody
+6. Normalizing the result
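A minimal sketch of the final normalization step, assuming it is the z-score (mean 0, std 1) normalization named under Technical Details. The function name `zScoreNormalize` and the use of the population standard deviation are assumptions for illustration; the PR does not show its implementation.

```javascript
// Sketch only: z-score normalization of a speaker embedding.
// `zScoreNormalize` is a hypothetical name, not taken from the PR.
function zScoreNormalize(embedding) {
  const n = embedding.length;
  if (n === 0) throw new Error("embedding must not be empty");

  // Mean of all components.
  let sum = 0;
  for (const v of embedding) sum += v;
  const mean = sum / n;

  // Population standard deviation (divides by n, an assumption).
  let squares = 0;
  for (const v of embedding) squares += (v - mean) ** 2;
  const std = Math.sqrt(squares / n);

  const out = new Float32Array(n);
  if (std === 0) return out; // constant input: return all zeros instead of dividing by zero

  for (let i = 0; i < n; i++) out[i] = (embedding[i] - mean) / std;
  return out;
}
```

Calling `zScoreNormalize(new Float32Array(512).map((_, i) => i * 0.01))` returns a new 512-element `Float32Array` whose mean is approximately 0 and whose standard deviation is approximately 1, leaving the input untouched.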
 ---
+## 🚀 Highlights
+✅ **24 Unique Voices** - Diverse characters
+✅ **100% Browser-Based** - No server needed
+✅ **Voice Customization** - Pitch & energy controls
+✅ **Fast Generation** - 2-5 seconds
+✅ **High Quality** - SpeechT5 architecture
+✅ **Offline Capable** - Works after first load
+✅ **Privacy Focused** - No data sent to servers
+✅ **Free & Open Source** - Apache 2.0 license
+---
+## 💻 How It Works
+### Voice Profile System
+```javascript
+const VOICE_PROFILES = {
+  af_warm: {
+    pitch: 0.95,    // Slightly lower
+    energy: 1.1,    // More energetic
+    spectral: 0.2   // Brighter tone
+  },
+  am_deep: {
+    pitch: 0.7,     // Much lower
+    energy: 1.1,    // Strong
+    spectral: -0.5  // Darker tone
+  },
+  // ... 24 total profiles
+};
+```
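One way the profile values could be combined with the pitch and energy sliders is sketched below. This is an assumption, not the PR's code: the README does not say how slider values and profile values interact, so the sketch multiplies them and clamps the sliders to the documented 0.5x - 1.5x range. `resolveVoice` and `clamp` are hypothetical helpers, and only the two profiles shown above are included.

```javascript
// Sketch only: merge a voice profile with user slider settings.
// The multiplicative combination is an assumption; the PR does not specify it.
const VOICE_PROFILES = {
  af_warm: { pitch: 0.95, energy: 1.1, spectral: 0.2 },
  am_deep: { pitch: 0.7, energy: 1.1, spectral: -0.5 },
};

// Restrict a value to the closed interval [lo, hi].
const clamp = (value, lo, hi) => Math.min(hi, Math.max(lo, value));

function resolveVoice(voiceId, sliders = {}) {
  const profile = VOICE_PROFILES[voiceId];
  if (!profile) throw new Error(`Unknown voice: ${voiceId}`);

  // Slider ranges follow the Voice Customization section (0.5x - 1.5x).
  const pitchSlider = clamp(sliders.pitch ?? 1.0, 0.5, 1.5);
  const energySlider = clamp(sliders.energy ?? 1.0, 0.5, 1.5);

  return {
    pitch: profile.pitch * pitchSlider,
    energy: profile.energy * energySlider,
    spectral: profile.spectral, // no slider is documented for spectral shaping
  };
}
```

With no sliders set, `resolveVoice("af_warm")` returns the profile values unchanged; an out-of-range slider such as `{ pitch: 2 }` is clamped to 1.5 before it is applied.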
+### Generation Process
+```
+User Input Text
+     ↓
+Select Voice Profile
+     ↓
+Load Base Speaker Embedding
+     ↓
+Apply Transformations:
+  - Pitch modification
+  - Energy modulation
+  - Spectral shaping
+  - User adjustments (pitch/energy sliders)
+     ↓
+Normalize Embedding
+     ↓
+SpeechT5 Generation
+     ↓
+WAV Output
+```
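The last two stages of the flow above (a prepared embedding going into SpeechT5) might look roughly like the sketch below. It assumes the Transformers.js `text-to-speech` pipeline, which accepts a `speaker_embeddings` option and resolves to an object with `audio` and `sampling_rate`. The pipeline is passed in as a parameter so the function stays testable without downloading the model; `generateSpeech` and `EMBEDDING_DIM` are hypothetical names.

```javascript
// Sketch only: run the "SpeechT5 Generation" step with an injected pipeline.
// `generateSpeech` and `EMBEDDING_DIM` are hypothetical names, not taken from the PR.
//
// In the browser, `synthesizer` would be created with Transformers.js, e.g.:
//   import { pipeline } from "@xenova/transformers";
//   const synthesizer = await pipeline("text-to-speech", "Xenova/speecht5_tts");
const EMBEDDING_DIM = 512;

async function generateSpeech(synthesizer, text, embedding) {
  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("text must be a non-empty string");
  }
  if (!(embedding instanceof Float32Array) || embedding.length !== EMBEDDING_DIM) {
    throw new Error(`embedding must be a Float32Array of length ${EMBEDDING_DIM}`);
  }

  // The pipeline returns raw float samples plus their sampling rate.
  const { audio, sampling_rate } = await synthesizer(text, {
    speaker_embeddings: embedding,
  });
  return { audio, sampleRate: sampling_rate };
}
```

Injecting the pipeline also means the model is loaded once by the caller and reused across requests, rather than being re-created on every call.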
 ---
+## 🎯 Use Cases
+**Professional/Corporate:**
+- af_clear, am_professional, bf_clear, bm_distinguished
+**Friendly/Casual:**
+- af_warm, am_friendly, bf_bright, int_warm
+**Storytelling/Narration:**
+- af_smooth, am_calm, bf_refined, bm_smooth
+**Energetic/Marketing:**
+- af_bright, am_strong, bf_bright
 ---
+## 📊 Comparison
+| Feature | This App | SpeechT5 Basic | Kokoro-82M |
+|---------|----------|----------------|------------|
+| **Voices** | 24 | 1 | 54 |
+| **Browser** | ✅ Yes | ✅ Yes | ❌ No |
+| **Customization** | ✅ Pitch/Energy | ❌ Limited | ✅ Yes |
+| **Server** | ❌ Not needed | ❌ Not needed | ✅ Required |
+| **Speed** | ⚡ Fast | ⚡ Fast | ⏱️ Medium |
 ---
+## 🔧 Technical Details
+**Model:** Xenova/speecht5_tts
+**Size:** ~50 MB (cached after first load)
+**Format:** ONNX (quantized)
+**Sample Rate:** 16 kHz
+**Output:** WAV (16-bit PCM)
+**Voice Embedding:** 512-dimensional vector
+**Transformations:** Pitch, energy, spectral
+**Normalization:** Z-score (mean=0, std=1)
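As an illustration of the output format listed above, float samples can be packed into a 16-bit PCM WAV container as sketched here. The mono channel layout and the function name `encodeWav` are assumptions; the PR does not show how it writes its WAV files.

```javascript
// Sketch only: encode float samples in [-1, 1] as a mono 16-bit PCM WAV file.
// `encodeWav` is a hypothetical name; mono output is an assumption.
function encodeWav(samples, sampleRate = 16000) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte header + sample data
  const view = new DataView(buffer);

  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                 // remaining file size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                           // fmt chunk size
  view.setUint16(20, 1, true);                            // audio format 1 = PCM
  view.setUint16(22, 1, true);                            // channel count (mono)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true);  // byte rate
  view.setUint16(32, bytesPerSample, true);               // block align
  view.setUint16(34, 16, true);                           // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp each sample, then scale to the signed 16-bit range.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return buffer;
}
```

In a browser, the returned buffer can be wrapped as `new Blob([buffer], { type: "audio/wav" })` and handed to an `<audio>` element or a download link.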
 ---
 ## 📝 License
+Apache 2.0 - Free for personal and commercial use
 ---
+## 🙏 Credits
+- **Base Model:** Microsoft SpeechT5
+- **ONNX Conversion:** Xenova/transformers.js
+- **Voice Profiles:** Custom implementation
+- **UI:** Modern glassmorphism design
 ---
+**Built with ❤️ using Transformers.js**