Space: WSYBYT/ybtts


Update README.md - Ultimate TTS with 900+ voices

#17

by masbudjj - opened Oct 22, 2025

base: refs/heads/main ← from: refs/pr/17


README.md +261 -273


@@ -1,346 +1,334 @@
 ---
-title: Advanced TTS - Real Voices + Voice Cloning
 emoji: 🎙️
-colorFrom: indigo
 colorTo: purple
 sdk: static
-pinned: false
 license: apache-2.0
 ---
-# 🎙️ Advanced Text-to-Speech System
-**7 Authentic Voices + Voice Cloning + Unlimited Text - 100% Browser-Based**
-## ✨ Key Features
-### 🎭 Dual Voice Modes
-#### 📚 Preset Voices (7 Authentic Speakers)
-Real speaker embeddings from the CMU ARCTIC dataset:
-**🇺🇸 American Voices:**
-- **Sarah (slt)** - Female, Clear & Professional
-- **Clara (clb)** - Female, Warm & Friendly
-- **Ben (bdl)** - Male, Deep & Authoritative
-- **Robert (rms)** - Male, Calm & Relaxed
-**🌍 International Voices:**
-- **Andrew (awb)** - Scottish Male, Distinguished
-- **James (jmk)** - Canadian Male, Friendly
-- **Kiran (ksp)** - Indian Male, Professional
-#### 🎤 Voice Cloning Mode
-Upload your own voice sample (up to 1 minute) and the system will:
-- Extract voice characteristics
-- Auto-compress large files
-- Resample to optimal quality (16kHz)
-- Convert stereo to mono
-- Generate 512-dim voice embedding
-**Supported formats:** WAV, MP3
-**Max duration:** 60 seconds (auto-trim)
-**Processing:** Automatic compression & resampling
----
-## 📝 Unlimited Text Processing
-### Smart Chunking System
-- **Automatic splitting** - Intelligently splits by sentences
-- **200 chars per chunk** - Optimal for quality & speed
-- **Seamless concatenation** - Merges all chunks into single audio
-- **Real-time progress** - Track each chunk being processed
-**No character limits!** Type as much text as you want.
----
-## 🎨 Advanced Features
-### ⚙️ Audio Controls
-- **Speed Control** - 0.5x to 2.0x playback speed
-- **Real-time adjustment** - Change speed during playback
-### 📊 Live Monitoring
-- **Character counter** - Total text length
-- **Word counter** - Word count
-- **Chunk calculator** - Estimated processing chunks
-- **Progress bar** - Visual generation progress
-- **Activity log** - Detailed processing steps
-### 💾 Download & Playback
-- **Browser audio player** - Built-in controls
-- **WAV format** - High-quality 16-bit PCM
-- **Download option** - Save generated audio
----
-## 🏗️ Technical Architecture
-### Model & Runtime
-- **Base Model:** Microsoft SpeechT5 (Xenova/speecht5_tts)
-- **Runtime:** ONNX Runtime (WebAssembly)
-- **Framework:** Transformers.js 3.1.2
-- **Execution:** 100% client-side (no server)
-### Voice System
-- **Speaker Embeddings:** 512-dimensional x-vectors
-- **Dataset:** CMU ARCTIC (7 speakers)
-- **Cloning:** Web Audio API + spectral analysis
-- **Format:** Float32Array, normalized
-### Audio Processing
-```javascript
-Input Audio
-    ↓
-Duration Check (trim if > 60s)
-    ↓
-Resample to 16kHz
-    ↓
-Convert to Mono
-    ↓
-Extract Features (mean, variance, spectral)
-    ↓
-Generate 512-dim Embedding
-    ↓
-Normalize (L2 norm)
-    ↓
-Ready for TTS
-```
-### Text Processing Pipeline
-```javascript
-User Input Text
-    ↓
-Split by Sentences
-    ↓
-Group into 200-char Chunks
-    ↓
-Process Each Chunk:
-  - Generate with TTS
-  - Use selected voice embedding
-  - Update progress
-    ↓
-Concatenate All Audio
-    ↓
-Encode to WAV
-    ↓
-Present to User
-```
----
-## 🚀 How It Works
-### Preset Voice Generation
-1. Select voice from dropdown (e.g., "Sarah - Female")
-2. Enter text (unlimited length)
-3. Click "Generate Speech"
-4. System splits text into chunks
-5. Processes each chunk with selected voice
-6. Concatenates all audio
-7. Presents final WAV file
-### Voice Cloning Workflow
-1. Switch to "Voice Clone" mode
-2. Upload voice sample (WAV/MP3, max 60s)
-3. Click "Process Voice Sample"
-4. System extracts voice characteristics
-5. Enter text to generate
-6. Click "Generate Speech"
-7. Your voice clone reads the text!
----
-## 💻 Browser Requirements
-**Minimum Requirements:**
-- Modern browser (Chrome 90+, Firefox 88+, Safari 14+)
-- JavaScript enabled
-- ~100MB RAM for model
-- ~50MB storage for model cache
-**Optimal Experience:**
-- Chrome/Edge with WebGPU support
-- 4GB+ RAM
-- Fast internet (first load only)
----
-## 📊 Performance
-| Metric | Value |
-|--------|-------|
-| **Model Size** | ~50MB (cached after first load) |
-| **Voice Load Time** | ~5-10s (first time only) |
-| **Generation Speed** | ~2-5s per 200 chars |
-| **Sample Rate** | 16kHz |
-| **Audio Format** | WAV (16-bit PCM) |
-| **Max Text Length** | Unlimited (chunked) |
----
 ## 🎯 Use Cases
-### Professional
-- **Corporate videos** - Ben (authoritative), Robert (calm)
-- **Training materials** - Sarah (clear), Kiran (professional)
-- **Presentations** - Clara (warm), James (friendly)
-### Creative
-- **Audiobooks** - Andrew (distinguished), Robert (relaxed)
-- **Podcasts** - Use voice cloning for consistency
-- **Voice-overs** - Multiple character voices
 ### Accessibility
-- **Screen readers** - Clear, natural voices
-- **Language learning** - Different accents
-- **Content accessibility** - Convert text to audio
----
 ## 🔧 Technical Details
-### Voice Embedding Extraction (Cloning)
-```javascript
-// Simplified process
-1. Load audio file
-2. Decode to AudioBuffer
-3. Resample to 16kHz if needed
-4. Convert stereo → mono
-5. Split into 512 chunks
-6. Calculate mean & variance per chunk
-7. Combine to create embedding
-8. Normalize (L2 norm = 1)
 ```
-### Chunking Algorithm
-```javascript
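-function chunkText(text, maxChars = 200) {
-  // Split by sentence boundaries; also keep trailing text with no end punctuation
-  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
-  // Group sentences into chunks ≤ maxChars (a single longer sentence stays whole)
-  const chunks = [];
-  let currentChunk = "";
-  for (const sentence of sentences) {
-    if ((currentChunk + sentence).length <= maxChars) {
-      currentChunk += sentence;
-    } else {
-      // Flush the current chunk (skip it if empty) and start a new one
-      if (currentChunk.trim()) chunks.push(currentChunk.trim());
-      currentChunk = sentence;
-    }
-  }
-  // Flush the final chunk, which the loop itself never pushes
-  if (currentChunk.trim()) chunks.push(currentChunk.trim());
-  return chunks;
-}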
 ```
-### Audio Concatenation
-```javascript
-function concatenateAudio(audioArrays, sampleRate) {
-  // Calculate total length
-  const totalLength = audioArrays.reduce((sum, arr) =>
-    sum + arr.length, 0);
-  // Merge all chunks
-  const result = new Float32Array(totalLength);
-  let offset = 0;
-  for (const arr of audioArrays) {
-    result.set(arr, offset);
-    offset += arr.length;
-  }
-  return result;
-}
-```
----
-## 🌟 Advantages
-✅ **Privacy-Focused** - All processing in your browser
-✅ **No Server Costs** - No backend infrastructure needed
-✅ **Offline Capable** - Works after initial model download
-✅ **Unlimited Usage** - No API limits or quotas
-✅ **Fast Generation** - Optimized chunking for speed
-✅ **High Quality** - Microsoft SpeechT5 architecture
-✅ **Free & Open** - Apache 2.0 license
----
-## 📝 Limitations
-⚠️ **Voice Cloning Accuracy** - Simplified algorithm (not production-grade)
-⚠️ **First Load Time** - ~50MB model download
-⚠️ **Browser Only** - Requires modern web browser
-⚠️ **English Optimized** - Best results with English text
-⚠️ **Memory Usage** - Large texts require more RAM
----
-## 🔍 Comparison
-| Feature | This App | Standard SpeechT5 | Cloud TTS APIs |
-|---------|----------|-------------------|----------------|
-| **Voices** | 7 real + cloning | 1 default | 100+ |
-| **Text Length** | Unlimited | Limited | Varies |
-| **Voice Cloning** | ✅ Yes | ❌ No | ✅ Yes (paid) |
-| **Privacy** | ✅ 100% local | ✅ 100% local | ❌ Cloud |
-| **Cost** | Free | Free | Paid |
-| **Internet** | First load only | First load only | Always |
-| **Chunking** | ✅ Automatic | ❌ Manual | ✅ Handled |
----
-## 🛠️ Development
-### Project Structure
-```
-.
-├── index.html              # Main application
-├── assets/
-│   └── style.css          # Modern UI styling
-├── README.md              # This file
-└── upload_script.py       # Hugging Face upload utility
-```
-### Technology Stack
-- **Frontend:** Vanilla JavaScript (ES6+)
-- **ML Framework:** Transformers.js
-- **Runtime:** ONNX Runtime (WASM)
-- **Audio Processing:** Web Audio API
-- **Model:** Xenova/speecht5_tts
-- **Embeddings:** CMU ARCTIC x-vectors
----
-## 📄 License
-Apache 2.0 - Free for personal and commercial use
----
-## 🙏 Credits
-- **SpeechT5 Model:** Microsoft Research
-- **ONNX Conversion:** Xenova/transformers.js
-- **Speaker Dataset:** CMU ARCTIC
-- **UI Design:** Modern glassmorphism
-- **Voice Cloning:** Web Audio API
----
-## 📚 Resources
-- [Transformers.js Docs](https://huggingface.co/docs/transformers.js)
-- [SpeechT5 Paper](https://arxiv.org/abs/2110.07205)
-- [CMU ARCTIC Dataset](http://www.festvox.org/cmu_arctic/)
-- [Web Audio API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API)
 ---
-**Built with ❤️ using Transformers.js - Bringing AI to the Browser**

 ---
+title: Ultimate TTS Studio - 900+ Premium Voices
 emoji: 🎙️
+colorFrom: blue
 colorTo: purple
 sdk: static
+pinned: true
 license: apache-2.0
 ---
+# 🎙️ Ultimate TTS Studio
+**900+ Premium Voices** from 3 World-Class TTS Engines - All Running in Your Browser!
+## ✨ Features
+### 🎯 3 Premium TTS Engines
+1. **🎯 Piper TTS** - 904 voices across 50+ languages
+   - High-quality multilingual support
+   - Multiple quality levels (High/Medium/Low)
+   - 3-5x realtime generation speed
+2. **✨ Kokoro TTS** - 21 expressive voices (Highest Quality)
+   - 24kHz studio-quality audio
+   - American & British accents
+   - Most natural & expressive
+3. **⚡ Kitten TTS** - 8 voices (Fastest)
+   - Only 24MB model size
+   - Lightning-fast generation
+   - Perfect for quick tasks
+### 🚀 Key Capabilities
+- ✅ **900+ Professional Voices** - Choose from massive variety
+- ✅ **50+ Languages** - Multilingual speech synthesis with Piper
+- ✅ **Unlimited Text Length** - Automatic smart chunking
+- ✅ **WebGPU Acceleration** - Hardware-accelerated when available
+- ✅ **Zero Server Cost** - 100% client-side processing
+- ✅ **Offline Capable** - Works offline once models are cached
+- ✅ **Privacy First** - No data leaves your browser
+- ✅ **Professional Quality** - Up to 24kHz audio output
+## 🎮 How to Use
+### 1. Select Your Engine
+**For Maximum Variety:** Choose **Piper TTS**
+- 904 voices across 50+ languages
+- Select quality level (High/Medium/Low)
+- Pick language and accent
+**For Best Quality:** Choose **Kokoro TTS**
+- 21 expressive voices
+- Studio-quality 24kHz audio
+- Perfect for audiobooks & narration
+**For Speed:** Choose **Kitten TTS**
+- 8 fast voices
+- Lightweight model (24MB)
+- Quick generation
+### 2. Configure Voice
+#### Piper Options:
+- **Quality:** High (22kHz) / Medium (16kHz) / Low (Fast)
+- **Languages:** English (US/GB), Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, + 40 more!
+- **Top Voices:** Lessac, Ryan (US) | Cori, Alan (GB)
+#### Kokoro Options:
+- **American:** Bella, Nicole, Sarah, Sky, Adam, Michael
+- **British:** Emma, Isabella, George, Lewis
+#### Kitten Options:
+- 8 voices (Voice 0-7) with different characteristics
+### 3. Enter Text & Generate
+1. Type or paste your text (unlimited length)
+2. Adjust speed if needed (0.5x - 2.0x)
+3. Click "🎤 Generate Speech"
+4. Wait for generation (watch progress bar)
+5. Play audio or download as WAV
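The "download as WAV" step can be sketched as follows. This is a minimal illustration, not the app's actual code: it assumes the engine returns mono `Float32Array` samples in the range [-1, 1] and that the file is written as 16-bit PCM; `encodeWav` is a hypothetical helper name.

```javascript
// Encode mono Float32 samples in [-1, 1] as a 16-bit PCM WAV file (ArrayBuffer).
// Assumption: mono audio; the real app may use a different layout or bit depth.
function encodeWav(samples, sampleRate) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte canonical header
  const view = new DataView(buffer);
  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    const scaled = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
    view.setInt16(44 + i * bytesPerSample, scaled, true);
  }
  return buffer;
}
```

In a browser, the returned buffer could be wrapped as `new Blob([buffer], { type: "audio/wav" })` and offered for download through `URL.createObjectURL`.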
+## 🌐 Supported Languages
+### Piper TTS - 50+ Languages:
+**Major Languages:**
+- 🇺🇸 English (US) - 20+ voices
+- 🇬🇧 English (UK) - 15+ voices
+- 🇪🇸 Spanish - 30+ voices
+- 🇫🇷 French - 25+ voices
+- 🇩🇪 German - 20+ voices
+- 🇮🇹 Italian - 15+ voices
+- 🇵🇹 Portuguese - 10+ voices
+- 🇨🇳 Chinese - 10+ voices
+- 🇯🇵 Japanese - 5+ voices
+- 🇰🇷 Korean - 5+ voices
+Plus: Dutch, Russian, Polish, Turkish, Arabic, Hindi, Vietnamese, Thai, and many more!
+## 📊 Engine Comparison
+| Feature | Piper | Kokoro | Kitten |
+|---------|-------|--------|--------|
+| **Voices** | 904 | 21 | 8 |
+| **Quality** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
+| **Speed** | Medium | Medium | Fast |
+| **Model Size** | ~50MB | ~80MB | ~24MB |
+| **Languages** | 50+ | English | English |
+| **Sample Rate** | 16-22kHz | 24kHz | 16kHz |
+| **Best For** | Variety | Quality | Speed |
 ## 🎯 Use Cases
+### Content Creation
+- 🎬 Video voiceovers & narration
+- 📚 Audiobook production
+- 🎙️ Podcast intros/outros
+- 📺 YouTube tutorials
 ### Accessibility
+- 👁️ Screen reader alternatives
+- 📖 Reading assistance
+- 🌍 Language learning
+- 📱 Audio content for visually impaired users
+### Development
+- 🤖 Voice UI prototyping
+- 🎮 Game character voices
+- 📞 IVR system testing
+- 💬 Chatbot voice responses
 ## 🔧 Technical Details
+### Technology Stack
+- **Frontend:** Pure HTML5 + JavaScript (ES6+)
+- **TTS Library:** onnx-tts-web
+- **Runtime:** ONNX Runtime Web
+- **Acceleration:** WebGPU / WebAssembly
+- **Audio:** Web Audio API
+### Model Sources
+- **Piper:** [rhasspy/piper-voices](https://huggingface.co/rhasspy/piper-voices)
+- **Kokoro:** [therealtimex/kokoro-tts-web](https://huggingface.co/therealtimex/kokoro-tts-web)
+- **Kitten:** [therealtimex/kitten-tts-web](https://huggingface.co/therealtimex/kitten-tts-web)
+### Browser Requirements
+- **Minimum:** Chrome 90+ / Firefox 88+ / Safari 14+ / Edge 90+
+- **Recommended:** Latest Chrome/Edge with WebGPU enabled
+- **Features Required:** WebAssembly, Web Audio API
+- **Optional:** WebGPU for acceleration
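Because WebGPU is optional, a page would typically probe for it and fall back to WebAssembly. The sketch below shows one way to do that with the standard `navigator.gpu.requestAdapter()` call. The function name `pickBackend` and the returned strings `"webgpu"` / `"wasm"` are illustrative assumptions, not identifiers taken from onnx-tts-web.

```javascript
// Choose an execution backend: "webgpu" when a GPU adapter is available, else "wasm".
// `nav` defaults to the global navigator; passing it in keeps the function testable.
// Hypothetical helper: the real app may detect WebGPU differently.
async function pickBackend(nav = globalThis.navigator) {
  // navigator.gpu is only defined in browsers that expose the WebGPU API.
  if (!nav || !nav.gpu) return "wasm";
  try {
    // requestAdapter() resolves to null when no suitable GPU is found.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any failure during the probe as "WebGPU unavailable".
    return "wasm";
  }
}
```

Calling `await pickBackend()` once at startup, before any model is loaded, keeps the choice in a single place.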
+### Performance
+- **Model Loading:** 5-15 seconds (first time only, then cached)
+- **Generation Speed:** 2-5 seconds per 200 characters
+- **Real-time Factor:** 3-10x (depending on hardware & engine)
+- **Memory Usage:** ~200-500MB (with models loaded)
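The "3-10x" figure reads most naturally as a speed multiple: seconds of audio produced per second of compute. That reading is an assumption, since "real-time factor" is often defined the other way round (compute time divided by audio duration, where lower is better). Under the speed-multiple reading, the number can be measured like this; `speedFactor` is a hypothetical helper name.

```javascript
// Speed multiple = seconds of audio produced per second of generation time.
// Assumption: this is what the README means by "Real-time Factor: 3-10x".
function speedFactor(numSamples, sampleRate, generationSeconds) {
  if (sampleRate <= 0 || generationSeconds <= 0) {
    throw new RangeError("sampleRate and generationSeconds must be positive");
  }
  const audioSeconds = numSamples / sampleRate;
  return audioSeconds / generationSeconds;
}

// Hypothetical run: 12 s of 24 kHz audio generated in 3 s.
console.log(speedFactor(12 * 24000, 24000, 3)); // 4
```

In a browser, `generationSeconds` could be taken from two `performance.now()` readings around the synthesis call, divided by 1000.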
+## 💡 Performance Tips
+### For Best Quality:
+1. Use **Kokoro TTS** for English content
+2. Select **High Quality** in Piper settings
+3. Use well-punctuated text
+4. Keep sentences moderate length
+### For Best Speed:
+1. Use **Kitten TTS** for quick tasks
+2. Select **Low Quality** in Piper
+3. Enable WebGPU in browser settings
+4. Use shorter text inputs
+### For Most Options:
+1. Use **Piper TTS** for language variety
+2. Explore different accents/regions
+3. Compare quality levels
+4. Try multiple voices for same language
+## 🎬 Quick Start Examples
+### Example 1: Professional Audiobook
+```
+Engine: Kokoro TTS
+Voice: Bella (American Female)
+Speed: 0.95x
+Quality: 24kHz
+Text: Your book chapter...
+```
+### Example 2: Tutorial Narration
+```
+Engine: Piper TTS
+Voice: Lessac (US, High Quality)
+Speed: 1.0x
+Quality: 22kHz
+Text: Your tutorial script...
 ```
+### Example 3: Quick Announcement
+```
+Engine: Kitten TTS
+Voice: Voice 4 (Clear)
+Speed: 1.1x
+Text: Your announcement...
 ```
+### Example 4: Spanish Content
+```
+Engine: Piper TTS
+Voice: es_ES (Spain Spanish)
+Speed: 1.0x
+Quality: High
+Text: Su texto en español...
+```
+## 🐛 Troubleshooting
+### Model Loading Issues
+**Problem:** "ERROR initializing" message
+**Solutions:**
+- Check internet connection
+- Wait for download to complete
+- Try different quality level
+- Clear browser cache
+- Refresh page
+### No Audio Output
+**Problem:** Player appears but no sound
+**Solutions:**
+- Check browser audio permissions
+- Verify volume settings
+- Try different voice/engine
+- Check browser console (F12)
+- Test with different browser
+### Slow Performance
+**Problem:** Generation takes too long
+**Solutions:**
+- Switch to Kitten TTS for speed
+- Lower quality in Piper settings
+- Enable WebGPU (`chrome://flags`)
+- Update browser to latest version
+- Close other tabs/applications
+### WebGPU Not Available
+**Problem:** Shows "WASM" instead of "WebGPU"
+**Solutions:**
+- Update browser to latest version
+- Enable in `chrome://flags` → "WebGPU"
+- Check GPU driver updates
+- WebGPU is optional; the WASM backend works without it
+## 🎯 Voice Recommendations
+### English (US) - Natural:
+- **Lessac** (Piper) - Professional, clear
+- **Ryan** (Piper) - Authoritative, deep
+- **Bella** (Kokoro) - Elegant, sophisticated
+### English (GB) - British:
+- **Cori** (Piper) - Refined, professional
+- **Emma** (Kokoro) - Elegant, polished
+- **George** (Kokoro) - Commanding, distinguished
+### Spanish:
+- **es_ES** (Piper) - Spain Spanish, multiple voices
+- **es_MX** (Piper) - Mexican Spanish
+### French:
+- **fr_FR** (Piper) - France French, multiple voices
+### German:
+- **de_DE** (Piper) - German, multiple voices
+## 📝 Privacy & Security
+✅ **100% Client-Side** - All processing in your browser
+✅ **No Server Upload** - Text never leaves your device
+✅ **No Data Collection** - Zero analytics or tracking
+✅ **No Account Required** - Use instantly, no signup
+✅ **Offline Capable** - Works without internet (after models are cached)
+## 📜 License & Credits
+### License
+This project is released under the **Apache 2.0 License**.
+### Credits & Acknowledgments
+**Libraries & Tools:**
+- [onnx-tts-web](https://github.com/therealtimex/onnx-tts-web) by @therealtimex
+- [Piper TTS](https://github.com/rhasspy/piper) by Rhasspy
+- [ONNX Runtime](https://onnxruntime.ai/) by Microsoft
+**Models:**
+- Piper TTS models by Rhasspy team
+- Kokoro TTS by community contributors
+- Kitten TTS by community contributors
+**Inspiration:**
+- [TTS Studio](https://github.com/clowerweb/tts-studio) by @clowerweb
+## 🚀 Future Enhancements
+Planned features:
+- [ ] More TTS engines (Coqui, VITS)
+- [ ] Voice cloning with SpeechT5
+- [ ] SSML markup support
+- [ ] Batch processing
+- [ ] MP3/OGG export
+- [ ] Voice mixing/blending
+- [ ] Real-time streaming
+- [ ] Pronunciation dictionary
+## 🤝 Contributing
+Found a bug or have a suggestion? Please open an issue or submit a pull request!
+## 🌟 Star This Space!
+If you find this useful, please give it a ⭐ star on Hugging Face!
 ---
+**Made with ❤️ for the open-source community**
+**Enjoy creating amazing voice content! 🎙️**