Spaces: WSYBYT/ybtts

Complete Solution: Advanced TTS with Real Voices + Voice Cloning

#12

by masbudjj - opened Oct 22, 2025

base: refs/heads/main ← from: refs/pr/12

README.md +275 -126

@@ -1,5 +1,5 @@
 ---
-title: Multi-Voice TTS - 24 Unique Voices
 emoji: 🎙️
 colorFrom: indigo
 colorTo: purple
@@ -8,178 +8,317 @@ pinned: false
 license: apache-2.0
 ---
-# 🎙️ Multi-Voice Text-to-Speech
-**24 Unique Voices - 100% Browser-Based - No Server Required**
-## ✨ Features
-### 🎭 24 Unique Voice Characters
-#### 🇺🇸 American Female (6 voices)
-- **Default** - Neutral baseline
-- **Warm** - Friendly & caring
-- **Bright** - Energetic & happy
-- **Soft** - Gentle & calm
-- **Clear** - Professional
-- **Smooth** - Elegant
-#### 🇺🇸 American Male (6 voices)
-- **Default** - Neutral baseline
-- **Deep** - Authoritative
-- **Friendly** - Approachable
-- **Strong** - Confident
-- **Calm** - Relaxed
-- **Professional** - Business-oriented
-#### 🇬🇧 British Female (4 voices)
-- **Refined** - Elegant
-- **Bright** - Cheerful
-- **Soft** - Gentle
-- **Clear** - Articulate
-#### 🇬🇧 British Male (4 voices)
-- **Distinguished** - Formal
-- **Smooth** - Sophisticated
-- **Warm** - Friendly
-- **Strong** - Commanding
-#### 🌏 International (4 voices)
-- **Neutral** - Standard
-- **Soft** - Gentle
-- **Clear** - Professional
-- **Warm** - Friendly
 ---
-## 🎨 Voice Customization
-Each voice can be further customized with:
-- **Pitch Control** (0.5x - 1.5x) - Adjust voice pitch
-- **Energy Control** (0.5x - 1.5x) - Modify speaking energy
-- **Speed Control** (0.5x - 2.0x) - Playback speed
-**Total Combinations:** 24 voices × unlimited pitch/energy variations = **Infinite possibilities!**
 ---
-## 🏗️ Technology
-### Base Model
-- **SpeechT5** from Microsoft
-- **ONNX Runtime** for browser execution
-- **WebAssembly** backend
-### Voice Generation
-Each of the 24 voices is created by:
-1. Taking base speaker embedding (512-dim)
-2. Applying pitch transformation
-3. Modulating energy levels
-4. Spectral shaping for character
-5. Prosody adjustment
-6. Normalization
 ---
-## 🚀 Features
-✅ **24 Unique Voices** - Diverse characters
-✅ **100% Browser-Based** - No server needed
-✅ **Voice Customization** - Pitch & energy controls
-✅ **Fast Generation** - 2-5 seconds
-✅ **High Quality** - SpeechT5 architecture
-✅ **Offline Capable** - Works after first load
-✅ **Privacy Focused** - No data sent to servers
-✅ **Free & Open Source** - Apache 2.0 license
 ---
-## 💻 How It Works
-### Voice Profile System
 ```javascript
-const VOICE_PROFILES = {
-  af_warm: {
-    pitch: 0.95,    // Slightly lower
-    energy: 1.1,    // More energetic
-    spectral: 0.2   // Brighter tone
-  },
-  am_deep: {
-    pitch: 0.7,     // Much lower
-    energy: 1.1,    // Strong
-    spectral: -0.5  // Darker tone
-  },
-  // ... 24 total profiles
-};
 ```
-### Generation Process
 ```
-User Input Text
-     ↓
-Select Voice Profile
-     ↓
-Load Base Speaker Embedding
-     ↓
-Apply Transformations:
-  - Pitch modification
-  - Energy modulation
-  - Spectral shaping
-  - User adjustments (pitch/energy sliders)
-     ↓
-Normalize Embedding
-     ↓
-SpeechT5 Generation
-     ↓
-WAV Output
 ```
 ---
-## 🎯 Use Cases
-**Professional/Corporate:**
-- af_clear, am_professional, bf_clear, bm_distinguished
-**Friendly/Casual:**
-- af_warm, am_friendly, bf_bright, int_warm
-**Storytelling/Narration:**
-- af_smooth, am_calm, bf_refined, bm_smooth
-**Energetic/Marketing:**
-- af_bright, am_strong, bf_bright
 ---
-## 📊 Comparison
-| Feature | This App | SpeechT5 Basic | Kokoro-82M |
-|---------|----------|----------------|------------|
-| **Voices** | 24 | 1 | 54 |
-| **Browser** | ✅ Yes | ✅ Yes | ❌ No |
-| **Customization** | ✅ Pitch/Energy | ❌ Limited | ✅ Yes |
-| **Server** | ❌ Not needed | ❌ Not needed | ✅ Required |
-| **Speed** | ⚡ Fast | ⚡ Fast | ⏱️ Medium |
 ---
-## 🔧 Technical Details
-**Model:** Xenova/speecht5_tts
-**Size:** ~50MB (cached after first load)
-**Format:** ONNX (quantized)
-**Sample Rate:** 16kHz
-**Output:** WAV (16-bit PCM)
-**Voice Embedding:** 512-dimensional vector
-**Transformations:** Pitch, energy, spectral
-**Normalization:** Z-score (mean=0, std=1)
 ---
-## 📝 License
 Apache 2.0 - Free for personal and commercial use
@@ -187,11 +326,21 @@ Apache 2.0 - Free for personal and commercial use
 ## 🙏 Credits
-- **Base Model:** Microsoft SpeechT5
 - **ONNX Conversion:** Xenova/transformers.js
-- **Voice Profiles:** Custom implementation
-- **UI:** Modern glassmorphism design
 ---
-**Built with ❤️ using Transformers.js**

 ---
+title: Advanced TTS - Real Voices + Voice Cloning
 emoji: 🎙️
 colorFrom: indigo
 colorTo: purple
 license: apache-2.0
 ---
+# 🎙️ Advanced Text-to-Speech System
+**7 Authentic Voices + Voice Cloning + Unlimited Text - 100% Browser-Based**
+## ✨ Key Features
+### 🎭 Dual Voice Modes
+#### 📚 Preset Voices (7 Authentic Speakers)
+Real speaker embeddings from the CMU ARCTIC dataset:
+**🇺🇸 American Voices:**
+- **Sarah (slt)** - Female, Clear & Professional
+- **Clara (clb)** - Female, Warm & Friendly
+- **Ben (bdl)** - Male, Deep & Authoritative
+- **Robert (rms)** - Male, Calm & Relaxed
+**🌍 International Voices:**
+- **Andrew (awb)** - Scottish Male, Distinguished
+- **James (jmk)** - Canadian Male, Friendly
+- **Kiran (ksp)** - Indian Male, Professional
+#### 🎤 Voice Cloning Mode
+Upload your own voice sample (up to 1 minute) and the system will:
+- Extract voice characteristics
+- Auto-compress large files
+- Resample to 16 kHz (the sample rate used for output)
+- Convert stereo to mono
+- Generate 512-dim voice embedding
+**Supported formats:** WAV, MP3
+**Max duration:** 60 seconds (auto-trim)
+**Processing:** Automatic compression & resampling
 ---
+## 📝 Unlimited Text Processing
+### Smart Chunking System
+- **Automatic splitting** - Intelligently splits by sentences
+- **Up to 200 characters per chunk** - Chosen to balance quality and speed
+- **Seamless concatenation** - Merges all chunks into single audio
+- **Real-time progress** - Track each chunk being processed
+**No character limits!** Type as much text as you want.
+---
+## 🎨 Advanced Features
+### ⚙️ Audio Controls
+- **Speed Control** - 0.5x to 2.0x playback speed
+- **Real-time adjustment** - Change speed during playback
+### 📊 Live Monitoring
+- **Character counter** - Total text length
+- **Word counter** - Word count
+- **Chunk calculator** - Estimated processing chunks
+- **Progress bar** - Visual generation progress
+- **Activity log** - Detailed processing steps
+### 💾 Download & Playback
+- **Browser audio player** - Built-in controls
+- **WAV format** - High-quality 16-bit PCM
+- **Download option** - Save generated audio
 ---
+## 🏗️ Technical Architecture
+### Model & Runtime
+- **Base Model:** Microsoft SpeechT5 (Xenova/speecht5_tts)
+- **Runtime:** ONNX Runtime (WebAssembly)
+- **Framework:** Transformers.js 3.1.2
+- **Execution:** 100% client-side (no server)
+### Voice System
+- **Speaker Embeddings:** 512-dimensional x-vectors
+- **Dataset:** CMU ARCTIC (7 speakers)
+- **Cloning:** Web Audio API + spectral analysis
+- **Format:** Float32Array, normalized
+### Audio Processing
+```
+Input Audio
+    ↓
+Duration Check (trim if > 60s)
+    ↓
+Resample to 16kHz
+    ↓
+Convert to Mono
+    ↓
+Extract Features (mean, variance, spectral)
+    ↓
+Generate 512-dim Embedding
+    ↓
+Normalize (L2 norm)
+    ↓
+Ready for TTS
+```
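
The preprocessing steps in the diagram above (trim, resample, downmix) can be sketched as plain functions over `Float32Array` channel data. This is a minimal sketch under stated assumptions, not the app's actual code: the names `trimToMaxDuration`, `resampleLinear`, `toMono`, and `preprocess` are hypothetical, linear interpolation stands in for whatever resampler the app uses, and averaging channels is one common way to downmix.

```javascript
const TARGET_RATE = 16000; // sample rate stated in the README
const MAX_SECONDS = 60;    // maximum sample duration stated in the README

// Keep at most maxSeconds of audio from one channel.
function trimToMaxDuration(samples, sampleRate, maxSeconds = MAX_SECONDS) {
  const maxSamples = Math.floor(sampleRate * maxSeconds);
  return samples.length > maxSamples ? samples.subarray(0, maxSamples) : samples;
}

// Resample one channel with linear interpolation (assumed method).
function resampleLinear(samples, fromRate, toRate = TARGET_RATE) {
  if (fromRate === toRate || samples.length === 0) return samples;
  const outLength = Math.max(1, Math.round(samples.length * toRate / fromRate));
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), samples.length - 1);
    const right = Math.min(left + 1, samples.length - 1);
    const frac = pos - left;
    out[i] = samples[left] * (1 - frac) + samples[right] * frac;
  }
  return out;
}

// Average equal-length channels into a single mono channel.
function toMono(channels) {
  if (channels.length === 1) return channels[0];
  const mono = new Float32Array(channels[0].length);
  for (const channel of channels) {
    for (let i = 0; i < mono.length; i++) mono[i] += channel[i] / channels.length;
  }
  return mono;
}

// Run the steps in the order the diagram lists them.
function preprocess(channels, sampleRate) {
  const prepared = channels.map((channel) =>
    resampleLinear(trimToMaxDuration(channel, sampleRate), sampleRate));
  return toMono(prepared);
}
```

In a browser, the `channels` array would typically be filled from `AudioBuffer.getChannelData(i)` after decoding the upload with `decodeAudioData`.
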
+### Text Processing Pipeline
+```
+User Input Text
+    ↓
+Split by Sentences
+    ↓
+Group into 200-char Chunks
+    ↓
+Process Each Chunk:
+  - Generate with TTS
+  - Use selected voice embedding
+  - Update progress
+    ↓
+Concatenate All Audio
+    ↓
+Encode to WAV
+    ↓
+Present to User
+```
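
The "Encode to WAV" step can be illustrated with a small encoder for the format the README names: mono, 16-bit PCM, 16 kHz. This is a minimal sketch, not the app's own encoder; `encodeWav` is a hypothetical name, and the layout is the standard 44-byte RIFF/WAVE header followed by little-endian samples.

```javascript
// Encode mono float samples in [-1, 1] as a 16-bit PCM WAV file.
function encodeWav(samples, sampleRate = 16000) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);   // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // size of the fmt chunk
  view.setUint16(20, 1, true);              // audio format 1 = PCM
  view.setUint16(22, 1, true);              // channel count (mono)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range values do not wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Scale to the signed 16-bit range: negatives reach -32768, positives 32767.
    const value = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(44 + i * bytesPerSample, Math.round(value), true);
  }
  return buffer;
}
```

In a browser, the returned buffer can be wrapped as `new Blob([buffer], { type: "audio/wav" })` and passed to `URL.createObjectURL` for playback or download.
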
+---
+## How It Works
+### Preset Voice Generation
+1. Select voice from dropdown (e.g., "Sarah - Female")
+2. Enter text (unlimited length)
+3. Click "Generate Speech"
+4. System splits text into chunks
+5. Processes each chunk with selected voice
+6. Concatenates all audio
+7. Presents final WAV file
+### Voice Cloning Workflow
+1. Switch to "Voice Clone" mode
+2. Upload voice sample (WAV/MP3, max 60s)
+3. Click "Process Voice Sample"
+4. System extracts voice characteristics
+5. Enter text to generate
+6. Click "Generate Speech"
+7. Your voice clone reads the text!
+---
+## 💻 Browser Requirements
+**Minimum Requirements:**
+- Modern browser (Chrome 90+, Firefox 88+, Safari 14+)
+- JavaScript enabled
+- ~100MB RAM for model
+- ~50MB storage for model cache
+**Optimal Experience:**
+- Chrome/Edge with WebGPU support
+- 4GB+ RAM
+- Fast internet (first load only)
 ---
+## 📊 Performance
+| Metric | Value |
+|--------|-------|
+| **Model Size** | ~50MB (cached after first load) |
+| **Voice Load Time** | ~5-10s (first time only) |
+| **Generation Speed** | ~2-5s per 200 chars |
+| **Sample Rate** | 16kHz |
+| **Audio Format** | WAV (16-bit PCM) |
+| **Max Text Length** | Unlimited (chunked) |
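
As a worked example of the figures in this table, the sketch below turns a character count into a rough time range. It is only an estimate under stated assumptions: `estimateGenerationSeconds` is a hypothetical helper, the 2-5 s range is taken directly from the table, and because chunks are split at sentence boundaries the real chunk count can be higher than `ceil(chars / 200)`.

```javascript
// Rough estimate from the table's figures: 200 chars per chunk, 2-5 s per chunk.
function estimateGenerationSeconds(charCount, maxChars = 200) {
  const chunks = Math.ceil(charCount / maxChars);
  return { chunks, minSeconds: chunks * 2, maxSeconds: chunks * 5 };
}

const estimate = estimateGenerationSeconds(1000);
// estimate is { chunks: 5, minSeconds: 10, maxSeconds: 25 }
```
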
 ---
+## 🎯 Use Cases
+### Professional
+- **Corporate videos** - Ben (authoritative), Robert (calm)
+- **Training materials** - Sarah (clear), Kiran (professional)
+- **Presentations** - Clara (warm), James (friendly)
+### Creative
+- **Audiobooks** - Andrew (distinguished), Robert (relaxed)
+- **Podcasts** - Use voice cloning for consistency
+- **Voice-overs** - Multiple character voices
+### Accessibility
+- **Screen readers** - Clear, natural voices
+- **Language learning** - Different accents
+- **Content accessibility** - Convert text to audio
+---
+## 🔧 Technical Details
+### Voice Embedding Extraction (Cloning)
 ```javascript
+// Simplified process
+// 1. Load audio file
+// 2. Decode to AudioBuffer
+// 3. Resample to 16kHz if needed
+// 4. Convert stereo → mono
+// 5. Split into 512 chunks
+// 6. Calculate mean & variance per chunk
+// 7. Combine to create embedding
+// 8. Normalize (L2 norm = 1)
 ```
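
Steps 5 to 8 above can be sketched as follows. The README does not say how the per-chunk mean and variance are combined, so this sketch assumes one value per segment, computed as mean plus standard deviation; that choice and the names `extractEmbedding` and `l2Normalize` are hypothetical. Statistics like these are not equivalent to x-vectors produced by a trained speaker-encoder network, which fits the README's own note that cloning accuracy is limited.

```javascript
const EMBEDDING_SIZE = 512; // embedding dimension stated in the README

// Scale a vector to unit L2 norm; an all-zero vector is returned unchanged.
function l2Normalize(vector) {
  let sumSquares = 0;
  for (const v of vector) sumSquares += v * v;
  const norm = Math.sqrt(sumSquares);
  return norm === 0 ? vector : vector.map((v) => v / norm);
}

// Build a 512-value embedding from mono samples (assumed already at 16 kHz).
function extractEmbedding(samples, size = EMBEDDING_SIZE) {
  const embedding = new Float32Array(size);
  const segmentLength = samples.length / size; // may be fractional
  for (let i = 0; i < size; i++) {
    const start = Math.floor(i * segmentLength);
    const end = Math.max(start + 1, Math.floor((i + 1) * segmentLength));
    let sum = 0;
    let sumSquares = 0;
    let count = 0;
    for (let j = start; j < end && j < samples.length; j++) {
      sum += samples[j];
      sumSquares += samples[j] * samples[j];
      count++;
    }
    if (count === 0) continue; // input shorter than `size` samples
    const mean = sum / count;
    const variance = sumSquares / count - mean * mean;
    // Assumed combination: mean plus standard deviation of the segment.
    embedding[i] = mean + Math.sqrt(Math.max(0, variance));
  }
  return l2Normalize(embedding);
}
```

The result is a `Float32Array` of length 512 with unit L2 norm, matching the shape and normalization the README describes for speaker embeddings.
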
+### Chunking Algorithm
+```javascript
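+function chunkText(text, maxChars = 200) {
+  // Split by sentence boundaries. The second alternative keeps trailing
+  // text that has no closing punctuation; match() returns null on no match.
+  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
+  // Group sentences into chunks ≤ maxChars
+  // (a single sentence longer than maxChars becomes its own chunk)
+  const chunks = [];
+  let currentChunk = "";
+  for (const sentence of sentences) {
+    if ((currentChunk + sentence).length <= maxChars) {
+      currentChunk += sentence;
+    } else {
+      // Avoid pushing an empty chunk when the first sentence is too long
+      if (currentChunk.trim()) chunks.push(currentChunk.trim());
+      currentChunk = sentence;
+    }
+  }
+  // Flush the final chunk, which the loop itself never pushes
+  if (currentChunk.trim()) chunks.push(currentChunk.trim());
+  return chunks;
+}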
 ```
+### Audio Concatenation
+```javascript
+function concatenateAudio(audioArrays, sampleRate) {
+  // Calculate total length
+  const totalLength = audioArrays.reduce((sum, arr) =>
+    sum + arr.length, 0);
+  // Merge all chunks
+  const result = new Float32Array(totalLength);
+  let offset = 0;
+  for (const arr of audioArrays) {
+    result.set(arr, offset);
+    offset += arr.length;
+  }
+  return result;
+}
 ```
 ---
+## 🌟 Advantages
+✅ **Privacy-Focused** - All processing in your browser
+✅ **No Server Costs** - No backend infrastructure needed
+✅ **Offline Capable** - Works after initial model download
+✅ **Unlimited Usage** - No API limits or quotas
+✅ **Fast Generation** - Optimized chunking for speed
+✅ **High Quality** - Microsoft SpeechT5 architecture
+✅ **Free & Open** - Apache 2.0 license
+---
+## 📝 Limitations
+⚠️ **Voice Cloning Accuracy** - Simplified algorithm (not production-grade)
+⚠️ **First Load Time** - ~50MB model download
+⚠️ **Browser Only** - Requires modern web browser
+⚠️ **English Optimized** - Best results with English text
+⚠️ **Memory Usage** - Large texts require more RAM
 ---
+## 🔍 Comparison
+| Feature | This App | Standard SpeechT5 | Cloud TTS APIs |
+|---------|----------|-------------------|----------------|
+| **Voices** | 7 real + cloning | 1 default | 100+ |
+| **Text Length** | Unlimited | Limited | Varies |
+| **Voice Cloning** | ✅ Yes | ❌ No | ✅ Yes (paid) |
+| **Privacy** | ✅ 100% local | ✅ 100% local | ❌ Cloud |
+| **Cost** | Free | Free | Paid |
+| **Internet** | First load only | First load only | Always |
+| **Chunking** | ✅ Automatic | ❌ Manual | ✅ Handled |
 ---
+## 🛠️ Development
+### Project Structure
+```
+.
+├── index.html              # Main application
+├── assets/
+│   └── style.css          # Modern UI styling
+├── README.md              # This file
+└── upload_script.py       # Hugging Face upload utility
+```
+### Technology Stack
+- **Frontend:** Vanilla JavaScript (ES6+)
+- **ML Framework:** Transformers.js
+- **Runtime:** ONNX Runtime (WASM)
+- **Audio Processing:** Web Audio API
+- **Model:** Xenova/speecht5_tts
+- **Embeddings:** CMU ARCTIC x-vectors
 ---
+## 📄 License
 Apache 2.0 - Free for personal and commercial use
 ## 🙏 Credits
+- **SpeechT5 Model:** Microsoft Research
 - **ONNX Conversion:** Xenova/transformers.js
+- **Speaker Dataset:** CMU ARCTIC
+- **UI Design:** Modern glassmorphism
+- **Voice Cloning:** Web Audio API
+---
+## 📚 Resources
+- [Transformers.js Docs](https://huggingface.co/docs/transformers.js)
+- [SpeechT5 Paper](https://arxiv.org/abs/2110.07205)
+- [CMU ARCTIC Dataset](http://www.festvox.org/cmu_arctic/)
+- [Web Audio API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API)
 ---
+**Built with ❤️ using Transformers.js - Bringing AI to the Browser**