title: Advanced TTS - Real Voices + Voice Cloning
emoji: 🎙️
colorFrom: indigo
colorTo: purple
sdk: static
pinned: false
license: apache-2.0
🎙️ Advanced Text-to-Speech System
7 Authentic Voices + Voice Cloning + Unlimited Text - 100% Browser-Based
✨ Key Features
Dual Voice Modes
Preset Voices (7 Authentic Speakers)
Real speaker embeddings from the CMU ARCTIC dataset:
🇺🇸 American Voices:
- Sarah (slt) - Female, Clear & Professional
- Clara (clb) - Female, Warm & Friendly
- Ben (bdl) - Male, Deep & Authoritative
- Robert (rms) - Male, Calm & Relaxed
🌍 International Voices:
- Andrew (awb) - Scottish Male, Distinguished
- James (jmk) - Canadian Male, Friendly
- Kiran (ksp) - Indian Male, Professional
🎤 Voice Cloning Mode
Upload your own voice sample (up to 1 minute) and the system will:
- Extract voice characteristics
- Auto-compress large files
- Resample to optimal quality (16kHz)
- Convert stereo to mono
- Generate 512-dim voice embedding
- Supported formats: WAV, MP3
- Max duration: 60 seconds (longer samples are trimmed automatically)
- Processing: Automatic compression & resampling
Unlimited Text Processing
Smart Chunking System
- Automatic splitting - Intelligently splits by sentences
- 200 chars per chunk - Optimal for quality & speed
- Seamless concatenation - Merges all chunks into single audio
- Real-time progress - Track each chunk being processed
No character limits! Type as much text as you want.
🎨 Advanced Features
⚙️ Audio Controls
- Speed Control - 0.5x to 2.0x playback speed
- Real-time adjustment - Change speed during playback
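A minimal sketch of how the speed control could be wired, assuming the player is a standard `<audio>` element. `playbackRate` is the standard media element property and takes effect while audio is playing; `clampSpeed` and `applySpeed` are hypothetical helper names, not taken from the app's source.

```javascript
// Speed limits from the feature list above.
const MIN_SPEED = 0.5;
const MAX_SPEED = 2.0;

// Hypothetical helper: coerce slider input to a number inside the allowed range.
function clampSpeed(value) {
  const n = Number(value);
  if (!Number.isFinite(n)) return 1.0; // fall back to normal speed on bad input
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, n));
}

// Hypothetical helper: `player` is anything with a playbackRate property,
// such as an HTMLAudioElement.
function applySpeed(player, value) {
  player.playbackRate = clampSpeed(value);
  return player.playbackRate;
}
```

In a page this might be called from a slider's `input` event, for example `applySpeed(document.querySelector("audio"), slider.value)`.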
Live Monitoring
- Character counter - Total text length
- Word counter - Word count
- Chunk calculator - Estimated processing chunks
- Progress bar - Visual generation progress
- Activity log - Detailed processing steps
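The counters above can be sketched as one small function. This is an illustration only: `textStats` is a hypothetical name, and the chunk estimate simply divides the text length by the 200-character chunk size, so the real count can be higher once chunks are cut at sentence boundaries.

```javascript
// Hypothetical helper: computes the values shown by the live counters.
function textStats(text, maxChars = 200) {
  const trimmed = text.trim();
  const characters = text.length;
  // Splitting an empty string would yield [""], so handle that case first.
  const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  // Lower-bound estimate: assumes every chunk is filled completely.
  const estimatedChunks = Math.ceil(trimmed.length / maxChars);
  return { characters, words, estimatedChunks };
}
```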
💾 Download & Playback
- Browser audio player - Built-in controls
- WAV format - High-quality 16-bit PCM
- Download option - Save generated audio
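For reference, a sketch of how mono `Float32Array` samples can be packed into a 16-bit PCM WAV file. It writes the standard 44-byte RIFF header by hand; `encodeWav` is a hypothetical name and the app's own encoder may differ.

```javascript
// Hypothetical helper: encode mono float samples (-1..1) as 16-bit PCM WAV.
function encodeWav(samples, sampleRate = 16000) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                // file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                          // fmt chunk size
  view.setUint16(20, 1, true);                           // format 1 = PCM
  view.setUint16(22, 1, true);                           // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

In the browser, the returned buffer can be wrapped in `new Blob([buffer], { type: "audio/wav" })` and passed to `URL.createObjectURL` to feed both the audio player and a download link.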
🏗️ Technical Architecture
Model & Runtime
- Base Model: Microsoft SpeechT5 (Xenova/speecht5_tts)
- Runtime: ONNX Runtime (WebAssembly)
- Framework: Transformers.js 3.1.2
- Execution: 100% client-side (no server)
Voice System
- Speaker Embeddings: 512-dimensional x-vectors
- Dataset: CMU ARCTIC (7 speakers)
- Cloning: Web Audio API + spectral analysis
- Format: Float32Array, normalized
Audio Processing
Input Audio
↓
Duration Check (trim if > 60s)
↓
Resample to 16kHz
↓
Convert to Mono
↓
Extract Features (mean, variance, spectral)
↓
Generate 512-dim Embedding
↓
Normalize (L2 norm)
↓
Ready for TTS
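The first three stages of this pipeline (trim, resample, mono) can be sketched on raw channel data as below. This is an assumption-laden illustration: `prepareSample` is a hypothetical name, the input is taken to be one `Float32Array` per channel (as returned by `AudioBuffer.getChannelData`), and resampling uses plain linear interpolation with no anti-aliasing filter. The sketch downmixes before resampling, which gives the same result for linear interpolation and halves the work.

```javascript
const TARGET_RATE = 16000; // Hz, the rate the pipeline resamples to
const MAX_SECONDS = 60;    // longer samples are trimmed

// Hypothetical helper: channels is an array of Float32Array, one per channel.
function prepareSample(channels, sampleRate) {
  // 1. Duration check: keep at most MAX_SECONDS of audio.
  const frames = Math.min(channels[0].length, Math.floor(MAX_SECONDS * sampleRate));

  // 2. Downmix to mono by averaging the channels.
  const mono = new Float32Array(frames);
  for (const channel of channels) {
    for (let i = 0; i < frames; i++) mono[i] += channel[i] / channels.length;
  }

  // 3. Resample to TARGET_RATE with linear interpolation.
  if (sampleRate === TARGET_RATE) return mono;
  const outLength = Math.floor((frames * TARGET_RATE) / sampleRate);
  const out = new Float32Array(outLength);
  const step = sampleRate / TARGET_RATE;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, frames - 1);
    const frac = pos - left;
    out[i] = mono[left] * (1 - frac) + mono[right] * frac;
  }
  return out;
}
```

A browser implementation could instead let the Web Audio API resample by rendering the decoded buffer through an `OfflineAudioContext` created at 16 kHz, which typically gives better quality than linear interpolation.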
Text Processing Pipeline
User Input Text
↓
Split by Sentences
↓
Group into 200-char Chunks
↓
Process Each Chunk:
- Generate with TTS
- Use selected voice embedding
- Update progress
↓
Concatenate All Audio
↓
Encode to WAV
↓
Present to User
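The per-chunk loop in this pipeline might look like the sketch below. `synthesizeChunks` is a hypothetical name, and `synthesize` stands in for the TTS call: the sketch assumes it accepts `(text, { speaker_embeddings })` and resolves to `{ audio, sampling_rate }`, which matches how a Transformers.js text-to-speech pipeline is normally called, but treat that call shape as an assumption rather than a description of this app's code.

```javascript
// Hypothetical helper: run TTS over each chunk in order and merge the audio.
// `synthesize` is injected so the loop can be exercised without loading a model.
async function synthesizeChunks(chunks, synthesize, speakerEmbedding, onProgress = () => {}) {
  const parts = [];
  let sampleRate = 16000; // replaced by whatever rate the model reports

  for (let i = 0; i < chunks.length; i++) {
    const { audio, sampling_rate } = await synthesize(chunks[i], {
      speaker_embeddings: speakerEmbedding,
    });
    parts.push(audio);
    sampleRate = sampling_rate;
    onProgress(i + 1, chunks.length); // e.g. update the progress bar
  }

  // Concatenate all chunk audio into one buffer.
  const total = parts.reduce((sum, part) => sum + part.length, 0);
  const merged = new Float32Array(total);
  let offset = 0;
  for (const part of parts) {
    merged.set(part, offset);
    offset += part.length;
  }
  return { audio: merged, sampling_rate: sampleRate };
}
```

Processing chunks sequentially keeps peak memory low and makes progress reporting straightforward, at the cost of not overlapping work between chunks.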
How It Works
Preset Voice Generation
- Select voice from dropdown (e.g., "Sarah - Female")
- Enter text (unlimited length)
- Click "Generate Speech"
- System splits text into chunks
- Processes each chunk with selected voice
- Concatenates all audio
- Presents final WAV file
Voice Cloning Workflow
- Switch to "Voice Clone" mode
- Upload voice sample (WAV/MP3, max 60s)
- Click "Process Voice Sample"
- System extracts voice characteristics
- Enter text to generate
- Click "Generate Speech"
- Your voice clone reads the text!
💻 Browser Requirements
Minimum Requirements:
- Modern browser (Chrome 90+, Firefox 88+, Safari 14+)
- JavaScript enabled
- ~100MB RAM for model
- ~50MB storage for model cache
Optimal Experience:
- Chrome/Edge with WebGPU support
- 4GB+ RAM
- Fast internet (first load only)
Performance
| Metric | Value |
|---|---|
| Model Size | ~50MB (cached after first load) |
| Voice Load Time | ~5-10s (first time only) |
| Generation Speed | ~2-5s per 200 chars |
| Sample Rate | 16kHz |
| Audio Format | WAV (16-bit PCM) |
| Max Text Length | Unlimited (chunked) |
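As a rough worked example of what these figures imply, the sketch below turns the table's numbers into a time and file-size estimate. Both function names are hypothetical, and the results are only as good as the approximate values in the table.

```javascript
// Hypothetical helper: generation time range from the ~2-5 s per 200 chars figure.
function estimateGenerationSeconds(charCount, maxChars = 200) {
  const chunks = Math.ceil(charCount / maxChars);
  return { chunks, minSeconds: chunks * 2, maxSeconds: chunks * 5 };
}

// Hypothetical helper: size of a mono 16-bit WAV at 16 kHz, including the 44-byte header.
function wavSizeBytes(durationSeconds, sampleRate = 16000) {
  return 44 + Math.round(durationSeconds * sampleRate) * 2;
}
```

For instance, a 1,000-character text maps to 5 chunks and roughly 10 to 25 seconds of generation, and one minute of output audio is about 1.9 MB as WAV.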
🎯 Use Cases
Professional
- Corporate videos - Ben (authoritative), Robert (calm)
- Training materials - Sarah (clear), Kiran (professional)
- Presentations - Clara (warm), James (friendly)
Creative
- Audiobooks - Andrew (distinguished), Robert (relaxed)
- Podcasts - Use voice cloning for consistency
- Voice-overs - Multiple character voices
Accessibility
- Screen readers - Clear, natural voices
- Language learning - Different accents
- Content accessibility - Convert text to audio
🔧 Technical Details
Voice Embedding Extraction (Cloning)
// Simplified process
1. Load audio file
2. Decode to AudioBuffer
3. Resample to 16kHz if needed
4. Convert stereo → mono
5. Split into 512 chunks
6. Calculate mean & variance per chunk
7. Combine to create embedding
8. Normalize (L2 norm = 1)
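Steps 5 to 8 can be sketched as follows. The steps above do not say how the per-chunk mean and variance are combined into a single value, so this sketch assumes mean plus standard deviation; that rule and the name `extractEmbedding` are hypothetical. As the Limitations section notes, this kind of summary is a simplified stand-in for a trained speaker encoder.

```javascript
// Hypothetical helper: mono 16 kHz samples in, L2-normalized 512-value vector out.
function extractEmbedding(samples, dims = 512) {
  const embedding = new Float32Array(dims);
  // Equal-sized chunks; any leftover samples at the end are ignored.
  const chunkSize = Math.max(1, Math.floor(samples.length / dims));

  for (let i = 0; i < dims; i++) {
    const start = i * chunkSize;
    const end = Math.min(start + chunkSize, samples.length);
    if (end <= start) break; // fewer samples than dims: remaining values stay 0

    let sum = 0;
    for (let j = start; j < end; j++) sum += samples[j];
    const mean = sum / (end - start);

    let squared = 0;
    for (let j = start; j < end; j++) squared += (samples[j] - mean) ** 2;
    const variance = squared / (end - start);

    // Assumption: combine the two statistics as mean + standard deviation.
    embedding[i] = mean + Math.sqrt(variance);
  }

  // Normalize so the vector has L2 norm 1 (skipped for all-zero input).
  let norm = 0;
  for (const v of embedding) norm += v * v;
  norm = Math.sqrt(norm);
  if (norm > 0) {
    for (let i = 0; i < dims; i++) embedding[i] /= norm;
  }
  return embedding;
}
```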
Chunking Algorithm
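function chunkText(text, maxChars = 200) {
  // Split by sentence boundaries. The second alternative keeps trailing text
  // that has no closing punctuation; `?? []` covers input with no matches.
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  // Group sentences into chunks ≤ maxChars.
  // A single sentence longer than maxChars still becomes its own chunk.
  const chunks = [];
  let currentChunk = "";
  for (const sentence of sentences) {
    if ((currentChunk + sentence).length <= maxChars) {
      currentChunk += sentence;
    } else {
      // Avoid pushing an empty chunk when the first sentence is too long
      if (currentChunk.trim()) chunks.push(currentChunk.trim());
      currentChunk = sentence;
    }
  }
  // Flush the last chunk, which the loop itself never pushes
  if (currentChunk.trim()) chunks.push(currentChunk.trim());
  return chunks;
}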
Audio Concatenation
function concatenateAudio(audioArrays, sampleRate) {
  // Calculate total length
  const totalLength = audioArrays.reduce((sum, arr) => sum + arr.length, 0);
  // Merge all chunks
  const result = new Float32Array(totalLength);
  let offset = 0;
  for (const arr of audioArrays) {
    result.set(arr, offset);
    offset += arr.length;
  }
  return result;
}
Advantages
- ✅ Privacy-Focused - All processing in your browser
- ✅ No Server Costs - No backend infrastructure needed
- ✅ Offline Capable - Works after initial model download
- ✅ Unlimited Usage - No API limits or quotas
- ✅ Fast Generation - Optimized chunking for speed
- ✅ High Quality - Microsoft SpeechT5 architecture
- ✅ Free & Open - Apache 2.0 license
Limitations
- ⚠️ Voice Cloning Accuracy - Simplified algorithm (not production-grade)
- ⚠️ First Load Time - ~50MB model download
- ⚠️ Browser Only - Requires modern web browser
- ⚠️ English Optimized - Best results with English text
- ⚠️ Memory Usage - Large texts require more RAM
Comparison
| Feature | This App | Standard SpeechT5 | Cloud TTS APIs |
|---|---|---|---|
| Voices | 7 real + cloning | 1 default | 100+ |
| Text Length | Unlimited | Limited | Varies |
| Voice Cloning | ✅ Yes | ❌ No | ✅ Yes (paid) |
| Privacy | ✅ 100% local | ✅ 100% local | ❌ Cloud |
| Cost | Free | Free | Paid |
| Internet | First load only | First load only | Always |
| Chunking | ✅ Automatic | ❌ Manual | ✅ Handled |
🛠️ Development
Project Structure
.
├── index.html          # Main application
├── assets/
│   └── style.css       # Modern UI styling
├── README.md           # This file
└── upload_script.py    # Hugging Face upload utility
Technology Stack
- Frontend: Vanilla JavaScript (ES6+)
- ML Framework: Transformers.js
- Runtime: ONNX Runtime (WASM)
- Audio Processing: Web Audio API
- Model: Xenova/speecht5_tts
- Embeddings: CMU ARCTIC x-vectors
License
Apache 2.0 - Free for personal and commercial use
Credits
- SpeechT5 Model: Microsoft Research
- ONNX Conversion: Xenova/transformers.js
- Speaker Dataset: CMU ARCTIC
- UI Design: Modern glassmorphism
- Voice Cloning: Web Audio API
Resources
Built with ❤️ using Transformers.js - Bringing AI to the Browser