
masbudjj

Complete Solution: Advanced TTS with Real Voices + Voice Cloning (#12)

43912f2 verified 6 months ago

8.61 kB

title: Advanced TTS - Real Voices + Voice Cloning
emoji: 🎙️
colorFrom: indigo
colorTo: purple
sdk: static
pinned: false
license: apache-2.0

🎙️ Advanced Text-to-Speech System

7 Authentic Voices + Voice Cloning + Unlimited Text - 100% Browser-Based

✨ Key Features

🎭 Dual Voice Modes

📚 Preset Voices (7 Authentic Speakers)

Real speaker embeddings from the CMU ARCTIC dataset:

🇺🇸 American Voices:

Sarah (slt) - Female, Clear & Professional
Clara (clb) - Female, Warm & Friendly
Ben (bdl) - Male, Deep & Authoritative
Robert (rms) - Male, Calm & Relaxed

🌍 International Voices:

Andrew (awb) - Scottish Male, Distinguished
James (jmk) - Canadian Male, Friendly
Kiran (ksp) - Indian Male, Professional

🎤 Voice Cloning Mode

Upload your own voice sample (up to 1 minute) and the system will:

Extract voice characteristics
Auto-compress large files
Resample to optimal quality (16kHz)
Convert stereo to mono
Generate 512-dim voice embedding

Supported formats: WAV, MP3
Max duration: 60 seconds (auto-trim)
Processing: Automatic compression & resampling

📝 Unlimited Text Processing

Smart Chunking System

Automatic splitting - Intelligently splits by sentences
200 chars per chunk - Optimal for quality & speed
Seamless concatenation - Merges all chunks into single audio
Real-time progress - Track each chunk being processed

No character limits! Type as much text as you want.

🎨 Advanced Features

⚙️ Audio Controls

Speed Control - 0.5x to 2.0x playback speed
Real-time adjustment - Change speed during playback
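
The speed control maps naturally onto the standard `playbackRate` property of an HTML audio element. A minimal sketch, assuming the requested value is clamped to the 0.5x to 2.0x range before it is applied; `clampSpeed` and `applySpeed` are hypothetical helper names, not taken from the app's source:

```javascript
// Hypothetical helpers; the names and the clamping policy are assumptions.
const MIN_SPEED = 0.5;
const MAX_SPEED = 2.0;

// Keep a requested speed inside the documented 0.5x-2.0x range.
function clampSpeed(value) {
  const n = Number(value);
  if (!Number.isFinite(n)) return 1.0; // fall back to normal speed
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, n));
}

// Apply the speed to any object exposing `playbackRate`, such as an
// HTMLAudioElement. Changing it during playback takes effect immediately.
function applySpeed(audioElement, value) {
  audioElement.playbackRate = clampSpeed(value);
  return audioElement.playbackRate;
}
```

In a page this would typically be wired to a range input's `input` event, passing the `<audio>` element and the slider value.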

📊 Live Monitoring

Character counter - Total text length
Word counter - Word count
Chunk calculator - Estimated processing chunks
Progress bar - Visual generation progress
Activity log - Detailed processing steps
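
The three counters can be derived from the text alone. The sketch below is one plausible way to compute them, assuming the chunk estimate is simply the text length divided by the 200-character chunk size; `textStats` is a hypothetical name, and the real chunk count can differ because chunks break on sentence boundaries:

```javascript
// Hypothetical helper: derives the values shown by the live counters.
function textStats(text, maxChars = 200) {
  const trimmed = text.trim();
  const characters = text.length;
  // Words are runs of non-whitespace characters.
  const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  // Rough estimate only: actual chunks follow sentence boundaries.
  const estimatedChunks = Math.ceil(trimmed.length / maxChars);
  return { characters, words, estimatedChunks };
}
```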

💾 Download & Playback

Browser audio player - Built-in controls
WAV format - High-quality 16-bit PCM
Download option - Save generated audio
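
Producing a 16-bit PCM WAV file from generated samples only requires a 44-byte header followed by the sample data. The following is a sketch of a standard mono encoder, not the app's actual code; `encodeWav` is a hypothetical name and the 16 kHz default is taken from the sample rate stated in this README:

```javascript
// Encode mono Float32 samples in [-1, 1] as a 16-bit PCM WAV file.
function encodeWav(samples, sampleRate = 16000) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeString = (offset, str) => {
    for (let i = 0; i < str.length; i++) {
      view.setUint8(offset + i, str.charCodeAt(i));
    }
  };

  writeString(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus 8 bytes
  writeString(8, "WAVE");
  writeString(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeString(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clip to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return buffer;
}
```

In the browser, the returned buffer can be wrapped in `new Blob([buffer], { type: "audio/wav" })` and passed to `URL.createObjectURL` to feed both the audio player and a download link.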

🏗️ Technical Architecture

Model & Runtime

Base Model: Microsoft SpeechT5 (Xenova/speecht5_tts)
Runtime: ONNX Runtime (WebAssembly)
Framework: Transformers.js 3.1.2
Execution: 100% client-side (no server)

Voice System

Speaker Embeddings: 512-dimensional x-vectors
Dataset: CMU ARCTIC (7 speakers)
Cloning: Web Audio API + spectral analysis
Format: Float32Array, normalized
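
To make the embedding format concrete, here is a sketch of loading one embedding from raw bytes and checking its length. It assumes each preset is stored as a headerless array of 512 float32 values (2048 bytes) in the platform's native byte order; that storage layout and the helper names `parseEmbedding` and `l2Norm` are assumptions, not details confirmed by this README:

```javascript
const EMBEDDING_DIM = 512;

// Interpret raw bytes as a 512-value float32 speaker embedding.
// Assumption: headerless float32 data, 4 bytes per value.
function parseEmbedding(arrayBuffer) {
  const expectedBytes = EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT;
  if (arrayBuffer.byteLength !== expectedBytes) {
    throw new RangeError(
      `Expected ${expectedBytes} bytes, got ${arrayBuffer.byteLength}`
    );
  }
  return new Float32Array(arrayBuffer);
}

// Euclidean length of a vector; close to 1 for an L2-normalized embedding.
function l2Norm(vector) {
  let sumSquares = 0;
  for (const v of vector) sumSquares += v * v;
  return Math.sqrt(sumSquares);
}
```

The bytes would typically come from `await (await fetch(url)).arrayBuffer()`, where `url` points at wherever the embedding file is hosted.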

Audio Processing

Input Audio
    ↓
Duration Check (trim if > 60s)
    ↓
Resample to 16kHz
    ↓
Convert to Mono
    ↓
Extract Features (mean, variance, spectral)
    ↓
Generate 512-dim Embedding
    ↓
Normalize (L2 norm)
    ↓
Ready for TTS
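
The first three stages of this pipeline can be sketched as pure functions over `Float32Array` channel data. This is an illustration under stated assumptions, not the app's implementation: all function names are hypothetical, and the resampler uses plain linear interpolation with no anti-aliasing filter, which the real app may or may not do:

```javascript
const TARGET_RATE = 16000;
const MAX_SECONDS = 60;

// Keep at most maxSeconds of audio (expects a typed array).
function trimToDuration(samples, sampleRate, maxSeconds = MAX_SECONDS) {
  const maxLength = Math.floor(sampleRate * maxSeconds);
  return samples.subarray(0, Math.min(samples.length, maxLength));
}

// Linear-interpolation resampler (assumption: good enough for a sketch).
function resampleLinear(samples, fromRate, toRate = TARGET_RATE) {
  if (fromRate === toRate) return samples;
  const outLength = Math.round((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, samples.length - 1);
    const frac = pos - i0;
    out[i] = samples[i0] * (1 - frac) + samples[i1] * frac;
  }
  return out;
}

// Average equal-length channels into a single mono signal.
function toMono(channels) {
  const mono = new Float32Array(channels[0].length);
  for (const channel of channels) {
    for (let i = 0; i < mono.length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

// Trim, resample, then downmix, in the order shown in the diagram.
function preprocess(channels, sampleRate, maxSeconds = MAX_SECONDS) {
  const prepared = channels.map((ch) =>
    resampleLinear(trimToDuration(ch, sampleRate, maxSeconds), sampleRate)
  );
  return toMono(prepared);
}
```

In the browser, the channel arrays would come from `AudioBuffer.getChannelData(c)` after decoding the upload with `AudioContext.decodeAudioData`.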

Text Processing Pipeline

User Input Text
    ↓
Split by Sentences
    ↓
Group into 200-char Chunks
    ↓
Process Each Chunk:
  - Generate with TTS
  - Use selected voice embedding
  - Update progress
    ↓
Concatenate All Audio
    ↓
Encode to WAV
    ↓
Present to User
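
The per-chunk loop in this pipeline can be sketched as follows. The `synthesizer` argument is assumed to have the call shape of a Transformers.js text-to-speech pipeline, that is, `(text, { speaker_embeddings })` resolving to `{ audio, sampling_rate }`; `synthesizeChunks` and `onProgress` are hypothetical names introduced for illustration:

```javascript
// Run TTS on each chunk in order and merge the results.
// `synthesizer` is injected so the loop works with any compatible function.
async function synthesizeChunks(
  synthesizer,
  chunks,
  speakerEmbeddings,
  onProgress = () => {}
) {
  const parts = [];
  let sampleRate = 16000; // replaced by the rate the model reports

  for (let i = 0; i < chunks.length; i++) {
    const result = await synthesizer(chunks[i], {
      speaker_embeddings: speakerEmbeddings,
    });
    parts.push(result.audio);
    sampleRate = result.sampling_rate;
    onProgress(i + 1, chunks.length); // e.g. update the progress bar
  }

  // Concatenate all chunk audio into one Float32Array.
  const total = parts.reduce((sum, part) => sum + part.length, 0);
  const audio = new Float32Array(total);
  let offset = 0;
  for (const part of parts) {
    audio.set(part, offset);
    offset += part.length;
  }
  return { audio, sampleRate };
}
```

With Transformers.js, the synthesizer would presumably be created once via `await pipeline("text-to-speech", "Xenova/speecht5_tts")` and reused for every chunk, since loading the model is the expensive step.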

🚀 How It Works

Preset Voice Generation

Select voice from dropdown (e.g., "Sarah - Female")
Enter text (unlimited length)
Click "Generate Speech"
System splits text into chunks
Processes each chunk with selected voice
Concatenates all audio
Presents final WAV file

Voice Cloning Workflow

Switch to "Voice Clone" mode
Upload voice sample (WAV/MP3, max 60s)
Click "Process Voice Sample"
System extracts voice characteristics
Enter text to generate
Click "Generate Speech"
Your voice clone reads the text!

💻 Browser Requirements

Minimum Requirements:

Modern browser (Chrome 90+, Firefox 88+, Safari 14+)
JavaScript enabled
~100MB RAM for model
~50MB storage for model cache

Optimal Experience:

Chrome/Edge with WebGPU support
4GB+ RAM
Fast internet (first load only)
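
Whether the WebGPU path is available can be detected at runtime. A minimal sketch, assuming the app would prefer WebGPU when the browser exposes it and otherwise fall back to WebAssembly; `pickDevice` is a hypothetical helper and the returned strings are illustrative labels:

```javascript
// Choose an execution backend from the browser's capabilities.
// `nav` is injectable so the function can be exercised outside a browser.
function pickDevice(nav = globalThis.navigator) {
  return nav && "gpu" in nav ? "webgpu" : "wasm";
}
```

The presence of `navigator.gpu` only shows that the API exists. A stricter check would also await `navigator.gpu.requestAdapter()`, which resolves to `null` when no suitable adapter is found.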

📊 Performance

Metric	Value
Model Size	~50MB (cached after first load)
Voice Load Time	~5-10s (first time only)
Generation Speed	~2-5s per 200 chars
Sample Rate	16kHz
Audio Format	WAV (16-bit PCM)
Max Text Length	Unlimited (chunked)

🎯 Use Cases

Professional

Corporate videos - Ben (authoritative), Robert (calm)
Training materials - Sarah (clear), Kiran (professional)
Presentations - Clara (warm), James (friendly)

Creative

Audiobooks - Andrew (distinguished), Robert (relaxed)
Podcasts - Use voice cloning for consistency
Voice-overs - Multiple character voices

Accessibility

Screen readers - Clear, natural voices
Language learning - Different accents
Content accessibility - Convert text to audio

🔧 Technical Details

Voice Embedding Extraction (Cloning)

// Simplified process
1. Load audio file
2. Decode to AudioBuffer
3. Resample to 16kHz if needed
4. Convert stereo → mono
5. Split into 512 chunks
6. Calculate mean & variance per chunk
7. Combine to create embedding
8. Normalize (L2 norm = 1)
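
Steps 5 to 8 could look roughly like the sketch below. The list above does not say how the per-chunk mean and variance are combined into a single value, so the rule used here (mean plus standard deviation) is a placeholder assumption, and `extractEmbedding` is a hypothetical name:

```javascript
// Turn mono 16 kHz samples into a unit-length 512-value vector.
// Assumption: one value per chunk, combined as mean + standard deviation.
function extractEmbedding(monoSamples, dims = 512) {
  const embedding = new Float32Array(dims);
  const chunkSize = Math.max(1, Math.floor(monoSamples.length / dims));

  for (let c = 0; c < dims; c++) {
    const start = c * chunkSize;
    const end = Math.min(start + chunkSize, monoSamples.length);
    if (start >= end) break; // input too short: remaining values stay 0

    let sum = 0;
    for (let i = start; i < end; i++) sum += monoSamples[i];
    const mean = sum / (end - start);

    let squaredDiffs = 0;
    for (let i = start; i < end; i++) {
      squaredDiffs += (monoSamples[i] - mean) ** 2;
    }
    const variance = squaredDiffs / (end - start);

    embedding[c] = mean + Math.sqrt(variance);
  }

  // Normalize so the L2 norm is 1 (skipped for an all-zero vector).
  let sumSquares = 0;
  for (const v of embedding) sumSquares += v * v;
  const norm = Math.sqrt(sumSquares);
  if (norm > 0) {
    for (let i = 0; i < dims; i++) embedding[i] /= norm;
  }
  return embedding;
}
```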

Chunking Algorithm

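function chunkText(text, maxChars = 200) {
  // Split by sentence boundaries; the second alternative keeps trailing
  // text that has no closing punctuation. Fall back to [] if nothing matches.
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];

  // Group sentences into chunks ≤ maxChars
  // (a single sentence longer than maxChars becomes its own chunk)
  const chunks = [];
  let currentChunk = "";

  for (const sentence of sentences) {
    if ((currentChunk + sentence).length <= maxChars) {
      currentChunk += sentence;
    } else {
      // Avoid pushing an empty chunk when the first sentence is too long
      if (currentChunk.trim()) chunks.push(currentChunk.trim());
      currentChunk = sentence;
    }
  }

  // Flush the final chunk, which the loop itself never pushes
  if (currentChunk.trim()) chunks.push(currentChunk.trim());

  return chunks;
}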

Audio Concatenation

function concatenateAudio(audioArrays, sampleRate) {
  // Calculate total length
  const totalLength = audioArrays.reduce((sum, arr) =>
    sum + arr.length, 0);

  // Merge all chunks
  const result = new Float32Array(totalLength);
  let offset = 0;

  for (const arr of audioArrays) {
    result.set(arr, offset);
    offset += arr.length;
  }

  return result;
}

🌟 Advantages

✅ Privacy-Focused - All processing in your browser
✅ No Server Costs - No backend infrastructure needed
✅ Offline Capable - Works after initial model download
✅ Unlimited Usage - No API limits or quotas
✅ Fast Generation - Optimized chunking for speed
✅ High Quality - Microsoft SpeechT5 architecture
✅ Free & Open - Apache 2.0 license

📝 Limitations

⚠️ Voice Cloning Accuracy - Simplified algorithm (not production-grade)
⚠️ First Load Time - ~50MB model download
⚠️ Browser Only - Requires modern web browser
⚠️ English Optimized - Best results with English text
⚠️ Memory Usage - Large texts require more RAM

🔍 Comparison

Feature	This App	Standard SpeechT5	Cloud TTS APIs
Voices	7 real + cloning	1 default	100+
Text Length	Unlimited	Limited	Varies
Voice Cloning	✅ Yes	❌ No	✅ Yes (paid)
Privacy	✅ 100% local	✅ 100% local	❌ Cloud
Cost	Free	Free	Paid
Internet	First load only	First load only	Always
Chunking	✅ Automatic	❌ Manual	✅ Handled

🛠️ Development

Project Structure

.
├── index.html              # Main application
├── assets/
│   └── style.css          # Modern UI styling
├── README.md              # This file
└── upload_script.py       # Hugging Face upload utility

Technology Stack

Frontend: Vanilla JavaScript (ES6+)
ML Framework: Transformers.js
Runtime: ONNX Runtime (WASM)
Audio Processing: Web Audio API
Model: Xenova/speecht5_tts
Embeddings: CMU ARCTIC x-vectors

📄 License

Apache 2.0 - Free for personal and commercial use

🙏 Credits

SpeechT5 Model: Microsoft Research
ONNX Conversion: Xenova/transformers.js
Speaker Dataset: CMU ARCTIC
UI Design: Modern glassmorphism
Voice Cloning: Web Audio API

📚 Resources

Built with ❤️ using Transformers.js - Bringing AI to the Browser