title: Advanced TTS - Real Voices + Voice Cloning
emoji: 🎙️
colorFrom: indigo
colorTo: purple
sdk: static
pinned: false
license: apache-2.0

🎙️ Advanced Text-to-Speech System

7 Authentic Voices + Voice Cloning + Unlimited Text - 100% Browser-Based

✨ Key Features

🎭 Dual Voice Modes

📚 Preset Voices (7 Authentic Speakers)

Real speaker embeddings from the CMU ARCTIC dataset:

🇺🇸 American Voices:

  • Sarah (slt) - Female, Clear & Professional
  • Clara (clb) - Female, Warm & Friendly
  • Ben (bdl) - Male, Deep & Authoritative
  • Robert (rms) - Male, Calm & Relaxed

🌍 International Voices:

  • Andrew (awb) - Scottish Male, Distinguished
  • James (jmk) - Canadian Male, Friendly
  • Kiran (ksp) - Indian Male, Professional

🎀 Voice Cloning Mode

Upload your own voice sample (up to 1 minute) and the system will:

  • Extract voice characteristics
  • Auto-compress large files
  • Resample to optimal quality (16kHz)
  • Convert stereo to mono
  • Generate 512-dim voice embedding

  • Supported formats: WAV, MP3
  • Max duration: 60 seconds (auto-trim)
  • Processing: Automatic compression & resampling
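The stereo-to-mono and auto-trim steps above can be sketched with plain typed arrays. This is a minimal sketch, not the app's actual code: `toMono` and `trimToDuration` are hypothetical helper names, and the channel data is assumed to come from something like `AudioContext.decodeAudioData` followed by `AudioBuffer.getChannelData(i)` in the browser.

```javascript
// Average all channels into one mono Float32Array.
// `channels` is an array of equal-length Float32Arrays, one per channel
// (assumption: collected via AudioBuffer.getChannelData in the browser).
function toMono(channels) {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

// Keep at most `maxSeconds` of audio (60 s is the limit stated above).
function trimToDuration(samples, sampleRate, maxSeconds = 60) {
  const maxSamples = Math.floor(sampleRate * maxSeconds);
  return samples.length > maxSamples ? samples.subarray(0, maxSamples) : samples;
}
```

Trimming before any further processing keeps memory use bounded even when a long recording is uploaded.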


📝 Unlimited Text Processing

Smart Chunking System

  • Automatic splitting - Intelligently splits by sentences
  • 200 chars per chunk - Optimal for quality & speed
  • Seamless concatenation - Merges all chunks into single audio
  • Real-time progress - Track each chunk being processed

No character limits! Type as much text as you want.


🎨 Advanced Features

⚙️ Audio Controls

  • Speed Control - 0.5x to 2.0x playback speed
  • Real-time adjustment - Change speed during playback
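A speed control like this usually maps to the `playbackRate` property of the browser's audio element, which can be changed while audio is playing. The sketch below is an assumption about how the range could be enforced; `clampSpeed` and `setPlaybackSpeed` are hypothetical names.

```javascript
const MIN_SPEED = 0.5;
const MAX_SPEED = 2.0;

// Clamp a requested speed into the supported range.
// Non-numeric input falls back to normal speed (1.0).
function clampSpeed(value) {
  const speed = Number(value);
  if (!Number.isFinite(speed)) return 1.0;
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}

// `player` is any object with a `playbackRate` property,
// for example an <audio> element in the page.
function setPlaybackSpeed(player, value) {
  player.playbackRate = clampSpeed(value);
  return player.playbackRate;
}
```

Wiring `setPlaybackSpeed` to a slider's `input` event would give the real-time adjustment described above.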

📊 Live Monitoring

  • Character counter - Total text length
  • Word counter - Word count
  • Chunk calculator - Estimated processing chunks
  • Progress bar - Visual generation progress
  • Activity log - Detailed processing steps
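The three counters can be derived from the input text alone. The following is a rough sketch (`textStats` is a hypothetical name); the chunk figure is only an estimate, because sentence-aware chunking does not split exactly every 200 characters.

```javascript
// Compute the live counters for the current input text.
function textStats(text, maxChars = 200) {
  const trimmed = text.trim();
  const characters = text.length;
  const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  // Rough estimate: the real chunk count depends on sentence boundaries
  const estimatedChunks = Math.ceil(trimmed.length / maxChars);
  return { characters, words, estimatedChunks };
}
```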

💾 Download & Playback

  • Browser audio player - Built-in controls
  • WAV format - High-quality 16-bit PCM
  • Download option - Save generated audio

🏗️ Technical Architecture

Model & Runtime

  • Base Model: Microsoft SpeechT5 (Xenova/speecht5_tts)
  • Runtime: ONNX Runtime (WebAssembly)
  • Framework: Transformers.js 3.1.2
  • Execution: 100% client-side (no server)

Voice System

  • Speaker Embeddings: 512-dimensional x-vectors
  • Dataset: CMU ARCTIC (7 speakers)
  • Cloning: Web Audio API + spectral analysis
  • Format: Float32Array, normalized

Audio Processing

Input Audio
    ↓
Duration Check (trim if > 60s)
    ↓
Resample to 16kHz
    ↓
Convert to Mono
    ↓
Extract Features (mean, variance, spectral)
    ↓
Generate 512-dim Embedding
    ↓
Normalize (L2 norm)
    ↓
Ready for TTS
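The "Resample to 16kHz" step can be illustrated with linear interpolation. This is a simplified sketch under stated assumptions: `resampleLinear` is a hypothetical helper, it applies no anti-aliasing filter, and a browser implementation might instead rely on the Web Audio API to do the conversion.

```javascript
// Resample mono audio by linear interpolation between neighbouring samples.
// Note: no low-pass filter is applied, so downsampling may alias.
function resampleLinear(samples, fromRate, toRate = 16000) {
  if (fromRate === toRate) return samples;
  const outLength = Math.round((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;                      // position in the source signal
    const left = Math.floor(pos);
    const right = Math.min(left + 1, samples.length - 1);
    const frac = pos - left;
    out[i] = samples[left] * (1 - frac) + samples[right] * frac;
  }
  return out;
}
```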

Text Processing Pipeline

User Input Text
    ↓
Split by Sentences
    ↓
Group into 200-char Chunks
    ↓
Process Each Chunk:
  - Generate with TTS
  - Use selected voice embedding
  - Update progress
    ↓
Concatenate All Audio
    ↓
Encode to WAV
    ↓
Present to User
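The per-chunk loop in this pipeline might look like the sketch below. It assumes `synthesizer` behaves like a Transformers.js text-to-speech pipeline, meaning an async function that takes text plus a `speaker_embeddings` option and resolves to an object with `audio` and `sampling_rate`. `synthesizeChunks` and `onProgress` are hypothetical names used only for illustration.

```javascript
// `synthesizer` is assumed to behave like a Transformers.js TTS pipeline:
//   async (text, { speaker_embeddings }) => { audio, sampling_rate }
// In the browser it could be created with something like:
//   const synthesizer = await pipeline("text-to-speech", "Xenova/speecht5_tts");
async function synthesizeChunks(chunks, synthesizer, speakerEmbeddings, onProgress = () => {}) {
  const parts = [];
  let samplingRate = 16000; // documented output rate; replaced by actual results

  for (let i = 0; i < chunks.length; i++) {
    const result = await synthesizer(chunks[i], { speaker_embeddings: speakerEmbeddings });
    parts.push(result.audio);
    samplingRate = result.sampling_rate;
    onProgress(i + 1, chunks.length); // e.g. update the progress bar
  }

  // Concatenate all chunk audio into one buffer
  const total = parts.reduce((sum, part) => sum + part.length, 0);
  const audio = new Float32Array(total);
  let offset = 0;
  for (const part of parts) {
    audio.set(part, offset);
    offset += part.length;
  }
  return { audio, samplingRate };
}
```

Processing chunks sequentially rather than in parallel keeps memory use predictable and makes progress reporting straightforward.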

🚀 How It Works

Preset Voice Generation

  1. Select voice from dropdown (e.g., "Sarah - Female")
  2. Enter text (unlimited length)
  3. Click "Generate Speech"
  4. System splits text into chunks
  5. Processes each chunk with selected voice
  6. Concatenates all audio
  7. Presents final WAV file

Voice Cloning Workflow

  1. Switch to "Voice Clone" mode
  2. Upload voice sample (WAV/MP3, max 60s)
  3. Click "Process Voice Sample"
  4. System extracts voice characteristics
  5. Enter text to generate
  6. Click "Generate Speech"
  7. Your voice clone reads the text!

💻 Browser Requirements

Minimum Requirements:

  • Modern browser (Chrome 90+, Firefox 88+, Safari 14+)
  • JavaScript enabled
  • ~100MB RAM for model
  • ~50MB storage for model cache

Optimal Experience:

  • Chrome/Edge with WebGPU support
  • 4GB+ RAM
  • Fast internet (first load only)

📊 Performance

| Metric | Value |
|---|---|
| Model Size | ~50MB (cached after first load) |
| Voice Load Time | ~5-10s (first time only) |
| Generation Speed | ~2-5s per 200 chars |
| Sample Rate | 16kHz |
| Audio Format | WAV (16-bit PCM) |
| Max Text Length | Unlimited (chunked) |
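Using the generation speed above, a rough time estimate for a given text length follows from simple arithmetic. This is only a back-of-the-envelope sketch (`estimateGenerationSeconds` is a hypothetical name), and actual times depend on the device.

```javascript
// Estimate generation time from the "~2-5s per 200 chars" figure.
function estimateGenerationSeconds(charCount, maxChars = 200) {
  const chunks = Math.ceil(charCount / maxChars);
  return { chunks, minSeconds: chunks * 2, maxSeconds: chunks * 5 };
}
```

For example, 1,000 characters is about 5 chunks, so roughly 10 to 25 seconds.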

🎯 Use Cases

Professional

  • Corporate videos - Ben (authoritative), Robert (calm)
  • Training materials - Sarah (clear), Kiran (professional)
  • Presentations - Clara (warm), James (friendly)

Creative

  • Audiobooks - Andrew (distinguished), Robert (relaxed)
  • Podcasts - Use voice cloning for consistency
  • Voice-overs - Multiple character voices

Accessibility

  • Screen readers - Clear, natural voices
  • Language learning - Different accents
  • Content accessibility - Convert text to audio

🔧 Technical Details

Voice Embedding Extraction (Cloning)

// Simplified process
1. Load audio file
2. Decode to AudioBuffer
3. Resample to 16kHz if needed
4. Convert stereo → mono
5. Split into 512 chunks
6. Calculate mean & variance per chunk
7. Combine to create embedding
8. Normalize (L2 norm = 1)
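Steps 5 to 8 can be sketched as follows. The README does not say how the mean and variance of each chunk are combined, so this sketch makes an arbitrary, clearly labeled choice (mean plus standard deviation). `extractEmbedding` is a hypothetical name, and the result is a simple statistical fingerprint, not a true x-vector.

```javascript
// Build a `dims`-dimensional embedding from mono samples.
function extractEmbedding(samples, dims = 512) {
  if (samples.length < dims) {
    throw new Error(`Need at least ${dims} samples, got ${samples.length}`);
  }
  const embedding = new Float32Array(dims);
  const chunkSize = Math.floor(samples.length / dims); // leftover tail samples are ignored

  for (let d = 0; d < dims; d++) {
    const start = d * chunkSize;
    let sum = 0;
    let sumSquares = 0;
    for (let i = start; i < start + chunkSize; i++) {
      sum += samples[i];
      sumSquares += samples[i] * samples[i];
    }
    const mean = sum / chunkSize;
    const variance = Math.max(0, sumSquares / chunkSize - mean * mean);
    // ASSUMPTION: combine the two statistics as mean + standard deviation
    embedding[d] = mean + Math.sqrt(variance);
  }

  // L2-normalize so the vector has unit length
  let norm = 0;
  for (const v of embedding) norm += v * v;
  norm = Math.sqrt(norm);
  if (norm > 0) {
    for (let d = 0; d < dims; d++) embedding[d] /= norm;
  }
  return embedding;
}
```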

Chunking Algorithm

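function chunkText(text, maxChars = 200) {
  // Split by sentence boundaries; the second alternative keeps trailing
  // text that has no closing punctuation. Fall back to [] when nothing matches.
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];

  // Group sentences into chunks ≤ maxChars
  // (a single sentence longer than maxChars becomes its own oversized chunk)
  const chunks = [];
  let currentChunk = "";

  for (const sentence of sentences) {
    if ((currentChunk + sentence).length <= maxChars) {
      currentChunk += sentence;
    } else {
      // Flush the current chunk (skipping empty ones) and start a new one
      if (currentChunk.trim()) chunks.push(currentChunk.trim());
      currentChunk = sentence;
    }
  }

  // Push the final chunk, which the loop itself never flushes
  if (currentChunk.trim()) chunks.push(currentChunk.trim());

  return chunks;
}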

Audio Concatenation

function concatenateAudio(audioArrays, sampleRate) {
  // Calculate total length
  const totalLength = audioArrays.reduce((sum, arr) =>
    sum + arr.length, 0);

  // Merge all chunks
  const result = new Float32Array(totalLength);
  let offset = 0;

  for (const arr of audioArrays) {
    result.set(arr, offset);
    offset += arr.length;
  }

  return result;
}
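After concatenation, the Float32 samples still need to be wrapped in a WAV container. The sketch below writes a standard 44-byte header for mono 16-bit PCM; `encodeWav` is a hypothetical name and not necessarily how the app implements this step.

```javascript
// Encode mono Float32 samples (range -1..1) as a 16-bit PCM WAV file.
function encodeWav(samples, sampleRate = 16000) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeString = (offset, text) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeString(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                // remaining file size
  writeString(8, "WAVE");
  writeString(12, "fmt ");
  view.setUint32(16, 16, true);                          // fmt chunk size
  view.setUint16(20, 1, true);                           // format: PCM
  view.setUint16(22, 1, true);                           // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeString(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp each sample and scale it to the signed 16-bit range
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff, true);
  }
  return buffer;
}
```

In the browser, the returned buffer can be wrapped as `new Blob([buffer], { type: "audio/wav" })` and passed to `URL.createObjectURL` for playback or download.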

🌟 Advantages

✅ Privacy-Focused - All processing in your browser
✅ No Server Costs - No backend infrastructure needed
✅ Offline Capable - Works after initial model download
✅ Unlimited Usage - No API limits or quotas
✅ Fast Generation - Optimized chunking for speed
✅ High Quality - Microsoft SpeechT5 architecture
✅ Free & Open - Apache 2.0 license


📝 Limitations

⚠️ Voice Cloning Accuracy - Simplified algorithm (not production-grade)
⚠️ First Load Time - ~50MB model download
⚠️ Browser Only - Requires modern web browser
⚠️ English Optimized - Best results with English text
⚠️ Memory Usage - Large texts require more RAM


🔍 Comparison

| Feature | This App | Standard SpeechT5 | Cloud TTS APIs |
|---|---|---|---|
| Voices | 7 real + cloning | 1 default | 100+ |
| Text Length | Unlimited | Limited | Varies |
| Voice Cloning | ✅ Yes | ❌ No | ✅ Yes (paid) |
| Privacy | ✅ 100% local | ✅ 100% local | ❌ Cloud |
| Cost | Free | Free | Paid |
| Internet | First load only | First load only | Always |
| Chunking | ✅ Automatic | ❌ Manual | ✅ Handled |

🛠️ Development

Project Structure

.
├── index.html              # Main application
├── assets/
│   └── style.css          # Modern UI styling
├── README.md              # This file
└── upload_script.py       # Hugging Face upload utility

Technology Stack

  • Frontend: Vanilla JavaScript (ES6+)
  • ML Framework: Transformers.js
  • Runtime: ONNX Runtime (WASM)
  • Audio Processing: Web Audio API
  • Model: Xenova/speecht5_tts
  • Embeddings: CMU ARCTIC x-vectors

📄 License

Apache 2.0 - Free for personal and commercial use


🙏 Credits

  • SpeechT5 Model: Microsoft Research
  • ONNX Conversion: Xenova/transformers.js
  • Speaker Dataset: CMU ARCTIC
  • UI Design: Modern glassmorphism
  • Voice Cloning: Web Audio API

📚 Resources


Built with ❤️ using Transformers.js - Bringing AI to the Browser