Feature: Real Voice Cloning with WavLM Speaker Encoder

#4
by masbudjj - opened

Real Voice Cloning Implementation

New Features:

  • WavLM speaker encoder for real voice extraction
  • 192-dim to 512-dim embedding projection
  • Audio preprocessing (16kHz resample, normalization)
  • Voice preview before generation
  • Blend ratio for voice stability (70% custom + 30% default)
  • Toggle between default voice and cloned voice
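The audio preprocessing step listed above could look roughly like the sketch below. The PR does not say which resampler or normalization scheme it uses, so linear-interpolation resampling and peak normalization are assumptions here, and `resampleLinear`, `peakNormalize`, and the `0.95` peak target are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical helpers: the resampling method and normalization scheme are assumptions.
const TARGET_RATE = 16000; // 16 kHz, as stated in the feature list

// Resample mono PCM samples to the target rate using linear interpolation.
function resampleLinear(
  samples: Float32Array,
  fromRate: number,
  toRate: number = TARGET_RATE
): Float32Array {
  if (fromRate === toRate || samples.length === 0) return samples.slice();
  const outLength = Math.max(1, Math.round((samples.length * toRate) / fromRate));
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // source samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), samples.length - 1);
    const right = Math.min(left + 1, samples.length - 1);
    const frac = pos - left;
    out[i] = samples[left] * (1 - frac) + samples[right] * frac;
  }
  return out;
}

// Scale the signal so that its largest absolute sample equals `peak`.
function peakNormalize(samples: Float32Array, peak: number = 0.95): Float32Array {
  let maxAbs = 0;
  for (let i = 0; i < samples.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(samples[i]));
  }
  if (maxAbs === 0) return samples.slice(); // silent input: nothing to scale
  return samples.map((s) => (s / maxAbs) * peak);
}
```

A 48 kHz recording would be passed through `resampleLinear(samples, 48000)` and then `peakNormalize` before reaching the speaker encoder. Linear interpolation applies no anti-aliasing filter, so a production resampler would likely low-pass the signal first.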

How Voice Cloning Works:

  1. Upload an audio file (5-30 seconds of clear speech)
  2. Extract speaker features using the WavLM-base-plus-sv model
  3. Project the 192-dim embedding into the 512-dim SpeechT5 speaker space
  4. Normalize, then blend with the default speaker embedding for stability
  5. Use the resulting custom embedding for generation
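Steps 3 and 4 can be sketched as follows. This is a minimal illustration and not the PR's actual code: it assumes the projection stretches the vector by linear interpolation, that "normalize" means L2 normalization, and that the blend is a plain weighted sum. The function names (`stretchLinear`, `l2Normalize`, `blend`, `toSpeakerEmbedding`) are hypothetical.

```typescript
// Hypothetical sketch of projection, normalization, and blending.

// Stretch a vector to `outDim` entries by linearly interpolating between neighbours.
function stretchLinear(vec: Float32Array, outDim: number): Float32Array {
  const out = new Float32Array(outDim);
  if (vec.length === 0) return out;
  const scale = outDim > 1 ? (vec.length - 1) / (outDim - 1) : 0;
  for (let i = 0; i < outDim; i++) {
    const pos = i * scale;
    const left = Math.min(Math.floor(pos), vec.length - 1);
    const right = Math.min(left + 1, vec.length - 1);
    const frac = pos - left;
    out[i] = vec[left] * (1 - frac) + vec[right] * frac;
  }
  return out;
}

// Scale a vector to unit Euclidean length (assumed normalization scheme).
function l2Normalize(vec: Float32Array): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < vec.length; i++) sumSq += vec[i] * vec[i];
  const norm = Math.sqrt(sumSq);
  return norm === 0 ? vec.slice() : vec.map((v) => v / norm);
}

// Weighted sum: `ratio` of the custom voice plus (1 - ratio) of the default voice.
function blend(
  custom: Float32Array,
  fallback: Float32Array,
  ratio: number = 0.7
): Float32Array {
  if (custom.length !== fallback.length) {
    throw new Error("embeddings must have the same dimension");
  }
  return custom.map((v, i) => ratio * v + (1 - ratio) * fallback[i]);
}

// Full path: raw encoder embedding -> SpeechT5-sized, normalized, blended embedding.
function toSpeakerEmbedding(
  raw: Float32Array,
  defaultEmbedding: Float32Array,
  ratio: number = 0.7
): Float32Array {
  const projected = l2Normalize(stretchLinear(raw, defaultEmbedding.length));
  return blend(projected, defaultEmbedding, ratio);
}
```

With a 192-dim `raw` vector and a 512-dim `defaultEmbedding`, `toSpeakerEmbedding` returns a 512-dim vector weighted 70% toward the uploaded voice. Interpolation only resizes the vector; it does not learn a mapping between the two embedding spaces, which may be why blending with the default voice helps stability.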

Technology:

  • Speaker Encoder: Xenova/wavlm-base-plus-sv
  • Feature pooling: Mean pooling with normalization
  • Projection: Linear interpolation with normalization
  • Blend: 70% custom + 30% default for quality
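The feature pooling entry above can be illustrated with a short sketch. It assumes the encoder yields one feature vector per audio frame and that "normalization" means scaling the pooled vector to unit L2 length; neither detail is confirmed by the PR, and `meanPoolNormalized` is a hypothetical name.

```typescript
// Hypothetical sketch: average per-frame features, then scale to unit length.
function meanPoolNormalized(frames: Float32Array[]): Float32Array {
  if (frames.length === 0) throw new Error("no frames to pool");
  const dim = frames[0].length;
  const pooled = new Float32Array(dim);

  // Sum every frame element-wise.
  for (let t = 0; t < frames.length; t++) {
    if (frames[t].length !== dim) throw new Error("inconsistent frame size");
    for (let d = 0; d < dim; d++) pooled[d] += frames[t][d];
  }

  // Divide by the frame count to get the mean, and accumulate the squared norm.
  let sumSq = 0;
  for (let d = 0; d < dim; d++) {
    pooled[d] /= frames.length;
    sumSq += pooled[d] * pooled[d];
  }

  // Scale to unit L2 length unless the mean is the zero vector.
  const norm = Math.sqrt(sumSq);
  if (norm > 0) {
    for (let d = 0; d < dim; d++) pooled[d] /= norm;
  }
  return pooled;
}
```

Averaging over frames gives a fixed-size vector regardless of clip length, so a 5-second and a 30-second upload produce embeddings of the same dimension.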
masbudjj changed pull request status to merged
