Feature: Real Voice Cloning with WavLM Speaker Encoder
#4, opened by masbudjj
Real Voice Cloning Implementation
New Features:
- WavLM speaker encoder for real voice extraction
- 192-dim to 512-dim embedding projection
- Audio preprocessing (16kHz resample, normalization)
- Voice preview before generation
- Blend ratio for voice stability (70% custom + 30% default)
- Toggle between default voice and cloned voice
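
The preprocessing feature above could look roughly like the following sketch. It is not the code from this PR: the function names are hypothetical, "normalization" is assumed here to mean peak normalization, and the resampler uses plain linear interpolation with no anti-aliasing filter, which a production pipeline would normally add before downsampling.

```python
from typing import List

TARGET_RATE = 16_000  # 16 kHz, the rate named in the PR description


def resample_linear(samples: List[float], src_rate: int,
                    dst_rate: int = TARGET_RATE) -> List[float]:
    """Resample by linear interpolation between neighbouring samples."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def peak_normalize(samples: List[float], peak: float = 0.95) -> List[float]:
    """Scale so the loudest sample has magnitude `peak` (assumed scheme)."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)  # silence: nothing to scale
    scale = peak / loudest
    return [s * scale for s in samples]


def duration_ok(n_samples: int, rate: int,
                min_s: float = 5.0, max_s: float = 30.0) -> bool:
    """Check the 5-30 second upload window mentioned in the PR."""
    return min_s <= n_samples / rate <= max_s
```

A caller would typically run `duration_ok` on the decoded upload first, then `resample_linear`, then `peak_normalize`, before handing the samples to the speaker encoder.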
How Voice Cloning Works:
- Upload an audio file (5-30 seconds of clear speech)
- Extract speaker features with the WavLM-base-plus-sv model
- Project the 192-dim embedding into the 512-dim SpeechT5 embedding space
- Normalize, then blend with the default voice embedding for stability
- Use the resulting custom embedding for generation
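
The projection and blend steps above can be illustrated with a small sketch. It makes several assumptions: `project_linear` and `blend` are hypothetical names, the projection is read as stretching the 192 values to 512 by interpolating between neighbouring entries, the normalization is taken to be L2, and the blend is a plain weighted sum with no re-normalization afterwards. The actual PR code may differ on any of these points.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors pass through)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def project_linear(vec: Sequence[float], out_dim: int = 512) -> List[float]:
    """Stretch `vec` to `out_dim` entries by linear interpolation, then normalize."""
    n = len(vec)
    if n == 0:
        raise ValueError("empty embedding")
    if n == 1 or out_dim == 1:
        return l2_normalize([vec[0]] * out_dim)
    scale = (n - 1) / (out_dim - 1)  # input positions per output index
    out = []
    for i in range(out_dim):
        pos = i * scale
        lo = min(int(pos), n - 2)  # clamp so lo + 1 stays in range
        frac = pos - lo
        out.append(vec[lo] * (1.0 - frac) + vec[lo + 1] * frac)
    return l2_normalize(out)


def blend(custom: Sequence[float], default: Sequence[float],
          custom_weight: float = 0.7) -> List[float]:
    """Weighted mix: 70% custom + 30% default with the default weight."""
    if len(custom) != len(default):
        raise ValueError("embeddings must have the same length")
    return [custom_weight * c + (1.0 - custom_weight) * d
            for c, d in zip(custom, default)]


# Usage with placeholder data (not real speaker embeddings)
speaker_192 = [math.sin(i / 10.0) for i in range(192)]
default_512 = l2_normalize([1.0] * 512)
final_512 = blend(project_linear(speaker_192), default_512)
```

Interpolation adds no information beyond the original 192 values; it only reshapes the vector to the length the synthesizer expects, which is one reason blending with a known-good default embedding can help stability.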
Technology:
- Speaker Encoder: Xenova/wavlm-base-plus-sv
- Feature pooling: Mean pooling with normalization
- Projection: Linear interpolation with normalization
- Blend: 70% custom + 30% default for quality
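
As a rough illustration of the feature pooling entry above, the sketch below averages per-frame feature vectors into a single utterance-level vector and L2-normalizes the result. The function names are hypothetical, and both the frame layout (a list of equal-length vectors) and the choice of L2 normalization are assumptions rather than details confirmed by the PR.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level features over time into one vector."""
    if not frames:
        raise ValueError("no frames to pool")
    dim = len(frames[0])
    sums = [0.0] * dim
    for frame in frames:
        if len(frame) != dim:
            raise ValueError("all frames must have the same dimension")
        for j, value in enumerate(frame):
            sums[j] += value
    return [s / len(frames) for s in sums]


def unit_length(vec: Sequence[float]) -> List[float]:
    """L2-normalize the pooled vector (zero vectors pass through)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Two toy 2-dim frames stand in for real encoder output
pooled = unit_length(mean_pool([[2.0, 6.0], [4.0, 2.0]]))
```

Mean pooling makes the embedding independent of clip length, so a 5-second and a 30-second upload both yield a vector of the same size.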
masbudjj changed pull request status to merged