Space: WSYBYT/ybtts

Feature: Real Voice Cloning with WavLM Speaker Encoder

#4

by masbudjj - opened Oct 22, 2025

base: refs/heads/main ← from: refs/pr/4

WSYBYT org, Oct 22, 2025

Real Voice Cloning Implementation

New Features:

WavLM speaker encoder for real voice extraction
192-dim to 512-dim embedding projection
Audio preprocessing (16kHz resample, normalization)
Voice preview before generation
Blend ratio for voice stability (70% custom + 30% default)
Toggle between default voice and cloned voice
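
The preprocessing item above (16kHz resample, normalization) could look roughly like the sketch below. This is not the code from the PR: the function names are hypothetical, the resampler is plain linear interpolation with no anti-alias filter, and "normalization" is assumed here to mean peak normalization.

```python
TARGET_RATE = 16_000  # sample rate named in the PR description


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample a mono signal by linear interpolation (no anti-alias filter)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def peak_normalize(samples, peak=1.0):
    """Scale the signal so its largest absolute sample equals `peak`.

    Assumption: the PR only says "normalization"; peak scaling is one guess.
    """
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)  # silence: nothing to scale
    return [s * peak / loudest for s in samples]


def preprocess(samples, src_rate):
    """Hypothetical helper: resample to 16 kHz, then peak-normalize."""
    return peak_normalize(resample_linear(samples, src_rate))
```

A production resampler would low-pass filter before downsampling; the sketch skips that to stay short.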

How Voice Cloning Works:

Upload an audio file (5-30 seconds of clear speech)
Extract speaker features with the WavLM-base-plus-sv model
Project the 192-dim embedding into the 512-dim SpeechT5 space
Normalize and blend with the default embedding for stability
Use the resulting custom embedding for generation
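
The projection, normalization, and blend steps above can be sketched as follows. This is an illustration under assumptions, not the merged code: the function names are hypothetical, "projection" is read as stretching the vector by linear interpolation between neighbouring components, normalization is assumed to be L2, and the blend is a plain weighted sum with the 70/30 ratio from the description.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors pass through)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0.0 else [v / norm for v in vec]


def project_linear_interp(vec, out_dim=512):
    """Stretch `vec` to `out_dim` entries by linear interpolation.

    Assumption: this is one reading of "linear interpolation" in the PR text.
    """
    n = len(vec)
    if n < 1 or out_dim < 2:
        raise ValueError("need a non-empty input and out_dim >= 2")
    scale = (n - 1) / (out_dim - 1)  # input index advanced per output index
    out = []
    for i in range(out_dim):
        pos = i * scale
        lo = min(int(pos), n - 1)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(vec[lo] * (1.0 - frac) + vec[hi] * frac)
    return out


def blend_embeddings(custom, default, ratio=0.7):
    """Weighted sum: `ratio` of the custom voice, the rest from the default."""
    if len(custom) != len(default):
        raise ValueError("embeddings must have the same length")
    return [ratio * c + (1.0 - ratio) * d for c, d in zip(custom, default)]


def clone_embedding(raw_192, default_512, ratio=0.7):
    """Hypothetical end-to-end helper: project, normalize, then blend."""
    projected = l2_normalize(project_linear_interp(raw_192, len(default_512)))
    return blend_embeddings(projected, default_512, ratio)
```

Whether the result is re-normalized after blending is not stated in the description, so the sketch leaves the blended vector as is.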

Technology:

Speaker Encoder: Xenova/wavlm-base-plus-sv
Feature pooling: Mean pooling with normalization
Projection: Linear interpolation with normalization
Blend: 70% custom + 30% default for quality
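
For the feature-pooling entry above, a minimal sketch of mean pooling with normalization is shown below. It assumes the encoder yields one feature vector per audio frame and that "normalization" means L2; both function names are hypothetical and the code is not taken from the PR.

```python
import math


def mean_pool(frames):
    """Average per-frame feature vectors (frames x dim) into one vector."""
    if not frames:
        raise ValueError("no frames to pool")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same dimension")
    n = len(frames)
    return [sum(f[j] for f in frames) / n for j in range(dim)]


def pooled_embedding(frames):
    """Mean-pool the frames, then scale the result to unit L2 length."""
    pooled = mean_pool(frames)
    norm = math.sqrt(sum(v * v for v in pooled))
    return pooled if norm == 0.0 else [v / norm for v in pooled]


if __name__ == "__main__":
    # Two frames of 2-dim features: mean is [3.0, 4.0], L2 norm is 5.0
    print(pooled_embedding([[1.0, 0.0], [5.0, 8.0]]))  # → [0.6, 0.8]
```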

Commit c898fc00: Feature: Real Voice Cloning with WavLM Speaker Encoder

masbudjj changed pull request status to merged Oct 22, 2025
