metadata
title: Voice Clone Bench (Chatterbox)
emoji: 🎙️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
short_description: Zero-shot voice cloning + TTS to A/B against ElevenLabs

Voice Clone Bench — Chatterbox Multilingual (zero-shot voice cloning)

A standalone prototype for A/B testing open-weight voice cloning + TTS against ElevenLabs.

Powered by Chatterbox Multilingual (Resemble AI, MIT license), which is reported to beat ElevenLabs in independent blind preference tests.

How to use (manual A/B)

  1. Upload a reference audio clip of the voice to clone (5–20 s of clean speech is ideal).
  2. (Optional) If the clip has music or noise, tick 🧹 Remove background audio from reference to isolate the voice with HT-Demucs before cloning. Use Preview cleaned reference to hear the isolated result first.
  3. Pick the language (default: English).
  4. Type the text to speak (long scripts are auto-chunked at sentence boundaries).
  5. Click Clone & Speak → you get audio in the cloned voice.

Tip: leave the reference empty to hear a built-in sample voice for the selected language.
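
The auto-chunking mentioned in step 4 can be pictured roughly as follows. This is a minimal sketch, not the Space's actual code: the function name `chunk_text`, the regex-based sentence splitter, and the 300-character limit are all assumptions made for illustration.

```python
import re


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible.

    Hypothetical helper: max_chars=300 is an assumed limit, not a value
    taken from the Space.
    """
    # Split after '.', '!' or '?' followed by whitespace, keeping the
    # punctuation attached to its sentence.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence, so chunks can exceed the limit in that case.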

Cloning defaults (tuned for faithful cloning)

These defaults favor speaker similarity over expressiveness: exaggeration=0.4 (neutral), cfg_weight=0.5 (balanced; try ~0.3 when the reference speaker talks fast, or 0.0 for cross-lingual cloning), temperature=0.7 (consistent output). All knobs are exposed as sliders.
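
As a rough illustration, the defaults and the two cfg_weight adjustments could be captured in a small helper. The names `FAITHFUL_DEFAULTS` and `pick_settings` are hypothetical, and reading "~0.3" as the value to use for a fast-paced reference speaker is an assumption about what the note above means.

```python
# Default values quoted in this README.
FAITHFUL_DEFAULTS = {"exaggeration": 0.4, "cfg_weight": 0.5, "temperature": 0.7}


def pick_settings(cross_lingual: bool = False, fast_speaker: bool = False) -> dict[str, float]:
    """Return a copy of the defaults with cfg_weight adjusted for the use case.

    Hypothetical helper; the two flags are illustrative, not part of the Space.
    """
    settings = dict(FAITHFUL_DEFAULTS)
    if cross_lingual:
        # README note: 0.0 for cross-lingual cloning.
        settings["cfg_weight"] = 0.0
    elif fast_speaker:
        # Assumed reading of "~0.3 faster pace": use ~0.3 for fast speakers.
        settings["cfg_weight"] = 0.3
    return settings
```

The returned dict can be unpacked into any call that accepts these three keyword arguments.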

API (for bot integration later)

Gradio exposes a programmatic endpoint named clone (plus isolate_voice for standalone background-audio removal):

from gradio_client import Client, handle_file

client = Client("ZeroPointMonkey/voice-clone-bench")
sr_path = client.predict(
    text="Hey, it's good to finally hear your voice.",
    language_id="en",
    audio_prompt_path=handle_file("reference.wav"),
    exaggeration=0.4,
    cfg_weight=0.5,
    temperature=0.7,
    seed=0,
    clean_reference=False,   # True = strip background music/noise first
    repetition_penalty=2.0,
    min_p=0.05,
    top_p=1.0,
    api_name="/clone",
)
print(sr_path)  # path to generated wav

# Just clean a reference clip (returns isolated-voice wav):
cleaned = client.predict(handle_file("noisy_reference.wav"), api_name="/isolate_voice")
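
For an A/B session with several test sentences, the clone call above can be wrapped in a small loop. The sketch below is one possible shape, not part of the Space: `clone_lines`, the `ab_out` directory, and the `line_001.wav` naming are hypothetical, and it assumes, as the example above suggests, that `/clone` returns a single path to a wav file.

```python
import shutil
from pathlib import Path


def clone_lines(client, reference, lines, out_dir="ab_out", language_id="en", **overrides):
    """Synthesize each line through /clone and copy the results into out_dir.

    `client` is a gradio_client.Client for the Space and `reference` is the
    value produced by handle_file(...). Hypothetical helper for illustration.
    """
    # Parameter values copied from the example call in this README.
    params = {
        "exaggeration": 0.4,
        "cfg_weight": 0.5,
        "temperature": 0.7,
        "seed": 0,
        "clean_reference": False,
        "repetition_penalty": 2.0,
        "min_p": 0.05,
        "top_p": 1.0,
    }
    unknown = set(overrides) - set(params)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    params.update(overrides)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    saved = []
    for index, text in enumerate(lines, start=1):
        # Assumed: the endpoint returns a local path to the generated wav.
        wav_path = client.predict(
            text=text,
            language_id=language_id,
            audio_prompt_path=reference,
            api_name="/clone",
            **params,
        )
        target = out / f"line_{index:03d}.wav"
        shutil.copy(wav_path, target)
        saved.append(target)
    return saved
```

With the client from the example above, a call might look like `clone_lines(client, handle_file("reference.wav"), ["First line.", "Second line."], cfg_weight=0.3)`. Passing the client in as an argument keeps the helper easy to exercise with a stub when the Space is unavailable.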

Notes

  • Hardware: ZeroGPU (zero-a10g). Outputs are PerTh-watermarked by the model.
  • License: model weights are MIT (Resemble AI / Chatterbox) — free for commercial use.