---
title: Voice Clone Bench (Chatterbox)
emoji: 🎙️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
short_description: Zero-shot voice cloning + TTS to A/B against ElevenLabs
---
# Voice Clone Bench: Chatterbox Multilingual (zero-shot voice cloning)

A standalone prototype for A/B testing open-weight **voice cloning + TTS** against ElevenLabs.
Powered by **[Chatterbox Multilingual](https://huggingface.co/ResembleAI/chatterbox)** (Resemble AI, MIT license),
which beats ElevenLabs in independent blind preference tests.
## How to use (manual A/B)

1. Upload a **reference audio** clip of the voice to clone (5–20 s of clean speech is ideal).
2. (Optional) Tick **🧹 Remove background audio from reference** to isolate the voice
   (HT-Demucs) before cloning if the clip has music/noise. Use **Preview cleaned reference**
   to hear the isolated result first.
3. Pick the **language** (default: English).
4. Type the **text** to speak (long scripts are auto-chunked at sentence boundaries).
5. Click **Clone & Speak** to get audio in the cloned voice.

Tip: leave the reference empty to hear a built-in sample voice for the selected language.
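
Step 4 mentions that long scripts are auto-chunked at sentence boundaries. The app's actual chunking code is not shown here, so the following is only a minimal sketch of the general idea. The function name `chunk_text`, the `max_chars` limit of 300, and the punctuation-based sentence split are all illustrative assumptions.

```python
import re


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    Hypothetical helper: a single sentence longer than max_chars is kept
    intact as its own chunk rather than being cut mid-sentence.
    """
    # Split after '.', '!' or '?' followed by whitespace (punctuation is kept).
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be synthesized separately and the resulting audio concatenated. A simple regex split like this will mis-handle abbreviations such as "Dr.", so a real implementation may need a more careful sentence splitter.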
### Cloning defaults (tuned for faithful cloning)

The defaults favor **speaker similarity** over expressiveness:
`exaggeration=0.4` (neutral), `cfg_weight=0.5` (balanced; try ~0.3 if the reference speaker talks fast, 0.0 for cross-lingual cloning),
`temperature=0.7` (consistent). All knobs are exposed as sliders.
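
As a rough illustration of how these defaults and the two `cfg_weight` variants might be bundled for scripted runs, here is a small sketch. The helper `clone_params` and its flags are hypothetical and not part of the app. Reading `~0.3` as the setting for a fast-paced reference speaker is an assumption.

```python
# Default values quoted in this README.
FAITHFUL_DEFAULTS = {"exaggeration": 0.4, "cfg_weight": 0.5, "temperature": 0.7}


def clone_params(fast_speaker: bool = False, cross_lingual: bool = False) -> dict:
    """Return a copy of the defaults with cfg_weight adjusted (hypothetical helper)."""
    params = dict(FAITHFUL_DEFAULTS)
    if cross_lingual:
        # Assumption: 0.0 is meant for references in a different language.
        params["cfg_weight"] = 0.0
    elif fast_speaker:
        # Assumption: ~0.3 is meant for fast-paced reference speakers.
        params["cfg_weight"] = 0.3
    return params
```

The resulting dict can be unpacked into a `/clone` API call, for example `client.predict(..., **clone_params(cross_lingual=True), api_name="/clone")`.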
## API (for bot integration later)

Gradio exposes a programmatic endpoint named **`clone`** (plus **`isolate_voice`** for
standalone background-audio removal):
```python
from gradio_client import Client, handle_file

client = Client("ZeroPointMonkey/voice-clone-bench")

sr_path = client.predict(
    text="Hey, it's good to finally hear your voice.",
    language_id="en",
    audio_prompt_path=handle_file("reference.wav"),
    exaggeration=0.4,
    cfg_weight=0.5,
    temperature=0.7,
    seed=0,
    clean_reference=False,  # True = strip background music/noise first
    repetition_penalty=2.0,
    min_p=0.05,
    top_p=1.0,
    api_name="/clone",
)
print(sr_path)  # path to generated wav

# Just clean a reference clip (returns isolated-voice wav):
cleaned = client.predict(handle_file("noisy_reference.wav"), api_name="/isolate_voice")
```
## Notes

- Hardware: ZeroGPU (`zero-a10g`). Outputs are PerTh-watermarked by the model.
- License: model weights are **MIT** (Resemble AI / Chatterbox), free for commercial use.