---
title: Voice Clone Bench (Chatterbox)
emoji: 🎙️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
short_description: Zero-shot voice cloning + TTS to A/B against ElevenLabs
---

# Voice Clone Bench — Chatterbox Multilingual (zero-shot voice cloning)

A standalone prototype for A/B testing open-weight **voice cloning + TTS** against ElevenLabs.

Powered by **[Chatterbox Multilingual](https://huggingface.co/ResembleAI/chatterbox)** (Resemble AI, MIT license),
which has reportedly been preferred over ElevenLabs in blind preference tests.

## How to use (manual A/B)
1. Upload a **reference audio** clip of the voice to clone (5–20 s of clean speech is ideal).
2. (Optional) Tick **🧹 Remove background audio from reference** to isolate the voice
   (HT-Demucs) before cloning if the clip has music/noise. Use **Preview cleaned reference**
   to hear the isolated result first.
3. Pick the **language** (default: English).
4. Type the **text** to speak (long scripts are auto-chunked at sentence boundaries).
5. Click **Clone & Speak** → you get audio in the cloned voice.

Tip: leave the reference empty to hear a built-in sample voice for the selected language.
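
The auto-chunking mentioned in step 4 could look roughly like the sketch below. It is not taken from `app.py`: the function name `chunk_text`, the 300-character limit, and the punctuation-based splitting rule are all assumptions made for illustration.

```python
import re


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper: the real app may split differently.
    """
    # Split after '.', '!' or '?' when followed by whitespace (a simple heuristic
    # that will also split on abbreviations such as "Dr. Smith").
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", text.strip())
        if s.strip()
    ]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars is kept whole, not cut.
    return chunks


print(chunk_text("One. Two. Three.", max_chars=10))
```

Each chunk would then be synthesized separately and the audio concatenated, which keeps every generation call short without cutting a sentence in half.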

### Cloning defaults (tuned for faithful cloning)
The defaults favor **speaker similarity** over expressiveness:
- `exaggeration=0.4` (neutral delivery)
- `cfg_weight=0.5` (balanced; try ~0.3 when the reference speaker talks fast, and 0.0 for cross-lingual cloning)
- `temperature=0.7` (consistent output)

All knobs are exposed as sliders.
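
For scripted runs, the defaults above can be kept in one place and overridden per call. This is a minimal sketch, not code from the Space: the helper `clone_settings` and the preset names are hypothetical, and only the numeric values come from this README.

```python
# Documented defaults from this README.
DEFAULTS = {"exaggeration": 0.4, "cfg_weight": 0.5, "temperature": 0.7}

# Preset names are invented for this sketch; the values are the
# cfg_weight alternatives mentioned in this README.
CFG_WEIGHT_PRESETS = {"balanced": 0.5, "pace": 0.3, "cross_lingual": 0.0}


def clone_settings(preset: str = "balanced", **overrides: float) -> dict[str, float]:
    """Return keyword arguments for a cloning call (hypothetical helper)."""
    if preset not in CFG_WEIGHT_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    settings = dict(DEFAULTS)
    settings["cfg_weight"] = CFG_WEIGHT_PRESETS[preset]
    # Explicit overrides win over both the defaults and the preset.
    settings.update(overrides)
    return settings


print(clone_settings("cross_lingual", temperature=0.6))
```

The returned dictionary can be unpacked into a call with `**`, so an A/B run only has to state what differs from the defaults.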

## API (for bot integration later)
Gradio exposes a programmatic endpoint named **`clone`** (plus **`isolate_voice`** for
standalone background-audio removal):

```python
from gradio_client import Client, handle_file

client = Client("ZeroPointMonkey/voice-clone-bench")
# Generate speech in the voice of reference.wav.
wav_path = client.predict(
    text="Hey, it's good to finally hear your voice.",
    language_id="en",
    audio_prompt_path=handle_file("reference.wav"),
    exaggeration=0.4,
    cfg_weight=0.5,
    temperature=0.7,
    seed=0,
    clean_reference=False,   # True = strip background music/noise first
    repetition_penalty=2.0,
    min_p=0.05,
    top_p=1.0,
    api_name="/clone",
)
print(wav_path)  # path to generated wav

# Just clean a reference clip (returns isolated-voice wav):
cleaned = client.predict(handle_file("noisy_reference.wav"), api_name="/isolate_voice")
```
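
For A/B work it usually helps to copy each result to a stable, named location, since the client downloads outputs to a temporary directory. The wrapper below is a sketch under assumptions: `clone_and_save` is a hypothetical helper, and it assumes the `/clone` endpoint returns a single file path, as the `print` comment above suggests. It takes the `predict` callable as an argument so it can be exercised without a network connection.

```python
import shutil
from pathlib import Path
from typing import Any, Callable


def clone_and_save(
    predict: Callable[..., Any],
    reference: Any,
    text: str,
    out_path: str,
    language_id: str = "en",
    **overrides: Any,
) -> Path:
    """Call the /clone endpoint and copy the result to out_path.

    Hypothetical helper: `predict` is expected to behave like
    `client.predict`, and `reference` like the value of `handle_file(...)`.
    """
    # Same parameter values as the example call in this README.
    params: dict[str, Any] = dict(
        exaggeration=0.4, cfg_weight=0.5, temperature=0.7, seed=0,
        clean_reference=False, repetition_penalty=2.0, min_p=0.05, top_p=1.0,
    )
    params.update(overrides)

    result = predict(
        text=text,
        language_id=language_id,
        audio_prompt_path=reference,
        api_name="/clone",
        **params,
    )

    dest = Path(out_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(result, dest)  # assumes `result` is a path to the wav file
    return dest
```

With the `client` from the example above, a call might look like `clone_and_save(client.predict, handle_file("reference.wav"), "Hello there.", "out/chatterbox_seed7.wav", seed=7)`.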

## Notes
- Hardware: ZeroGPU (`zero-a10g`). Outputs are PerTh-watermarked by the model.
- License: model weights are **MIT** (Resemble AI / Chatterbox) — free for commercial use.