	---
	title: Voice Clone Bench (Chatterbox)
	emoji: 🎙️
	colorFrom: indigo
	colorTo: purple
	sdk: gradio
	sdk_version: 5.29.0
	app_file: app.py
	pinned: false
	short_description: Zero-shot voice cloning + TTS to A/B against ElevenLabs
	---

	# Voice Clone Bench — Chatterbox Multilingual (zero-shot voice cloning)

	A standalone prototype for A/B testing open-weight voice cloning + TTS against ElevenLabs.

	Powered by [Chatterbox Multilingual](https://huggingface.co/ResembleAI/chatterbox) (Resemble AI, MIT license),
	which Resemble AI reports is preferred over ElevenLabs in blind side-by-side preference tests.

	## How to use (manual A/B)
	1. Upload a reference audio clip of the voice to clone (5–20 s of clean speech is ideal).
	2. (Optional) If the clip has music or noise, tick "🧹 Remove background audio from reference"
	to isolate the voice (HT-Demucs) before cloning. Use "Preview cleaned reference"
	to hear the isolated result first.
	3. Pick the language (default: English).
	4. Type the text to speak (long scripts are auto-chunked at sentence boundaries).
	5. Click "Clone & Speak" to generate audio in the cloned voice.

	Tip: leave the reference empty to hear a built-in sample voice for the selected language.
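
The chunking code itself is not shown in this README. As a rough illustration of step 4, sentence-boundary chunking could look like the sketch below. `chunk_text` and the `max_chars=300` limit are hypothetical names and values chosen for the example, not taken from `app.py`.

```python
import re


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper: the Space's real chunking logic may differ.
    A single sentence longer than max_chars is kept whole, not split.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the audio concatenated in order.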

	### Cloning defaults (tuned for faithful cloning)
	The defaults favor speaker similarity over expressiveness:
	`exaggeration=0.4` (neutral), `cfg_weight=0.5` (balanced; try ~0.3 when the reference speaker talks fast, and 0.0 for cross-lingual cloning),
	`temperature=0.7` (consistent). All of these are exposed as sliders.
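
If you script several runs, it can help to keep these defaults in one place. A minimal sketch, assuming the three values listed above. `CloneSettings` and `settings_for` are hypothetical helpers, and mapping 0.3 to fast-paced reference speakers is an interpretation of the note above rather than documented behavior.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CloneSettings:
    """Faithful-cloning defaults as listed in this README."""
    exaggeration: float = 0.4
    cfg_weight: float = 0.5
    temperature: float = 0.7


def settings_for(cross_lingual: bool = False, fast_speaker: bool = False) -> CloneSettings:
    """Pick a cfg_weight variant (hypothetical presets, see lead-in)."""
    if cross_lingual:
        # Reference clip and target text are in different languages.
        return CloneSettings(cfg_weight=0.0)
    if fast_speaker:
        # Assumed reading of the "~0.3" note: use it for fast talkers.
        return CloneSettings(cfg_weight=0.3)
    return CloneSettings()
```

The result can be unpacked into a call with `**asdict(settings_for(cross_lingual=True))`.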

	## API (for bot integration later)
	Gradio exposes a programmatic endpoint named `clone` (plus `isolate_voice` for
	standalone background-audio removal):

	```python
	from gradio_client import Client, handle_file

	client = Client("ZeroPointMonkey/voice-clone-bench")
	sr_path = client.predict(
	    text="Hey, it's good to finally hear your voice.",
	    language_id="en",
	    audio_prompt_path=handle_file("reference.wav"),
	    exaggeration=0.4,
	    cfg_weight=0.5,
	    temperature=0.7,
	    seed=0,
	    clean_reference=False,  # True = strip background music/noise first
	    repetition_penalty=2.0,
	    min_p=0.05,
	    top_p=1.0,
	    api_name="/clone",
	)
	print(sr_path)  # path to generated wav

	# Just clean a reference clip (returns isolated-voice wav):
	cleaned = client.predict(handle_file("noisy_reference.wav"), api_name="/isolate_voice")
	```
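
For A/B sessions it may be handy to copy each generated clip to a stable, labeled location, since the path returned by `client.predict` usually points into a temporary directory. A small sketch of that step follows. `keep_clip` and the folder layout are hypothetical and not part of the Space.

```python
import shutil
from pathlib import Path


def keep_clip(result_path: str, out_dir: str, label: str) -> Path:
    """Copy a generated clip to out_dir/<label><ext> and return the new path.

    Hypothetical helper for organizing A/B takes. Falls back to ".wav"
    if the source path has no file extension.
    """
    src = Path(result_path)
    dest_dir = Path(out_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{label}{src.suffix or '.wav'}"
    shutil.copy2(src, dest)  # an existing file with the same label is overwritten
    return dest
```

For example, `keep_clip(sr_path, "ab_runs", "chatterbox_take1")` would leave the clip at `ab_runs/chatterbox_take1.wav`, next to whatever you export from ElevenLabs for the same script.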

	## Notes
	- Hardware: ZeroGPU (`zero-a10g`). Outputs are PerTh-watermarked by the model.
	- License: model weights are MIT (Resemble AI / Chatterbox) — free for commercial use.