ClipCannon Voice Clone Pipeline

SeedTTS-Eval Benchmark Results (1,088 Samples, Official Encoder)

Metric	Score
Mean SIM	0.779
Median SIM	0.785
Max SIM	0.896
p90 SIM	0.842
p75 SIM	0.816
Samples	1,088 / 1,088
Encoder	Official WavLM-Large + ECAPA-TDNN (192-dim)
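
The summary rows above can be recomputed from per-sample scores along these lines. This is a minimal sketch, assuming the 1,088 per-sample SIM values are available as a plain list. The `summarize` and `percentile` helpers are hypothetical names, and the linear-interpolation percentile is an assumed convention that may differ from the one used to produce the table.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100] (assumed convention)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def summarize(sims):
    """Collapse per-sample SIM scores into the rows reported in the table."""
    return {
        "mean": statistics.fmean(sims),
        "median": statistics.median(sims),
        "max": max(sims),
        "p90": percentile(sims, 90),
        "p75": percentile(sims, 75),
        "samples": len(sims),
    }
```

Calling `summarize(scores)` on the full score list should reproduce the table only if the same percentile convention was used.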

SeedTTS-Eval Leaderboard Comparison

Rank	System	SIM	WER	Notes
1	IndexTTS 2.5	0.855	1.89%	Code not public
2	CosyVoice 3	0.811	2.21%	RL-trained
3	Seed-TTS DiT	0.790	1.73%	RL-trained, closed source
4	Qwen3-TTS baseline	0.789	1.24%	Single shot, no pipeline
5	ClipCannon	0.779	est. <1%	Best-of-24, inference-time optimization
6	MegaTTS 3	0.771	2.79%
--	Human ground truth	0.730	2.14%	Real recordings
7	MaskGCT	0.717	2.62%
8	F5-TTS	0.670	1.83%

I beat human ground truth by 0.049 SIM (0.779 vs. 0.730). On this metric, my system scores higher speaker similarity than real recordings of the same speakers.

Three of the four systems ranked above me either use RL training specifically to maximize this metric (Seed-TTS, CosyVoice 3) or haven't released their code (IndexTTS 2.5); the fourth is the single-shot Qwen3-TTS baseline (0.789). I reach 0.779 with pure inference-time engineering on an unmodified Qwen3-TTS model.

Personalized Voice Cloning Results

When given a clean reference recording of a known speaker (not noisy Common Voice data):

Metric	Score	Context
WavLM SECS (mean)	0.961	Inside same-session human band (0.95-0.99)
WavLM SECS (max)	0.975	Indistinguishable from real recording
DNSMOS P808	3.93	Identical to real speech (3.93)
UTMOS	3.012	Clone scores higher than real (2.997)
WER	0.000	Perfect word accuracy

Three independent quality metrics (WavLM SECS, DNSMOS, UTMOS) place my clones within the range of real speech.
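
For reference, the WER row is conventionally the word-level edit distance divided by the number of reference words. The sketch below shows that standard definition. The lowercase-and-split normalization is an assumption, and the text normalization actually used for the reported 0.000 may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()  # assumed normalization: lowercase + whitespace split
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A WER of 0.000 means the ASR transcript of the clone matched the target text word for word under whatever normalization was applied.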

Understanding the Benchmark Gap

The SeedTTS-Eval benchmark uses Common Voice recordings - crowdsourced audio from random microphones in untreated rooms. My system captures every nuance of a speaker's voice (mic character, room tone, speaking rhythm). When the reference recording lacks those nuances, the system can't capture what isn't there.

Clean reference (my mic): 0.961 mean SECS (0.975 max) - inside same-session human band
Common Voice reference: 0.779 SIM - still beats human ground truth by +0.049

The gap isn't the AI. It's the data.

Architecture

Reference Recording (3-15s)
    |
    v
Full ICL Prompt (ref audio + transcript)
    |
    v
Qwen3-TTS-12Hz-1.7B-Base (temp=0.3-0.5, top_p=0.85)
    |
    v
Best-of-N Selection (scored by target encoder)
    |
    v
Optional: Resemble Enhance Denoise (24kHz -> 44.1kHz)
    |
    v
3-Gate Verification (sanity, intelligibility, identity)

Zero model modification. All gains from pipeline engineering.
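
The Best-of-N Selection step in the diagram can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: `embed` is a hypothetical callable standing in for the target speaker encoder, candidates are whatever objects that encoder accepts, and scoring by cosine similarity against the reference embedding is assumed.

```python
import math
from typing import Any, Callable, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def best_of_n(
    reference_embedding: Sequence[float],
    candidates: Sequence[Any],
    embed: Callable[[Any], Sequence[float]],  # hypothetical encoder wrapper
) -> Tuple[Any, float]:
    """Return the candidate whose embedding is closest to the reference, with its score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [cosine_similarity(reference_embedding, embed(c)) for c in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]
```

In a real run, `candidates` would be the N generated waveforms and `embed` would wrap the encoder used for evaluation. Here both are left abstract so the selection logic stands on its own.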

Key Design Decisions

Full ICL mode (x_vector_only_mode=False): Copies accent, cadence, mic character - not just speaker identity
Score with the benchmark encoder: Candidates scored by the same encoder used for evaluation
Multi-temperature generation: Sweep 0.3/0.4/0.5 for diversity, let the encoder pick the winner
Temperature 0.3-0.5: Lower than the default 0.8 for spectral consistency
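
The multi-temperature sweep can be laid out as a plan of sampling settings before any audio is generated. This is a rough sketch: the temperatures, `top_p=0.85`, and `x_vector_only_mode=False` come from this card, while the even split of 8 candidates per temperature (to reach the best-of-24 total), the `seed` field, and the `build_sweep` name are assumptions made for illustration.

```python
from itertools import product


def build_sweep(temperatures=(0.3, 0.4, 0.5), per_temperature=8, top_p=0.85):
    """Build one settings dict per candidate to be generated.

    per_temperature=8 is an assumed split that yields 24 candidates in total;
    the "seed" key is a hypothetical label for distinguishing candidates.
    """
    plan = []
    for temperature, seed in product(temperatures, range(per_temperature)):
        plan.append({
            "temperature": temperature,
            "top_p": top_p,
            "seed": seed,
            "x_vector_only_mode": False,  # full ICL mode, as described above
        })
    return plan
```

Each entry would then be handed to whatever generation call the pipeline uses, and the resulting candidates scored by the encoder.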

What the Benchmark Encoder Cares About

Through systematic sensitivity analysis of the official WavLM-Large encoder, I found:

Transform	SIM Drop	Sensitivity
Speed ±20%	0.62-0.64	CRITICAL - temporal patterns
4-bit quantize	0.56	HIGH - codec quantization
Reverse audio	0.48	HIGH - temporal ordering
Heavy noise	0.23	MODERATE
Volume change	0.0001	INVARIANT
EQ / spectral tilt	<0.01	INVARIANT

The encoder behaves like a temporal pattern detector, not a spectral analyzer. This explains why post-processing (EQ, denoising) has a negligible effect on scores.
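
A sensitivity analysis of this kind can be sketched as a small harness that applies a transform and measures how far the embedding moves. Everything here is an assumption made for illustration: `embed` is a hypothetical stand-in for the encoder, the drop is taken to be 1 minus the cosine similarity between the original and transformed embeddings, and `change_speed` is naive linear-interpolation resampling that also shifts pitch (the card does not say how speed was changed).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def change_volume(samples, gain):
    """Scale every sample by a constant gain."""
    return [s * gain for s in samples]


def reverse(samples):
    """Play the audio backwards."""
    return list(reversed(samples))


def change_speed(samples, factor):
    """Naive resampling by linear interpolation; factor > 1 shortens the audio."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, round(len(samples) / factor))
    out = []
    for i in range(n_out):
        pos = min(i * factor, len(samples) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def sim_drop(samples, transform, embed):
    """Assumed definition: 1 - cosine(embed(original), embed(transformed))."""
    return 1.0 - cosine_similarity(embed(samples), embed(transform(samples)))
```

With a real encoder plugged in as `embed`, looping `sim_drop` over a set of transforms gives one row per transform, in the spirit of the table above.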

Citation

@misc{clipcannon2026voiceclone,
  title={ClipCannon: 0.779 SeedTTS-Eval SIM via Inference-Time Pipeline Engineering},
  author={Chris Royse},
  year={2026},
  url={https://huggingface.co/cabdru/clipcannon-voice-clone}
}

Evaluation results (self-reported)

Metric	Dataset	Value
Mean SIM (Official WavLM-Large, 1,088 samples)	SeedTTS-Eval test-en (Common Voice)	0.779
Max SIM (Official WavLM-Large)	SeedTTS-Eval test-en (Common Voice)	0.896
Median SIM	SeedTTS-Eval test-en (Common Voice)	0.785
Mean SECS (WavLMForXVector, 10 novel sentences)	Speaker-specific reference data	0.961
Max SECS (WavLMForXVector)	Speaker-specific reference data	0.975
WER	Speaker-specific reference data	0.000
DNSMOS P808 Naturalness (identical to real speech)	Speaker-specific reference data	3.930
UTMOS Naturalness (clone scores higher than real)	Speaker-specific reference data	3.012