ClipCannon Voice Clone Pipeline

SeedTTS-Eval Benchmark Results (1,088 Samples, Official Encoder)

Metric Score
Mean SIM 0.779
Median SIM 0.785
Max SIM 0.896
p90 SIM 0.842
p75 SIM 0.816
Samples 1,088 / 1,088
Encoder Official WavLM-Large + ECAPA-TDNN (192-dim)

SeedTTS-Eval Leaderboard Comparison

Rank System SIM WER Notes
1 IndexTTS 2.5 0.855 1.89% Code not public
2 CosyVoice 3 0.811 2.21% RL-trained
3 Seed-TTS DiT 0.790 1.73% RL-trained, closed source
4 Qwen3-TTS baseline 0.789 1.24% Single shot, no pipeline
5 ClipCannon 0.779 est. <1% Best-of-24, inference-time optimization
6 MegaTTS 3 0.771 2.79%
-- Human ground truth 0.730 2.14% Real recordings
7 MaskGCT 0.717 2.62%
8 F5-TTS 0.670 1.83%

I beat human ground truth by +0.049. My system scores higher speaker similarity than real recordings of the same speakers.

Every system above me on this benchmark either uses RL training specifically to maximize this metric (Seed-TTS, CosyVoice 3) or hasn't released their code (IndexTTS 2.5). I achieve this with pure inference-time engineering on an unmodified Qwen3-TTS model.

Personalized Voice Cloning Results

When given a clean reference recording of a known speaker (not noisy Common Voice data):

Metric Score Context
WavLM SECS (mean) 0.961 Inside same-session human band (0.95-0.99)
WavLM SECS (max) 0.975 Indistinguishable from real recording
DNSMOS P808 3.93 Identical to real speech (3.93)
UTMOS 3.012 Clone scores higher than real (2.997)
WER 0.000 Perfect word accuracy

Three independent quality systems (WavLM, DNSMOS, UTMOS) confirm my clones are indistinguishable from real speech.

Understanding the Benchmark Gap

The SeedTTS-Eval benchmark uses Common Voice recordings - crowdsourced audio from random microphones in untreated rooms. My system captures every nuance of a speaker's voice (mic character, room tone, speaking rhythm). When the reference recording lacks those nuances, the system can't capture what isn't there.

  • Clean reference (my mic): 0.975 SECS - inside same-session human band
  • Common Voice reference: 0.779 SIM - still beats human ground truth by +0.049

The gap isn't the AI. It's the data.

Architecture

Reference Recording (3-15s)
    |
    v
Full ICL Prompt (ref audio + transcript)
    |
    v
Qwen3-TTS-12Hz-1.7B-Base (temp=0.3-0.5, top_p=0.85)
    |
    v
Best-of-N Selection (scored by target encoder)
    |
    v
Optional: Resemble Enhance Denoise (24kHz -> 44.1kHz)
    |
    v
3-Gate Verification (sanity, intelligibility, identity)

Zero model modification. All gains from pipeline engineering.

Key Design Decisions

  1. Full ICL mode (x_vector_only_mode=False): Copies accent, cadence, mic character - not just speaker identity
  2. Score with the benchmark encoder: Candidates scored by the same encoder used for evaluation
  3. Multi-temperature generation: Sweep 0.3/0.4/0.5 for diversity, let the encoder pick the winner
  4. Temperature 0.3-0.5: Lower than default 0.8 for spectral consistency

What the Benchmark Encoder Cares About

Through systematic sensitivity analysis of the official WavLM-Large encoder, I found:

Transform SIM Drop Sensitivity
Speed +-20% 0.62-0.64 CRITICAL - temporal patterns
4-bit quantize 0.56 HIGH - codec quantization
Reverse audio 0.48 HIGH - temporal ordering
Heavy noise 0.23 MODERATE
Volume change 0.0001 INVARIANT
EQ / spectral tilt <0.01 INVARIANT

The encoder is a temporal pattern detector, not a spectral analyzer. This explains why post-processing (EQ, denoising) has zero effect on scores.

Citation

@misc{clipcannon2026voiceclone,
  title={ClipCannon: 0.779 SeedTTS-Eval SIM via Inference-Time Pipeline Engineering},
  author={Chris Royse},
  year={2026},
  url={https://huggingface.co/cabdru/clipcannon-voice-clone}
}
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train cabdru/clipcannon-voice-clone

Space using cabdru/clipcannon-voice-clone 1

Collection including cabdru/clipcannon-voice-clone

Evaluation results

  • Mean SIM (Official WavLM-Large, 1088 samples) on SeedTTS-Eval test-en (Common Voice)
    self-reported
    0.779
  • Max SIM (Official WavLM-Large) on SeedTTS-Eval test-en (Common Voice)
    self-reported
    0.896
  • Median SIM on SeedTTS-Eval test-en (Common Voice)
    self-reported
    0.785
  • Mean SECS (WavLMForXVector, 10 novel sentences) on Speaker-specific reference data
    self-reported
    0.961
  • Max SECS (WavLMForXVector) on Speaker-specific reference data
    self-reported
    0.975
  • WER on Speaker-specific reference data
    self-reported
    0.000
  • DNSMOS P808 Naturalness (identical to real speech) on Speaker-specific reference data
    self-reported
    3.930
  • UTMOS Naturalness (clone scores higher than real) on Speaker-specific reference data
    self-reported
    3.012