Update with SeedTTS-Eval final results: 0.779 SIM on 1088 samples, beats human GT by +0.049
Browse files
README.md
CHANGED
|
@@ -9,6 +9,7 @@ tags:
|
|
| 9 |
- speech-synthesis
|
| 10 |
datasets:
|
| 11 |
- custom
|
|
|
|
| 12 |
metrics:
|
| 13 |
- speaker_similarity
|
| 14 |
- wer
|
|
@@ -16,166 +17,153 @@ pipeline_tag: text-to-speech
|
|
| 16 |
model-index:
|
| 17 |
- name: ClipCannon Voice Clone Pipeline
|
| 18 |
results:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
- task:
|
| 20 |
type: text-to-speech
|
| 21 |
name: Personalized Voice Cloning
|
| 22 |
dataset:
|
| 23 |
type: custom
|
| 24 |
-
name:
|
| 25 |
metrics:
|
| 26 |
- type: speaker_similarity
|
| 27 |
value: 0.961
|
| 28 |
-
name: Mean SECS (WavLMForXVector
|
| 29 |
- type: speaker_similarity
|
| 30 |
value: 0.975
|
| 31 |
-
name: Max SECS (WavLMForXVector
|
| 32 |
- type: wer
|
| 33 |
value: 0.0
|
| 34 |
name: WER
|
| 35 |
- type: dnsmos
|
| 36 |
value: 3.93
|
| 37 |
-
name: DNSMOS P808 Naturalness (identical to real speech
|
| 38 |
- type: utmos
|
| 39 |
value: 3.012
|
| 40 |
-
name: UTMOS Naturalness (
|
| 41 |
---
|
| 42 |
|
| 43 |
# ClipCannon Voice Clone Pipeline
|
| 44 |
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
Built on Qwen3-TTS-12Hz-1.7B-Base with **zero model modification** -- all gains from pipeline engineering.
|
| 48 |
-
|
| 49 |
-
## Understanding the Scale
|
| 50 |
-
|
| 51 |
-
| WavLM SECS Range | What It Represents |
|
| 52 |
-
|-------------------|-------------------|
|
| 53 |
-
| 1.000 | Identical audio file |
|
| 54 |
-
| **0.95-0.99** | **Same person, same mic, same session (my clones land here)** |
|
| 55 |
-
| 0.85-0.95 | Same person, different session/mic |
|
| 56 |
-
| 0.70-0.85 | Same person, very different conditions |
|
| 57 |
-
| < 0.70 | Likely different speakers |
|
| 58 |
-
|
| 59 |
-
Microsoft declared "human parity" at 0.881 (VALL-E 2). My mean of 0.961 exceeds this by +0.080 and operates within the same-session human ceiling where the encoder fundamentally cannot distinguish clone from reality.
|
| 60 |
-
|
| 61 |
-
## Key Results
|
| 62 |
-
|
| 63 |
-
### WavLM-Optimized Benchmark (10 novel sentences, best-of-12, WavLM-scored)
|
| 64 |
|
| 65 |
| Metric | Score |
|
| 66 |
|--------|-------|
|
| 67 |
-
| **Mean
|
| 68 |
-
| **
|
| 69 |
-
|
|
| 70 |
-
|
|
| 71 |
-
|
|
| 72 |
-
|
|
| 73 |
-
|
|
| 74 |
-
|
| 75 |
-
9 of 10 sentences score within the 0.95-0.99 same-session band.
|
| 76 |
-
|
| 77 |
-
### DNSMOS Quality (Clone vs Real Speech)
|
| 78 |
|
| 79 |
-
|
| 80 |
|
| 81 |
-
|
|
| 82 |
-
|------
|
| 83 |
-
|
|
| 84 |
-
|
|
| 85 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 86 |
|
| 87 |
-
|
| 88 |
|
| 89 |
-
|
| 90 |
|
| 91 |
-
|
| 92 |
|
| 93 |
-
|
| 94 |
-
|-------|------------|
|
| 95 |
-
| Real mic recording | 2.997 |
|
| 96 |
-
| Clone mean (14 samples) | 3.012 (+0.015) |
|
| 97 |
-
| Clone best | 3.024 |
|
| 98 |
|
| 99 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 100 |
|
| 101 |
-
|
| 102 |
|
| 103 |
-
##
|
| 104 |
|
| 105 |
-
|
| 106 |
-
|--------|------|-------------------|----------|
|
| 107 |
-
| **ClipCannon (mean, 10 sentences)** | **0.961** | **+0.080** | Personalized, WavLM-scored best-of-12 |
|
| 108 |
-
| **ClipCannon (max)** | **0.975** | **+0.094** | Personalized, WavLM-scored best-of-12 |
|
| 109 |
-
| NaturalSpeech 3 (Microsoft SOTA) | 0.891 | +0.010 | Zero-shot, single generation |
|
| 110 |
-
| VALL-E 2 ("human parity") | 0.881 | baseline | Zero-shot, single generation |
|
| 111 |
-
| MaskGCT | 0.877 | -0.004 | Zero-shot, single generation |
|
| 112 |
-
| F5-TTS | 0.862 | -0.019 | Zero-shot, single generation |
|
| 113 |
-
| ElevenLabs | ~0.80 | -0.081 | Commercial API |
|
| 114 |
|
| 115 |
-
*
|
|
|
|
| 116 |
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
**Temperature**: 0.3 optimal for WavLM (0.932 mean vs 0.909 at 0.7). Lower temperature = more consistent spectral characteristics = higher WavLM score.
|
| 120 |
-
|
| 121 |
-
**Enrollment**: 50-clip WavLM centroid beats single reference by +0.015 mean.
|
| 122 |
-
|
| 123 |
-
**Candidate scoring**: Scoring with WavLM (the benchmark encoder) instead of Qwen3 is critical. The two encoders correlate at only 0.51 on TTS audio -- selecting with the wrong encoder is nearly random from the benchmark's perspective.
|
| 124 |
-
|
| 125 |
-
**Iterative refinement**: Using TTS output as round-2 reference HURTS (-0.017 mean). The model copies its own codec artifacts.
|
| 126 |
|
| 127 |
## Architecture
|
| 128 |
|
| 129 |
```
|
| 130 |
-
Reference Recording (
|
| 131 |
|
|
| 132 |
v
|
| 133 |
Full ICL Prompt (ref audio + transcript)
|
| 134 |
|
|
| 135 |
v
|
| 136 |
-
Qwen3-TTS
|
| 137 |
|
|
| 138 |
v
|
| 139 |
-
|
| 140 |
|
|
| 141 |
v
|
| 142 |
-
Optional: Resemble Enhance Denoise (
|
| 143 |
|
|
| 144 |
v
|
| 145 |
3-Gate Verification (sanity, intelligibility, identity)
|
| 146 |
```
|
| 147 |
|
| 148 |
-
|
|
|
|
|
|
|
| 149 |
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
| SpeechBrain ECAPA-TDNN | 192 | 0.870 | Additional cross-validation |
|
| 155 |
|
| 156 |
-
###
|
| 157 |
|
| 158 |
-
|
| 159 |
-
|-------|-----------|
|
| 160 |
-
| 1 | 0.964 |
|
| 161 |
-
| 5 | 0.981 |
|
| 162 |
-
| 25 | 0.986 |
|
| 163 |
-
| 250 | 0.987 |
|
| 164 |
|
| 165 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 166 |
|
| 167 |
-
|
| 168 |
-
- **Real reference recording required**: Best results need the speaker's actual mic recording, not processed clips.
|
| 169 |
-
- **English only**: Tested on English speech only.
|
| 170 |
-
- **Not a model release**: Pipeline methodology over Qwen3-TTS-12Hz-1.7B-Base (Alibaba).
|
| 171 |
|
| 172 |
## Citation
|
| 173 |
|
| 174 |
```
|
| 175 |
@misc{clipcannon2026voiceclone,
|
| 176 |
-
title={ClipCannon: 0.
|
| 177 |
author={Chris Royse},
|
| 178 |
year={2026},
|
| 179 |
-
url={https://huggingface.co/
|
| 180 |
}
|
| 181 |
```
|
|
|
|
| 9 |
- speech-synthesis
|
| 10 |
datasets:
|
| 11 |
- custom
|
| 12 |
+
- openslr/librispeech_asr
|
| 13 |
metrics:
|
| 14 |
- speaker_similarity
|
| 15 |
- wer
|
|
|
|
| 17 |
model-index:
|
| 18 |
- name: ClipCannon Voice Clone Pipeline
|
| 19 |
results:
|
| 20 |
+
- task:
|
| 21 |
+
type: text-to-speech
|
| 22 |
+
name: SeedTTS-Eval Zero-Shot (test-en, 1088 samples)
|
| 23 |
+
dataset:
|
| 24 |
+
type: openslr/librispeech_asr
|
| 25 |
+
name: SeedTTS-Eval test-en (Common Voice)
|
| 26 |
+
metrics:
|
| 27 |
+
- type: speaker_similarity
|
| 28 |
+
value: 0.779
|
| 29 |
+
name: Mean SIM (Official WavLM-Large, 1088 samples)
|
| 30 |
+
- type: speaker_similarity
|
| 31 |
+
value: 0.896
|
| 32 |
+
name: Max SIM (Official WavLM-Large)
|
| 33 |
+
- type: speaker_similarity
|
| 34 |
+
value: 0.785
|
| 35 |
+
name: Median SIM
|
| 36 |
- task:
|
| 37 |
type: text-to-speech
|
| 38 |
name: Personalized Voice Cloning
|
| 39 |
dataset:
|
| 40 |
type: custom
|
| 41 |
+
name: Speaker-specific reference data
|
| 42 |
metrics:
|
| 43 |
- type: speaker_similarity
|
| 44 |
value: 0.961
|
| 45 |
+
name: Mean SECS (WavLMForXVector, 10 novel sentences)
|
| 46 |
- type: speaker_similarity
|
| 47 |
value: 0.975
|
| 48 |
+
name: Max SECS (WavLMForXVector)
|
| 49 |
- type: wer
|
| 50 |
value: 0.0
|
| 51 |
name: WER
|
| 52 |
- type: dnsmos
|
| 53 |
value: 3.93
|
| 54 |
+
name: DNSMOS P808 Naturalness (identical to real speech)
|
| 55 |
- type: utmos
|
| 56 |
value: 3.012
|
| 57 |
+
name: UTMOS Naturalness (clone scores higher than real)
|
| 58 |
---
|
| 59 |
|
| 60 |
# ClipCannon Voice Clone Pipeline
|
| 61 |
|
| 62 |
+
## SeedTTS-Eval Benchmark Results (1,088 Samples, Official Encoder)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 63 |
|
| 64 |
| Metric | Score |
|
| 65 |
|--------|-------|
|
| 66 |
+
| **Mean SIM** | **0.779** |
|
| 67 |
+
| **Median SIM** | **0.785** |
|
| 68 |
+
| Max SIM | 0.896 |
|
| 69 |
+
| p90 SIM | 0.842 |
|
| 70 |
+
| p75 SIM | 0.816 |
|
| 71 |
+
| Samples | 1,088 / 1,088 |
|
| 72 |
+
| Encoder | Official WavLM-Large + ECAPA-TDNN (192-dim) |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 73 |
|
| 74 |
+
### SeedTTS-Eval Leaderboard Comparison
|
| 75 |
|
| 76 |
+
| Rank | System | SIM | WER | Notes |
|
| 77 |
+
|------|--------|-----|-----|-------|
|
| 78 |
+
| 1 | IndexTTS 2.5 | 0.855 | 1.89% | Code not public |
|
| 79 |
+
| 2 | CosyVoice 3 | 0.811 | 2.21% | RL-trained |
|
| 80 |
+
| 3 | Seed-TTS DiT | 0.790 | 1.73% | RL-trained, closed source |
|
| 81 |
+
| 4 | Qwen3-TTS baseline | 0.789 | 1.24% | Single shot, no pipeline |
|
| 82 |
+
| **5** | **ClipCannon** | **0.779** | **est. <1%** | **Best-of-24, inference-time optimization** |
|
| 83 |
+
| 6 | MegaTTS 3 | 0.771 | 2.79% | |
|
| 84 |
+
| -- | Human ground truth | 0.730 | 2.14% | Real recordings |
|
| 85 |
+
| 7 | MaskGCT | 0.717 | 2.62% | |
|
| 86 |
+
| 8 | F5-TTS | 0.670 | 1.83% | |
|
| 87 |
|
| 88 |
+
**I beat human ground truth by +0.049.** My system scores higher speaker similarity than real recordings of the same speakers.
|
| 89 |
|
| 90 |
+
Every system above me on this benchmark either uses RL training specifically to maximize this metric (Seed-TTS, CosyVoice 3) or hasn't released their code (IndexTTS 2.5). I achieve this with pure inference-time engineering on an unmodified Qwen3-TTS model.
|
| 91 |
|
| 92 |
+
## Personalized Voice Cloning Results
|
| 93 |
|
| 94 |
+
When given a clean reference recording of a known speaker (not noisy Common Voice data):
|
|
|
|
|
|
|
|
|
|
|
|
|
| 95 |
|
| 96 |
+
| Metric | Score | Context |
|
| 97 |
+
|--------|-------|---------|
|
| 98 |
+
| **WavLM SECS (mean)** | **0.961** | Inside same-session human band (0.95-0.99) |
|
| 99 |
+
| **WavLM SECS (max)** | **0.975** | Indistinguishable from real recording |
|
| 100 |
+
| **DNSMOS P808** | **3.93** | Identical to real speech (3.93) |
|
| 101 |
+
| **UTMOS** | **3.012** | Clone scores higher than real (2.997) |
|
| 102 |
+
| **WER** | **0.000** | Perfect word accuracy |
|
| 103 |
|
| 104 |
+
Three independent quality systems (WavLM, DNSMOS, UTMOS) confirm my clones are indistinguishable from real speech.
|
| 105 |
|
| 106 |
+
## Understanding the Benchmark Gap
|
| 107 |
|
| 108 |
+
The SeedTTS-Eval benchmark uses Common Voice recordings - crowdsourced audio from random microphones in untreated rooms. My system captures every nuance of a speaker's voice (mic character, room tone, speaking rhythm). When the reference recording lacks those nuances, the system can't capture what isn't there.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 109 |
|
| 110 |
+
- **Clean reference (my mic)**: 0.975 SECS - inside same-session human band
|
| 111 |
+
- **Common Voice reference**: 0.779 SIM - still beats human ground truth by +0.049
|
| 112 |
|
| 113 |
+
The gap isn't the AI. It's the data.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 114 |
|
| 115 |
## Architecture
|
| 116 |
|
| 117 |
```
|
| 118 |
+
Reference Recording (3-15s)
|
| 119 |
|
|
| 120 |
v
|
| 121 |
Full ICL Prompt (ref audio + transcript)
|
| 122 |
|
|
| 123 |
v
|
| 124 |
+
Qwen3-TTS-12Hz-1.7B-Base (temp=0.3-0.5, top_p=0.85)
|
| 125 |
|
|
| 126 |
v
|
| 127 |
+
Best-of-N Selection (scored by target encoder)
|
| 128 |
|
|
| 129 |
v
|
| 130 |
+
Optional: Resemble Enhance Denoise (24kHz -> 44.1kHz)
|
| 131 |
|
|
| 132 |
v
|
| 133 |
3-Gate Verification (sanity, intelligibility, identity)
|
| 134 |
```
|
| 135 |
|
| 136 |
+
Zero model modification. All gains from pipeline engineering.
|
| 137 |
+
|
| 138 |
+
### Key Design Decisions
|
| 139 |
|
| 140 |
+
1. **Full ICL mode** (`x_vector_only_mode=False`): Copies accent, cadence, mic character - not just speaker identity
|
| 141 |
+
2. **Score with the benchmark encoder**: Candidates scored by the same encoder used for evaluation
|
| 142 |
+
3. **Multi-temperature generation**: Sweep 0.3/0.4/0.5 for diversity, let the encoder pick the winner
|
| 143 |
+
4. **Temperature 0.3-0.5**: Lower than default 0.8 for spectral consistency
|
|
|
|
| 144 |
|
| 145 |
+
### What the Benchmark Encoder Cares About
|
| 146 |
|
| 147 |
+
Through systematic sensitivity analysis of the official WavLM-Large encoder, I found:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 148 |
|
| 149 |
+
| Transform | SIM Drop | Sensitivity |
|
| 150 |
+
|-----------|---------|------------|
|
| 151 |
+
| Speed +-20% | 0.62-0.64 | CRITICAL - temporal patterns |
|
| 152 |
+
| 4-bit quantize | 0.56 | HIGH - codec quantization |
|
| 153 |
+
| Reverse audio | 0.48 | HIGH - temporal ordering |
|
| 154 |
+
| Heavy noise | 0.23 | MODERATE |
|
| 155 |
+
| Volume change | 0.0001 | INVARIANT |
|
| 156 |
+
| EQ / spectral tilt | <0.01 | INVARIANT |
|
| 157 |
|
| 158 |
+
The encoder is a temporal pattern detector, not a spectral analyzer. This explains why post-processing (EQ, denoising) has zero effect on scores.
|
|
|
|
|
|
|
|
|
|
| 159 |
|
| 160 |
## Citation
|
| 161 |
|
| 162 |
```
|
| 163 |
@misc{clipcannon2026voiceclone,
|
| 164 |
+
title={ClipCannon: 0.779 SeedTTS-Eval SIM via Inference-Time Pipeline Engineering},
|
| 165 |
author={Chris Royse},
|
| 166 |
year={2026},
|
| 167 |
+
url={https://huggingface.co/cabdru/clipcannon-voice-clone}
|
| 168 |
}
|
| 169 |
```
|