---
license: apache-2.0
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - tts
  - audio
  - speech-synthesis
  - voice-cloning
  - autoregressive
  - flow-matching
library_name: dots_tts
---

# dots.tts-base

<p align="left">
  <a href="https://github.com/rednote-hilab/dots.tts"><img src="https://img.shields.io/badge/GitHub-rednote--hilab%2Fdots.tts-blue?logo=github" alt="GitHub"></a>
  <a href="https://huggingface.co/spaces/rednote-hilab/dots.tts"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Spaces-Playground-orange" alt="Playground"></a>
  <a href="https://rednote-hilab.github.io/dots.tts-demo/"><img src="https://img.shields.io/badge/Demo%20Page-Live-red" alt="Demo Page"></a>
</p>

**dots.tts** is a **2B-parameter fully continuous, end-to-end autoregressive (AR) text-to-speech system**. The backbone pairs a semantic encoder, an LLM, and an autoregressive flow-matching acoustic head over a 48 kHz AudioVAE — no discrete codec tokens anywhere in the pipeline.

This repository hosts **`dots.tts-base`**, the **end-to-end pretrained checkpoint** trained on ~1.5M hours of speech. It is the foundation for the two post-trained variants and the recommended starting point for **fine-tuning**.

<table>
  <tr>
    <td align="left" valign="middle"><a href="https://huggingface.co/rednote-hilab/dots.tts-base"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-dots.tts--base-yellow" alt="dots.tts-base"></a></td>
    <td>← <em>you are here</em> — Pretrain (~1.5M h). Fine-tuning, full CFG / NFE control.</td>
  </tr>
  <tr>
    <td align="left" valign="middle"><a href="https://huggingface.co/rednote-hilab/dots.tts-soar"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-dots.tts--soar-yellow" alt="dots.tts-soar"></a></td>
    <td>+ Self-corrective Alignment. Highest zero-shot fidelity and speaker similarity; also recommended for fine-tuning.</td>
  </tr>
  <tr>
    <td align="left" valign="middle"><a href="https://huggingface.co/rednote-hilab/dots.tts-mf"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-dots.tts--mf-yellow" alt="dots.tts-mf"></a></td>
    <td>+ MeanFlow distillation. Few-step inference (NFE = 4), low latency.</td>
  </tr>
</table>

---

## Quick Start

### Installation

```bash
conda create -n dots_tts python=3.10 -y
conda activate dots_tts

python -m pip install --upgrade pip
python -m pip install "git+https://github.com/rednote-hilab/dots.tts.git" \
  -c "https://raw.githubusercontent.com/rednote-hilab/dots.tts/main/constraints/recommended.txt"
```

### CLI

```bash
# Continuation voice cloning (reference audio + transcript) — recommended
dots.tts \
  --model-name-or-path rednote-hilab/dots.tts-base \
  --text "Hello, this is a zero-shot voice cloning demonstration." \
  --prompt-audio /path/to/reference.wav \
  --prompt-text "The exact transcript of the reference audio." \
  --output clone.wav
```

### Python API

```python
from dots_tts.runtime import DotsTtsRuntime
import soundfile as sf

runtime = DotsTtsRuntime.from_pretrained(
    "rednote-hilab/dots.tts-base",
    precision="bfloat16",
)

result = runtime.generate(
    text="Hello, this is a quick speech synthesis test.",
    prompt_audio_path="/path/to/reference.wav",
    prompt_text="The exact transcript of the reference audio.",
    num_steps=10,
    guidance_scale=1.2,
)

sf.write("output.wav", result["audio"].float().cpu().squeeze().numpy(), result["sample_rate"])
```

### Recommended sampling settings

| Flag | Recommended | Notes |
|---|---:|---|
| `--num-steps` | `10` to `32` | Flow-matching sampling steps; higher = better quality, slower |
| `--guidance-scale` | `1.2` (default) | Standard CFG; raise modestly for stronger text/timbre adherence |
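
To choose a step count for a given latency budget, one option is to time the same request at a few values in that range. The helper below is a minimal sketch, not part of the library: `sweep_num_steps` is a hypothetical name, and it assumes only that `runtime.generate` accepts the keyword arguments shown in the Python API example and returns a dict with `audio` and `sample_rate`.

```python
import time


def sweep_num_steps(runtime, text, prompt_audio_path, prompt_text,
                    steps=(10, 16, 32), guidance_scale=1.2):
    """Time one generate() call per step count (hypothetical helper).

    `runtime` is expected to behave like the DotsTtsRuntime object from the
    Python API example. Nothing else about the library is assumed.
    """
    rows = []
    for n in steps:
        start = time.perf_counter()
        result = runtime.generate(
            text=text,
            prompt_audio_path=prompt_audio_path,
            prompt_text=prompt_text,
            num_steps=n,
            guidance_scale=guidance_scale,
        )
        elapsed = time.perf_counter() - start
        # Keep the result so each setting can be written to disk and compared by ear.
        rows.append({"num_steps": n, "seconds": elapsed, "result": result})
    return rows
```

Each `row["result"]` can be saved with the same `sf.write(...)` call as in the Python API example, so the timings can be compared against listening quality.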

### Fine-tuning

`dots.tts-base` is the recommended starting point for fine-tuning. See the [training script](https://github.com/rednote-hilab/dots.tts/blob/main/scripts/train_dots_tts.py) and [smoke config](https://github.com/rednote-hilab/dots.tts/blob/main/configs/dots_tts.yaml) in the source repository:

```bash
accelerate launch scripts/train_dots_tts.py --config configs/dots_tts.yaml
```

---

## Architecture

A frozen **AudioVAE** encodes a 48 kHz mono waveform into a continuous latent and decodes it back via a BigVGAN-style causal decoder. An **autoregressive backbone** predicts that latent one patch at a time:

- **Semantic encoder** — re-encodes each newly generated VAE patch into a compact embedding for the LLM, stripping high-variance acoustic detail.
- **LLM** — initialized from **Qwen2.5-1.5B-Base**, consumes BPE text directly (no phonemes), emits one hidden state per audio step.
- **AR flow-matching head** — a DiT that conditions on the LLM hidden state and the AR prefix to denoise the next VAE patch, with a frozen CAM++ speaker x-vector as side input.
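
The data flow described above can be sketched as a toy loop in plain Python. Everything in it is a stand-in chosen for illustration: the function names, the 4-dimensional patch, the scalar "speaker" value, the straight-line velocity field, the Euler integrator, and the textbook classifier-free guidance formula are all assumptions, not the actual dots.tts implementation. The sketch shows only the order of operations: LLM hidden state, then patch sampling, then re-encoding for the next step.

```python
import random

PATCH_DIM = 4  # toy size; the real latent patch shape is not stated here


def semantic_encode(patch):
    # Stand-in for the semantic encoder: compress a patch to one number.
    return sum(patch) / len(patch)


def llm_hidden(text_ids, semantic_history):
    # Stand-in for the LLM: one hidden state per audio step, computed from
    # the text and the semantic embeddings of all previous patches.
    context = list(text_ids) + semantic_history
    return sum(context) / len(context)


def velocity(x, t, target):
    # Stand-in for the DiT: velocity of a straight path from x to target.
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_patch(hidden, prefix, speaker, num_steps, guidance_scale, rng):
    # Toy conditioning: the unconditional target ignores the LLM hidden state.
    prefix_mean = sum(prefix[-1]) / PATCH_DIM if prefix else 0.0
    cond = [hidden + prefix_mean + speaker] * PATCH_DIM
    uncond = [prefix_mean + speaker] * PATCH_DIM
    x = [rng.gauss(0.0, 1.0) for _ in range(PATCH_DIM)]  # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v_c = velocity(x, t, cond)
        v_u = velocity(x, t, uncond)
        # Textbook classifier-free guidance: extrapolate from uncond to cond.
        v = [u + guidance_scale * (c - u) for c, u in zip(v_c, v_u)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step
    return x


def generate_latents(text_ids, speaker, num_patches, num_steps=10,
                     guidance_scale=1.2, seed=0):
    rng = random.Random(seed)
    patches, semantic_history = [], []
    for _ in range(num_patches):
        hidden = llm_hidden(text_ids, semantic_history)
        patch = sample_patch(hidden, patches, speaker, num_steps,
                             guidance_scale, rng)
        patches.append(patch)
        # Re-encode the new patch so the LLM sees it at the next step.
        semantic_history.append(semantic_encode(patch))
    return patches  # a frozen VAE decoder would turn these into audio
```

In this toy setup, `num_steps` sets how many Euler steps are spent on each patch and `guidance_scale` sets how far the sample is pushed from the unconditional target toward the conditional one. These are the same two knobs exposed by `runtime.generate`, although the real sampler may differ in its details.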

---

## Performance — `dots.tts-base`

### Seed-TTS-Eval (zero-shot, ~3 s reference)

| Model | Params | test-en WER↓ / SIM↑ | test-zh WER↓ / SIM↑ | test-zh-hard WER↓ / SIM↑ | **Avg WER↓ / SIM↑** |
|---|---:|:---:|:---:|:---:|:---:|
| Seed-TTS | — | 2.25 / 76.2 | 1.12 / 79.6 | 7.59 / 77.6 | 3.65 / 77.8 |
| Qwen3-TTS | 1.7B | **1.23** / 71.7 | 1.22 / 77.0 | 6.76 / 74.8 | 3.07 / 74.5 |
| VoxCPM 2 | 2B | 1.84 / 75.3 | 0.97 / 79.5 | 8.13 / 75.3 | 3.65 / 76.7 |
| **dots.tts-base** | **2B** | 1.34 / **76.8** | **0.96** / **80.5** | **6.46** / **79.2** | **2.92** / **78.8** |
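
The Avg column is consistent with an unweighted mean over the three subsets. The short check below recomputes it from the per-subset numbers in the table; `macro_average` is an illustrative helper name, and the rounding (two decimals for WER, one for SIM) is inferred from how the table is printed.

```python
# Per-subset (WER, SIM) pairs copied from the table above:
# test-en, test-zh, test-zh-hard.
SEED_TTS_EVAL = {
    "Seed-TTS": [(2.25, 76.2), (1.12, 79.6), (7.59, 77.6)],
    "Qwen3-TTS": [(1.23, 71.7), (1.22, 77.0), (6.76, 74.8)],
    "VoxCPM 2": [(1.84, 75.3), (0.97, 79.5), (8.13, 75.3)],
    "dots.tts-base": [(1.34, 76.8), (0.96, 80.5), (6.46, 79.2)],
}


def macro_average(pairs):
    """Unweighted mean over subsets, rounded as in the table."""
    wer = sum(w for w, _ in pairs) / len(pairs)
    sim = sum(s for _, s in pairs) / len(pairs)
    return round(wer, 2), round(sim, 1)


for model, pairs in SEED_TTS_EVAL.items():
    wer, sim = macro_average(pairs)
    print(f"{model}: {wer:.2f} / {sim:.1f}")
```

For `dots.tts-base`, for example, (1.34 + 0.96 + 6.46) / 3 = 2.92 and (76.8 + 80.5 + 79.2) / 3 ≈ 78.8, matching the last column.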

### MiniMax Multilingual (24 languages, average)

| Model | Avg WER↓ | Avg SIM↑ |
|---|:---:|:---:|
| MiniMax | **2.8** | 76.6 |
| Fish-Audio S2 | 3.7 | 78.0 |
| VoxCPM 2 | 5.7 | 82.3 |
| **dots.tts-base** | 6.6 | **83.5** |

See the [project README](https://github.com/rednote-hilab/dots.tts#-performance) for the full per-language breakdown, CV3-Eval and EmergentTTS-Eval results.

---

## Risks and Limitations

- **Misuse risk.** High-fidelity zero-shot voice cloning can produce highly realistic synthetic speech. This checkpoint is intended for research and authorized deployment. Do **not** use it for impersonation, fraud, or disinformation. Combine downstream use with consent-aware reference-audio policies, robust synthetic-speech detection, and content watermarking. Clearly mark AI-generated audio.
- **Low-resource WER gap.** A BPE backbone inherits the text LLM's language coverage at the cost of a higher data appetite. On script-divergent and under-represented languages (Arabic, Hindi, Turkish, Vietnamese) WER is higher than on high-resource languages; speaker similarity is preserved.
- **Speech-heavy training.** The backbone is trained on a speech-heavy mixture. Singing and unified speech + sound generation are not covered.

---

## Citation

```bibtex
@article{dotstts2026,
  title   = {dots.tts Technical Report},
  author  = {dots.tts Team},
  journal = {arXiv preprint},
  year    = {2026},
}
```

## License

Released under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0).