---
datasets:
- amphion/Emilia-Dataset
- nvidia/hifitts-2
language:
- en
license: cc-by-4.0
pipeline_tag: text-to-speech
library_name: voxtream
tags:
- text-to-speech
- zero-shot
- streaming
---

# Model Card for VoXtream2

VoXtream2 is a zero-shot, full-stream TTS model with dynamic speaking-rate control: the speaking rate can be updated mid-utterance. It was introduced in the paper [VoXtream2: Full-stream TTS with dynamic speaking rate control](https://huggingface.co/papers/2603.13518).

**Developed by:** Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze

### Key features

- **Dynamic speed control**: Distribution matching and classifier-free guidance enable fine-grained control of the speaking rate, which can be adjusted while the model generates speech.
- **Streaming performance**: Runs **4x** faster than real time and achieves a first-packet latency of **74 ms** in full-stream mode on a consumer GPU.
- **Translingual capability**: Prompt-text masking allows acoustic prompts in any language.
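
The two streaming figures above can be read as a speed-up factor (seconds of audio produced per second of wall-clock time) and a first-packet latency (time from the request until the first audio chunk arrives). The sketch below shows one way to measure both for any chunked audio stream. It is an illustration under assumptions, not the benchmark procedure from the paper: `measure_stream` and its arguments are hypothetical names, and the stream is modelled simply as an iterable that yields the number of samples in each chunk.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunk_sizes: Iterable[int],
    sample_rate: int,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (first_packet_latency_seconds, speed_up_factor).

    chunk_sizes is a hypothetical stand-in for a streaming generator: it
    yields the number of audio samples contained in each generated chunk.
    """
    start = clock()
    first_packet_latency = None
    total_samples = 0

    for num_samples in chunk_sizes:
        if first_packet_latency is None:
            # Time elapsed until the first chunk became available.
            first_packet_latency = clock() - start
        total_samples += num_samples

    elapsed = clock() - start
    if first_packet_latency is None:
        raise ValueError("the stream produced no audio chunks")
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")

    audio_seconds = total_samples / sample_rate
    # A value above 1.0 means generation is faster than real time.
    return first_packet_latency, audio_seconds / elapsed
```

The `clock` parameter exists only so the measurement can be checked with a deterministic clock; in normal use the default `time.perf_counter` is sufficient.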

### Model Sources 

- **Repository:** [https://github.com/herimor/voxtream](https://github.com/herimor/voxtream) 
- **Paper:** [https://huggingface.co/papers/2603.13518](https://huggingface.co/papers/2603.13518) 
- **Demo Page:** [https://herimor.github.io/voxtream2](https://herimor.github.io/voxtream2)
- **Live Demo:** [https://huggingface.co/spaces/herimor/voxtream2](https://huggingface.co/spaces/herimor/voxtream2)

## Get started

### Installation

#### eSpeak NG phonemizer

```bash
# For Debian-like distributions (e.g. Ubuntu, Mint)
apt-get install espeak-ng
# For RedHat-like distributions (e.g. CentOS, Fedora)
yum install espeak-ng
# For macOS
brew install espeak-ng
```

#### Pip package

```bash
pip install "voxtream>=0.2"
```

### Usage

* Prompt audio: a file containing 3-10 seconds of the target voice. The maximum supported length is 20 seconds (longer audio is trimmed).
* Text: the text you want the model to say. The maximum supported length is 1000 characters (longer text is trimmed).
* Speaking rate (optional): the target speaking rate, in syllables per second.
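
Because over-long inputs are trimmed rather than rejected, it can help to check them before synthesis. The following is a minimal sketch that uses only the Python standard library; it is not part of the `voxtream` package. The helper names are hypothetical, the prompt is assumed to be a PCM WAV file, and the 1000-character limit is assumed to count Python string characters.

```python
import wave
from typing import List

MIN_PROMPT_SECONDS = 3.0    # lower end of the 3-10 second range listed above
MAX_GOOD_PROMPT_SECONDS = 10.0
MAX_PROMPT_SECONDS = 20.0   # longer prompt audio is trimmed
MAX_TEXT_CHARS = 1000       # longer text is trimmed


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def trim_text(text: str) -> str:
    """Cut the text to the documented limit (assumed to be per character)."""
    return text[:MAX_TEXT_CHARS]


def check_inputs(prompt_seconds: float, text: str) -> List[str]:
    """Return warnings about the inputs; an empty list means none apply."""
    warnings = []
    if prompt_seconds > MAX_PROMPT_SECONDS:
        warnings.append(
            f"prompt is {prompt_seconds:.1f} s; audio beyond "
            f"{MAX_PROMPT_SECONDS:.0f} s will be trimmed"
        )
    elif not MIN_PROMPT_SECONDS <= prompt_seconds <= MAX_GOOD_PROMPT_SECONDS:
        warnings.append(
            f"prompt is {prompt_seconds:.1f} s; the listed range is 3-10 s"
        )
    if len(text) > MAX_TEXT_CHARS:
        warnings.append(
            f"text has {len(text)} characters; text beyond "
            f"{MAX_TEXT_CHARS} will be trimmed"
        )
    return warnings
```

A typical call would be `check_inputs(wav_duration_seconds("prompt.wav"), text)`, where `prompt.wav` is a placeholder path.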

#### Output streaming
```bash
voxtream \
    --prompt-audio assets/audio/english_male.wav \
    --text "In general, however, some method is then needed to evaluate each approximation." \
    --output "output_stream.wav"
```

#### Full streaming (slow speech, 2 syllables per second)
```bash
voxtream \
    --prompt-audio assets/audio/english_female.wav \
    --text "Staff do not always do enough to prevent violence." \
    --output "full_stream_2sps.wav" \
    --full-stream \
    --spk-rate 2.0
```

* Note: The first run may take some time, because it downloads the model weights and warms up the model graph.
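
To call the command-line tool from a Python script, the same invocations can be assembled as an argument list and passed to `subprocess.run`. The sketch below uses only the flags that appear in the examples above (`--prompt-audio`, `--text`, `--output`, `--full-stream`, `--spk-rate`). The function names are hypothetical, and the sketch makes no assumption about the package's Python API.

```python
import subprocess
from typing import List, Optional


def build_voxtream_command(
    prompt_audio: str,
    text: str,
    output: str,
    full_stream: bool = False,
    spk_rate: Optional[float] = None,
) -> List[str]:
    """Assemble the argument list for the `voxtream` CLI."""
    command = [
        "voxtream",
        "--prompt-audio", prompt_audio,
        "--text", text,
        "--output", output,
    ]
    if full_stream:
        command.append("--full-stream")
    if spk_rate is not None:
        # Target speaking rate in syllables per second.
        command += ["--spk-rate", str(spk_rate)]
    return command


def run_voxtream(**kwargs) -> None:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_voxtream_command(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.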

### Out-of-Scope Use

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this item may put you in violation of copyright laws.

## Training Data

The model was trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets. You can download the preprocessed dataset [here](https://huggingface.co/datasets/herimor/voxtream2-train). For more details, please see our paper.

## Citation 

```bibtex
@inproceedings{torgashov2026voxtream,
  title={Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  author={Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  booktitle={Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026},
  note={to appear},
  url={https://arxiv.org/abs/2509.15969}
}

@article{torgashov2026voxtream2,
  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title     = {Vo{X}tream2: Full-stream TTS with dynamic speaking rate control},
  journal   = {arXiv:2603.13518},
  year      = {2026}
}
```