Tags: Text-to-Speech · Safetensors · English · voxtream · zero-shot · streaming
Commit b3e384a (verified) by herimor · 1 Parent(s): daa413d

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,103 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ license: cc-by-4.0
+ datasets:
+ - amphion/Emilia-Dataset
+ - nvidia/hifitts-2
+ language:
+ - en
+ pipeline_tag: text-to-speech
+ tags:
+ - text-to-speech
+ ---
+
+ # Model Card for VoXtream2
+
+ VoXtream2 is a zero-shot, full-stream TTS model with dynamic speaking-rate control: the target rate can be updated mid-utterance.
+
+ ### Key features
+
+ - **Dynamic speed control**: Distribution matching and classifier-free guidance allow fine-grained speaking-rate control, which can be adjusted as the model generates speech.
+ - **Streaming performance**: Runs **4x** faster than real time and achieves **74 ms** first-packet latency in full-stream mode on a consumer GPU.
+ - **Translingual capability**: Prompt text masking enables the use of acoustic prompts in any language.
+
+ ### Model Sources
+
+ - **Repository:** [repo](https://github.com/herimor/voxtream)
+ - **Paper:** [paper](https://arxiv.org/pdf/2603.13518)
+ - **Demo Page:** [demo page](https://herimor.github.io/voxtream2)
+ - **Live Demo:** [live demo](https://huggingface.co/spaces/herimor/voxtream2)
+
+ ## Get started
+
+ ### Installation
+
+ #### eSpeak NG phonemizer
+
+ ```bash
+ # For Debian-like distributions (e.g. Ubuntu, Mint)
+ apt-get install espeak-ng
+ # For RedHat-like distributions (e.g. CentOS, Fedora)
+ yum install espeak-ng
+ # For macOS
+ brew install espeak-ng
+ ```
+
+ #### Pip package
+
+ ```bash
+ pip install "voxtream>=0.2"
+ ```
+
+ ### Usage
+
+ * Prompt audio: a file containing 3-10 seconds of the target voice. The maximum supported length is 20 seconds (longer audio will be trimmed).
+ * Text: what you want the model to say. The maximum supported length is 1000 characters (longer text will be trimmed).
+ * Speaking rate (optional): the target speaking rate in syllables per second.
+
+ #### Output streaming
+ ```bash
+ voxtream \
+ --prompt-audio assets/audio/english_male.wav \
+ --text "In general, however, some method is then needed to evaluate each approximation." \
+ --output "output_stream.wav"
+ ```
+
+ #### Full streaming (slow speech, 2 syllables per second)
+ ```bash
+ voxtream \
+ --prompt-audio assets/audio/english_female.wav \
+ --text "Staff do not always do enough to prevent violence." \
+ --output "full_stream_2sps.wav" \
+ --full-stream \
+ --spk-rate 2.0
+ ```
+
+ * Note: The initial run may take some time because it downloads the model weights and warms up the model graph.
+
+ ## Out-of-Scope Use
+
+ Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this item could put you in violation of copyright laws.
+
+ ## Training Data
+
+ The model was trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets. The preprocessed dataset can be downloaded [here](https://huggingface.co/datasets/herimor/voxtream2-train). For more details, please see our paper.
+
+ ## Citation
+
+ ```
+ @inproceedings{torgashov2026voxtream,
+ title={Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
+ author={Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
+ booktitle={Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+ year={2026},
+ note={to appear},
+ url={https://arxiv.org/abs/2509.15969}
+ }
+
+ @article{torgashov2026voxtream2,
+ author = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
+ title = {Vo{X}tream2: Full-stream TTS with dynamic speaking rate control},
+ journal = {arXiv:2603.13518},
+ year = {2026}
+ }
+ ```
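The two CLI invocations in the README's Usage section can also be assembled from Python. The sketch below is a minimal, hypothetical helper (`build_voxtream_command` is an illustrative name, not part of the `voxtream` package). It uses only the flags that appear in the README (`--prompt-audio`, `--text`, `--output`, `--full-stream`, `--spk-rate`) and only builds the argument list; actually running it assumes the `voxtream` CLI is installed and on `PATH`.

```python
import shlex
from typing import List, Optional

MAX_TEXT_CHARS = 1000  # README: longer text will be trimmed


def build_voxtream_command(
    prompt_audio: str,
    text: str,
    output: str,
    full_stream: bool = False,
    spk_rate: Optional[float] = None,
) -> List[str]:
    """Hypothetical helper: assemble a `voxtream` CLI call from the README flags."""
    if len(text) > MAX_TEXT_CHARS:
        # Trim up front so the caller sees exactly which text is passed on.
        text = text[:MAX_TEXT_CHARS]
    cmd = ["voxtream", "--prompt-audio", prompt_audio, "--text", text, "--output", output]
    if full_stream:
        cmd.append("--full-stream")
    if spk_rate is not None:
        if spk_rate <= 0:
            raise ValueError("speaking rate must be positive (syllables per second)")
        cmd += ["--spk-rate", str(spk_rate)]
    return cmd


cmd = build_voxtream_command(
    "assets/audio/english_female.wav",
    "Staff do not always do enough to prevent violence.",
    "full_stream_2sps.wav",
    full_stream=True,
    spk_rate=2.0,
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
# To execute instead of print: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell-quoting problems when the text contains quotes or commas.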
config.json ADDED
@@ -0,0 +1,16 @@
+ {
+ "phone_former": "phone_former",
+ "temp_former": "temp_former",
+ "dep_former": "dep_former_csm",
+ "phone_vocab_size": 125,
+ "audio_vocab_size": 2050,
+ "audio_pad_size": 0,
+ "embedding_dim": 1024,
+ "spk_embedding_dim": 192,
+ "num_codebooks": 16,
+ "num_phone_states": 6,
+ "amortization_divisor": 16,
+ "max_look_ahead": 5,
+ "audio_window_size": 625,
+ "phone_window_size": 625
+ }
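One way to consume `config.json` defensively is to parse it into a typed structure and fail early on missing or unexpected keys. The sketch below assumes nothing about how the `voxtream` package itself reads the file: `VoxtreamConfig` and `load_config` are hypothetical names, and the field list is copied from the keys shown above.

```python
import json
from dataclasses import dataclass, fields

# Contents of config.json as added in this commit.
CONFIG_JSON = """
{
  "phone_former": "phone_former",
  "temp_former": "temp_former",
  "dep_former": "dep_former_csm",
  "phone_vocab_size": 125,
  "audio_vocab_size": 2050,
  "audio_pad_size": 0,
  "embedding_dim": 1024,
  "spk_embedding_dim": 192,
  "num_codebooks": 16,
  "num_phone_states": 6,
  "amortization_divisor": 16,
  "max_look_ahead": 5,
  "audio_window_size": 625,
  "phone_window_size": 625
}
"""


@dataclass(frozen=True)
class VoxtreamConfig:
    """Hypothetical typed view of config.json (not the package's own class)."""
    phone_former: str
    temp_former: str
    dep_former: str
    phone_vocab_size: int
    audio_vocab_size: int
    audio_pad_size: int
    embedding_dim: int
    spk_embedding_dim: int
    num_codebooks: int
    num_phone_states: int
    amortization_divisor: int
    max_look_ahead: int
    audio_window_size: int
    phone_window_size: int


def load_config(text: str) -> VoxtreamConfig:
    raw = json.loads(text)
    expected = {f.name for f in fields(VoxtreamConfig)}
    missing = expected - set(raw)
    unknown = set(raw) - expected
    if missing or unknown:
        raise ValueError(f"missing keys: {sorted(missing)}, unknown keys: {sorted(unknown)}")
    return VoxtreamConfig(**raw)


cfg = load_config(CONFIG_JSON)
print(cfg.dep_former, cfg.num_codebooks, cfg.embedding_dim)
```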
dep_former_csm.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a3e80e66f39cb010de18763721eaa9523f07827ccf21dd7b8a1486d2abc4bc89
+ size 704938152
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b0761a350f9908227dcdce4556328a5896d1bab9d609939869ea941f206febb5
+ size 1851507776
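The two `.safetensors` entries above are Git LFS pointer files: each records the SHA-256 digest (`oid`) and byte size of the real weight file. A downloaded file can be checked against its pointer as sketched below. The helper names (`parse_lfs_pointer`, `matches_pointer`) are illustrative, and the local path in the final comment is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path
from typing import NamedTuple


class LfsPointer(NamedTuple):
    oid: str   # hex SHA-256 digest of the real file
    size: int  # size of the real file in bytes


def parse_lfs_pointer(text: str) -> LfsPointer:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    entries = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = entries["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return LfsPointer(oid=digest, size=int(entries["size"]))


def matches_pointer(path: Path, pointer: LfsPointer, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the pointer's size and digest."""
    if path.stat().st_size != pointer.size:
        return False  # cheap check first; skips hashing a truncated download
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == pointer.oid


MODEL_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b0761a350f9908227dcdce4556328a5896d1bab9d609939869ea941f206febb5
size 1851507776
"""

pointer = parse_lfs_pointer(MODEL_POINTER)
print(pointer.size)
# After downloading: matches_pointer(Path("model.safetensors"), pointer)
```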
phoneme_to_token.json ADDED
@@ -0,0 +1,125 @@
+ {
+ "a\u026a": 0,
+ "a\u026a\u0259": 1,
+ "a\u026a\u025a": 2,
+ "a\u028a": 3,
+ "b": 4,
+ "d": 5,
+ "d\u0292": 6,
+ "e\u026a": 7,
+ "f": 8,
+ "h": 9,
+ "i": 10,
+ "i\u0259": 11,
+ "i\u02d0": 12,
+ "j": 13,
+ "k": 14,
+ "l": 15,
+ "m": 16,
+ "n": 17,
+ "n\u0329": 18,
+ "o\u028a": 19,
+ "o\u02d0": 20,
+ "o\u02d0\u0279": 21,
+ "p": 22,
+ "r": 23,
+ "s": 24,
+ "t": 25,
+ "t\u0283": 26,
+ "u\u02d0": 27,
+ "v": 28,
+ "w": 29,
+ "x": 30,
+ "z": 31,
+ "\u00e6": 32,
+ "\u00f0": 33,
+ "\u014b": 34,
+ "\u0250": 35,
+ "\u0251\u02d0": 36,
+ "\u0251\u02d0\u0279": 37,
+ "\u0254": 38,
+ "\u0254\u026a": 39,
+ "\u0254\u02d0": 40,
+ "\u0254\u02d0\u0279": 41,
+ "\u0259": 42,
+ "\u0259l": 43,
+ "\u025a": 44,
+ "\u025b": 45,
+ "\u025b\u0279": 46,
+ "\u025c\u02d0": 47,
+ "\u0261": 48,
+ "\u026a": 49,
+ "\u026a\u0279": 50,
+ "\u0279": 51,
+ "\u027e": 52,
+ "\u0283": 53,
+ "\u028a": 54,
+ "\u028a\u0279": 55,
+ "\u028c": 56,
+ "\u0292": 57,
+ "\u0294": 58,
+ "\u02c8a\u026a": 59,
+ "\u02c8a\u026a\u0259": 60,
+ "\u02c8a\u026a\u025a": 61,
+ "\u02c8a\u028a": 62,
+ "\u02c8e\u026a": 63,
+ "\u02c8i\u0259": 64,
+ "\u02c8i\u02d0": 65,
+ "\u02c8o\u028a": 66,
+ "\u02c8o\u02d0": 67,
+ "\u02c8o\u02d0\u0279": 68,
+ "\u02c8u\u02d0": 69,
+ "\u02c8\u00e6": 70,
+ "\u02c8\u0251\u02d0": 71,
+ "\u02c8\u0251\u02d0\u0279": 72,
+ "\u02c8\u0254": 73,
+ "\u02c8\u0254\u026a": 74,
+ "\u02c8\u0254\u02d0": 75,
+ "\u02c8\u0254\u02d0\u0279": 76,
+ "\u02c8\u0259": 77,
+ "\u02c8\u025a": 78,
+ "\u02c8\u025b": 79,
+ "\u02c8\u025b\u0279": 80,
+ "\u02c8\u025b\u02d0": 81,
+ "\u02c8\u025c\u02d0": 82,
+ "\u02c8\u026a": 83,
+ "\u02c8\u026a\u0279": 84,
+ "\u02c8\u028a": 85,
+ "\u02c8\u028a\u0279": 86,
+ "\u02c8\u028c": 87,
+ "\u02cca\u026a": 88,
+ "\u02cca\u026a\u025a": 89,
+ "\u02cca\u028a": 90,
+ "\u02cce\u026a": 91,
+ "\u02cci\u0259": 92,
+ "\u02cci\u02d0": 93,
+ "\u02cco\u028a": 94,
+ "\u02cco\u02d0": 95,
+ "\u02cco\u02d0\u0279": 96,
+ "\u02ccu\u02d0": 97,
+ "\u02cc\u00e6": 98,
+ "\u02cc\u0250": 99,
+ "\u02cc\u0251\u02d0": 100,
+ "\u02cc\u0251\u02d0\u0279": 101,
+ "\u02cc\u0254": 102,
+ "\u02cc\u0254\u026a": 103,
+ "\u02cc\u0254\u02d0": 104,
+ "\u02cc\u0254\u02d0\u0279": 105,
+ "\u02cc\u0259": 106,
+ "\u02cc\u025b": 107,
+ "\u02cc\u025b\u0279": 108,
+ "\u02cc\u025c\u02d0": 109,
+ "\u02cc\u026a": 110,
+ "\u02cc\u026a\u0279": 111,
+ "\u02cc\u028a": 112,
+ "\u02cc\u028a\u0279": 113,
+ "\u02cc\u028c": 114,
+ "\u03b8": 115,
+ "\u1d7b": 116,
+ ".": 117,
+ ",": 118,
+ "?": 119,
+ "sil": 120,
+ "!": 121,
+ "unk": 122
+ }
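The table above maps phoneme strings to integer ids. Stress marks are folded into the vowel tokens (for example `"\u02c8o\u028a"` is a primary-stressed diphthong with its own id), and `sil` and `unk` entries are included. A minimal sketch of a lookup follows; `encode_phonemes` is a hypothetical helper, and falling back to the `unk` id for unseen symbols is an assumption about intended use, not confirmed behavior of the `voxtream` package. Only an excerpt of the table is embedded to keep the example short.

```python
import json
from typing import Dict, List

# Excerpt of phoneme_to_token.json; ids are copied from the file above.
PHONEME_TABLE_EXCERPT = r"""
{"h": 9, "l": 15, "\u0259": 42, "\u02c8o\u028a": 66, ".": 117, "sil": 120, "unk": 122}
"""


def encode_phonemes(phonemes: List[str], table: Dict[str, int]) -> List[int]:
    """Map phoneme strings to ids, using the 'unk' id for unknown symbols (assumed)."""
    unk_id = table["unk"]
    return [table.get(p, unk_id) for p in phonemes]


table = json.loads(PHONEME_TABLE_EXCERPT)

# "h", schwa, "l", primary-stressed "oU" diphthong, ".", and an unseen symbol "q".
phonemes = ["h", "\u0259", "l", "\u02c8o\u028a", ".", "q"]
ids = encode_phonemes(phonemes, table)
print(ids)  # [9, 42, 15, 66, 117, 122]
```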