herimor committed on
Commit fa815a1 · verified · 1 Parent(s): ee787f9

Update README

Files changed (1): README.md (+48 -3)
@@ -1,3 +1,48 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ license: cc-by-4.0
+ language:
+ - en
+ pipeline_tag: text-to-speech
+ tags:
+ - voxtream
+ - text-to-speech
+ ---
+
+ # Model Card for VoXtream
+
+ VoXtream is a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.
+
+ ### Key features
+
+ - **Streaming**: Supports a full-stream scenario, where the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
+ - **Speed**: Runs **5x** faster than real time and achieves **102 ms** first-packet latency on GPU.
+ - **Quality and efficiency**: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models and of models trained on larger datasets.
+
+ ### Model Sources
+
+ - **Repository:** [repo](https://github.com/herimor/voxtream)
+ - **Paper:** [paper](https://herimor.github.io/voxtream)
+ - **Demo:** [demo](https://herimor.github.io/voxtream)
+
+ ## Get started
+
+ Clone our [repo](https://github.com/herimor/voxtream) and follow the instructions in its README file.
+
+ ### Out-of-Scope Use
+
+ Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
+
+ ## Training Data
+
+ The model was trained on a 9k-hour subset of the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets. For more details, please check our paper.
+
+ ## Citation
+
+ ```
+ @article{torgashov2025voxtream,
+   author = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
+   title = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
+   journal = {arXiv},
+   year = {2025}
+ }
+ ```
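
The streaming figures quoted in the Key features list (80 ms chunks, 5x faster than real time, 102 ms first-packet latency) can be related with a little arithmetic. The sketch below is not part of the commit and does not use the VoXtream API. It assumes a constant generation speed and that playback starts as soon as the first packet arrives; all function names are hypothetical, and the sample rate is left as a parameter because the README does not state it.

```python
import math

CHUNK_MS = 80          # from the README: audio is emitted in 80 ms chunks
SPEEDUP = 5.0          # from the README: 5x faster than real time on GPU
FIRST_PACKET_MS = 102  # from the README: first-packet latency on GPU


def chunk_count(duration_ms: float, chunk_ms: int = CHUNK_MS) -> int:
    """Number of chunks needed to cover an utterance of the given length."""
    return math.ceil(duration_ms / chunk_ms)


def ready_time_ms(k: int) -> float:
    """Time at which chunk k (0-based) is available.

    Assumption: generation speed is constant, so each 80 ms chunk
    takes 80 / 5 = 16 ms to produce after the first packet.
    """
    return FIRST_PACKET_MS + k * CHUNK_MS / SPEEDUP


def playback_deadline_ms(k: int) -> float:
    """Time at which chunk k must start playing.

    Assumption: playback begins the moment the first packet arrives.
    """
    return FIRST_PACKET_MS + k * CHUNK_MS


def samples_per_chunk(sample_rate_hz: int, chunk_ms: int = CHUNK_MS) -> int:
    """Samples in one chunk; the sample rate is an input, not a README fact."""
    return sample_rate_hz * chunk_ms // 1000


if __name__ == "__main__":
    n = chunk_count(2000)  # a 2-second utterance
    print(f"chunks needed: {n}")
    for k in (0, 1, 10, n - 1):
        slack = playback_deadline_ms(k) - ready_time_ms(k)
        print(f"chunk {k}: ready {ready_time_ms(k):.0f} ms, slack {slack:.0f} ms")
```

Under these assumptions every chunk after the first is ready before its playback deadline, and the slack grows by 64 ms per chunk. Real generation speed is unlikely to be perfectly constant, so this is an idealized bound rather than a measured result.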