Update README
README.md
CHANGED
@@ -1,3 +1,48 @@
----
-license: cc-by-4.0
-
+---
+license: cc-by-4.0
+language:
+- en
+pipeline_tag: text-to-speech
+tags:
+- voxtream
+- text-to-speech
+---
+
+# Model Card for VoXtream
+
+VoXtream is a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.
+
+### Key features
+
+- **Streaming**: Supports a full-stream scenario, where the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
+- **Speed**: Runs **5x** faster than real time and achieves a first-packet latency of **102 ms** on GPU.
+- **Quality and efficiency**: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models or models trained on large datasets.
+
+### Model Sources
+
+- **Repository:** [repo](https://github.com/herimor/voxtream)
+- **Paper:** [paper](https://herimor.github.io/voxtream)
+- **Demo:** [demo](https://herimor.github.io/voxtream)
+
+## Get started
+
+Clone our [repo](https://github.com/herimor/voxtream) and follow the instructions in its README file.
+
+### Out-of-Scope Use
+
+Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this requirement could put you in violation of copyright laws.
+
+## Training Data
+
+The model was trained on a 9k-hour subset of the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets. For more details, please see our paper.
+
+## Citation
+
+```
+@article{torgashov2025voxtream,
+author = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
+title = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
+journal = {arXiv},
+year = {2025}
+}
+```
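
The streaming figures quoted in the key features (80 ms output chunks, 5x faster than real time) imply a rough per-chunk compute budget. The sketch below works through that arithmetic only. It is not part of the VoXtream code base: the helper names are hypothetical, and it assumes the 5x speedup applies uniformly to every chunk, which the model card does not state.

```python
import math

# Figures quoted in the model card; everything else here is illustrative.
CHUNK_MS = 80.0  # audio duration carried by one output chunk
SPEEDUP = 5.0    # "5x faster than real time"


def compute_ms_per_chunk(chunk_ms: float = CHUNK_MS, speedup: float = SPEEDUP) -> float:
    """Average compute time per chunk implied by the real-time speedup (hypothetical helper)."""
    return chunk_ms / speedup


def chunks_for(audio_ms: float, chunk_ms: float = CHUNK_MS) -> int:
    """Number of chunks needed to cover audio_ms of speech; the last chunk may be partial."""
    return math.ceil(audio_ms / chunk_ms)


if __name__ == "__main__":
    print(compute_ms_per_chunk())  # 16.0 ms of compute per 80 ms chunk
    print(chunks_for(1000.0))      # 13 chunks for one second of audio
```

Under these assumptions, each 80 ms chunk would take about 16 ms to generate on average, which is what leaves headroom for streaming playback without gaps.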