+---
+license: cc-by-4.0
+language:
+  - en
+pipeline_tag: text-to-speech
+tags:
+- voxtream
+- text-to-speech
+---
+# Model Card for VoXtream
+VoXtream is a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.
+### Key features
+- **Streaming**: Supports a full-stream scenario in which the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
+- **Speed**: Runs **5x** faster than real time and achieves a first-packet latency of **102 ms** on GPU.
+- **Quality and efficiency**: Trained on only 9k hours of data, it matches or surpasses the quality and intelligibility of larger models or models trained on large datasets.
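
The figures in the list above imply some simple arithmetic, sketched here in Python. This is only an illustration, not the VoXtream API: the 24 kHz sample rate is an assumption (the model card does not state one), and `toy_stream` is a hypothetical stand-in that yields silence to mimic the word-in, chunk-out interface.

```python
from typing import Iterable, Iterator, List

CHUNK_MS = 80          # chunk duration stated in the model card
SPEEDUP = 5.0          # "5x faster than real time" from the model card
FIRST_PACKET_MS = 102  # first-packet latency on GPU from the model card
SAMPLE_RATE = 24_000   # ASSUMPTION: not stated in the model card


def samples_per_chunk(sample_rate: int = SAMPLE_RATE, chunk_ms: int = CHUNK_MS) -> int:
    """Number of audio samples in one chunk at the given sample rate."""
    return sample_rate * chunk_ms // 1000


def compute_ms_per_chunk(chunk_ms: int = CHUNK_MS, speedup: float = SPEEDUP) -> float:
    """Average compute time per chunk implied by the speed-up factor."""
    return chunk_ms / speedup


def toy_stream(words: Iterable[str], chunks_per_word: int = 3) -> Iterator[List[float]]:
    """Hypothetical stand-in for a streaming synthesizer (not the VoXtream API).

    Consumes words one at a time and yields fixed-size chunks of silence,
    so audio can be emitted before the rest of the sentence is known.
    """
    n = samples_per_chunk()
    for _word in words:
        for _ in range(chunks_per_word):
            yield [0.0] * n


if __name__ == "__main__":
    chunks = list(toy_stream(iter("hello streaming world".split())))
    print(samples_per_chunk())     # samples in one 80 ms chunk at the assumed rate
    print(compute_ms_per_chunk())  # average compute budget per chunk, in ms
    print(len(chunks) * CHUNK_MS)  # total audio produced by the toy stream, in ms
```

Under these assumptions, each 80 ms chunk holds 1920 samples and takes about 16 ms to generate on average, which leaves headroom for a consumer to play audio while later words are still arriving.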
+### Model Sources
+- **Repository:** [repo](https://github.com/herimor/voxtream)
+- **Paper:** [paper](https://herimor.github.io/voxtream)
+- **Demo:** [demo](https://herimor.github.io/voxtream)
+## Get started
+Clone our [repo](https://github.com/herimor/voxtream) and follow the instructions in the README file.
+### Out-of-Scope Use
+Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this item could put you in violation of copyright laws.
+## Training Data
+The model was trained on a 9k-hour subset of the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets. For more details, please see our paper.
+## Citation
+```
+@article{torgashov2025voxtream,
+  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
+  title     = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
+  journal   = {arXiv},
+  year      = {2025}
+}
+```