---
datasets:
- amphion/Emilia-Dataset
- nvidia/hifitts-2
language:
- en
license: cc-by-4.0
pipeline_tag: text-to-speech
tags:
- text-to-speech
- zero-shot
- streaming
---
# Model Card for VoXtream2
VoXtream2 is a zero-shot, full-stream TTS model with dynamic speaking-rate control: the target rate can be updated mid-utterance.
### Key features
- **Dynamic speed control**: Distribution matching and classifier-free guidance allow fine-grained speaking-rate control, which can be adjusted while the model generates speech.
- **Streaming performance**: Runs **4x** faster than real time and achieves **74 ms** first-packet latency in full-stream mode on a consumer GPU.
- **Translingual capability**: Prompt text masking enables support of acoustic prompts in any language.
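
To make the streaming figures concrete, the sketch below converts a speed-up factor into a real-time factor (RTF) and a rough synthesis time for a clip. The function names are illustrative and not part of the VoXtream2 code base, and the estimate assumes the 4x speed-up holds uniformly over the whole utterance, which this card does not state.

```python
def real_time_factor(speedup: float) -> float:
    """RTF = synthesis time / audio duration; values below 1 are faster than real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def estimated_synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Rough wall-clock time needed to generate `audio_seconds` of speech."""
    return audio_seconds * real_time_factor(speedup)


if __name__ == "__main__":
    # A 4x speed-up corresponds to an RTF of 0.25.
    print(real_time_factor(4.0))
    # Under that assumption, 10 s of audio takes about 2.5 s to synthesize.
    print(estimated_synthesis_seconds(10.0, 4.0))
```

The first-packet latency (74 ms) is a separate quantity: it measures how long the listener waits for the first audio chunk, not how fast the rest of the stream is produced.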
## Get started
### Usage
* Prompt audio: a file containing 3-10 seconds of the target voice. The maximum supported length is 20 seconds (longer audio will be trimmed).
* Text: What you want the model to say. The maximum supported length is 1000 characters (longer text will be trimmed).
* Speaking rate (optional): target speaking rate in syllables per second.
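
The limits above can be checked before calling the model. The following is a minimal sketch, assuming that trimming keeps the beginning of the input and that the prompt is available as a sequence of samples at a known sample rate. The helper names and the 16 kHz rate in the usage example are hypothetical and are not taken from the VoXtream2 code.

```python
MAX_PROMPT_SECONDS = 20.0  # maximum prompt length stated in the usage notes
MAX_TEXT_CHARS = 1000      # maximum text length stated in the usage notes


def trim_text(text: str, max_chars: int = MAX_TEXT_CHARS) -> str:
    """Keep at most `max_chars` characters (assumption: the tail is dropped)."""
    return text[:max_chars]


def trim_prompt(samples, sample_rate: int, max_seconds: float = MAX_PROMPT_SECONDS):
    """Keep at most `max_seconds` of audio (assumption: the tail is dropped)."""
    max_len = int(sample_rate * max_seconds)
    return samples[:max_len]


if __name__ == "__main__":
    sample_rate = 16_000  # hypothetical rate, for illustration only
    prompt = [0.0] * (sample_rate * 25)  # 25 s of silence
    print(len(trim_prompt(prompt, sample_rate)) / sample_rate)  # seconds kept
    print(len(trim_text("a" * 1500)))  # characters kept
```

Pre-trimming on the caller side makes it explicit which part of a long prompt or text is actually used, rather than relying on the tool's internal behavior.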
#### Output streaming
```bash
python voxtream/run.py \
--prompt-audio assets/audio/english_male.wav \
--text "In general, however, some method is then needed to evaluate each approximation." \
--output "output_stream.wav"
```
#### Full streaming (slow speech, 2 syllables per second)
```bash
python voxtream/run.py \
--prompt-audio assets/audio/english_female.wav \
--text "Staff do not always do enough to prevent violence." \
--output "full_stream_2sps.wav" \
--full-stream \
--spk-rate 2.0
```
* Note: The initial run may take some time to download the model weights and warm up the model graph.
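
Because the speaking rate is expressed in syllables per second, a target rate implies an approximate utterance duration. The sketch below estimates it with a crude vowel-group syllable counter. Both helpers are illustrative assumptions, not part of VoXtream2; the counter miscounts silent letters and ignores pauses, so treat the result as a rough guide only.

```python
import re


def count_syllables(text: str) -> int:
    """Crude estimate: count runs of consecutive vowels (a, e, i, o, u, y) in each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(max(1, len(re.findall(r"[aeiouy]+", word))) for word in words)


def estimated_duration_seconds(text: str, spk_rate: float) -> float:
    """Approximate speech duration for a target rate in syllables per second."""
    if spk_rate <= 0:
        raise ValueError("spk_rate must be positive")
    return count_syllables(text) / spk_rate


if __name__ == "__main__":
    sentence = "Staff do not always do enough to prevent violence."
    print(count_syllables(sentence))                  # 14 with this heuristic
    print(estimated_duration_seconds(sentence, 2.0))  # 7.0
```

Under this estimate, the full-streaming example sentence at 2 syllables per second would last roughly 7 seconds, excluding pauses.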
## Training Data
The model was trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets.
### Out-of-Scope Use
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this condition, you could be in violation of copyright laws.