---
datasets:
- amphion/Emilia-Dataset
- nvidia/hifitts-2
language:
- en
license: cc-by-4.0
pipeline_tag: text-to-speech
tags:
- text-to-speech
- zero-shot
- streaming
---

# Model Card for VoXtream2

VoXtream2 is a zero-shot full-stream TTS model whose dynamic speaking-rate control can be updated mid-utterance.

### Key features

- **Dynamic speed control**: Distribution matching and classifier-free guidance allow fine-grained speaking-rate control, which can be adjusted while the model generates speech.
- **Streaming performance**: Runs **4x** faster than real time and achieves **74 ms** first-packet latency in full-stream mode on a consumer GPU.
- **Translingual capability**: Prompt text masking enables support of acoustic prompts in any language.

## Get started

### Usage

* Prompt audio: a file containing 3-10 seconds of the target voice. The maximum supported length is 20 seconds (longer audio will be trimmed).
* Text: what you want the model to say. The maximum supported length is 1000 characters (longer text will be trimmed).
* Speaking rate (optional): target speaking rate in syllables per second.

#### Output streaming

```bash
python voxtream/run.py \
    --prompt-audio assets/audio/english_male.wav \
    --text "In general, however, some method is then needed to evaluate each approximation." \
    --output "output_stream.wav"
```

#### Full streaming (slow speech, 2 syllables per second)

```bash
python voxtream/run.py \
    --prompt-audio assets/audio/english_female.wav \
    --text "Staff do not always do enough to prevent violence." \
    --output "full_stream_2sps.wav" \
    --full-stream \
    --spk-rate 2.0
```

* Note: The initial run may take some time to download the model weights and warm up the model graph.

## Training Data

The model was trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets.

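
As a rough illustration of the input limits and the speaking-rate parameter listed under Usage, the Python sketch below mirrors the documented trimming (20 seconds of prompt audio, 1000 characters of text) and estimates output duration from a syllable count. The helper names are hypothetical and are not part of the VoXtream2 code. Keeping the beginning of over-long text is an assumption, since the card does not say which end is trimmed.

```python
MAX_TEXT_CHARS = 1000       # documented text limit
MAX_PROMPT_SECONDS = 20.0   # documented prompt-audio limit


def trim_text(text: str, limit: int = MAX_TEXT_CHARS) -> str:
    """Cut text to the documented limit.

    Assumption: the beginning of the text is kept.
    """
    return text[:limit]


def effective_prompt_seconds(duration: float,
                             limit: float = MAX_PROMPT_SECONDS) -> float:
    """Length of prompt audio (in seconds) that remains after trimming."""
    return min(duration, limit)


def estimated_speech_seconds(num_syllables: int, spk_rate: float) -> float:
    """Rough output duration: syllables divided by syllables per second.

    Ignores pauses, so real output is likely somewhat longer.
    """
    if spk_rate <= 0:
        raise ValueError("spk_rate must be positive")
    return num_syllables / spk_rate
```

For example, `estimated_speech_seconds(14, 2.0)` gives 7.0, so a 14-syllable sentence at `--spk-rate 2.0` would take roughly seven seconds of audio before pauses are counted.
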
### Out-of-Scope Use

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.