metadata
datasets:
  - amphion/Emilia-Dataset
  - nvidia/hifitts-2
language:
  - en
license: cc-by-4.0
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - zero-shot
  - streaming

Model Card for VoXtream2

VoXtream2 is a zero-shot, full-stream TTS model with dynamic speaking-rate control: the target rate can be changed mid-utterance while speech is being generated.

Key features

  • Dynamic speed control: Distribution matching and classifier-free guidance allow fine-grained control of the speaking rate, which can be adjusted as the model generates speech.
  • Streaming performance: Runs 4x faster than real time and achieves a first-packet latency of 74 ms in full-stream mode on a consumer GPU.
  • Translingual capability: Prompt text masking enables support of acoustic prompts in any language.
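
To make the "4x faster than real time" figure concrete, it can be read as a real-time factor (RTF) of about 0.25, where RTF is the time spent generating divided by the duration of the audio produced. The sketch below only illustrates that arithmetic; the function name and the 10-second example are hypothetical and are not measurements from this model.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = generation time / duration of generated audio.

    An RTF below 1.0 means speech is produced faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical example: 10 s of audio generated in 2.5 s.
rtf = real_time_factor(synthesis_seconds=2.5, audio_seconds=10.0)
print(f"RTF: {rtf}")            # RTF: 0.25
print(f"Speed-up: {1 / rtf}x")  # Speed-up: 4.0x
```

First-packet latency is a separate quantity: it measures how long a listener waits before the first audio chunk arrives, not the overall throughput.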

Get started

Usage

  • Prompt audio: a file containing 3-10 seconds of the target voice. The maximum supported length is 20 seconds (longer audio will be trimmed).
  • Text: the text you want the model to say. The maximum supported length is 1000 characters (longer text will be trimmed).
  • Speaking rate (optional): target speaking rate in syllables per second.
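
The limits above can be checked before calling the model. The following is a minimal sketch, not part of the VoXtream2 code base: all function names are hypothetical, and it assumes that trimming keeps the beginning of the text or audio, which the card does not state. It also shows how a speaking rate in syllables per second translates into an approximate utterance duration.

```python
MAX_PROMPT_SECONDS = 20.0  # maximum prompt length stated in the usage notes
MAX_TEXT_CHARS = 1000      # maximum text length stated in the usage notes


def clip_text(text: str, max_chars: int = MAX_TEXT_CHARS) -> str:
    """Mimic the text limit (assumption: the beginning of the text is kept)."""
    return text[:max_chars]


def effective_prompt_seconds(duration: float,
                             max_seconds: float = MAX_PROMPT_SECONDS) -> float:
    """Length of prompt audio that remains after the 20-second limit."""
    return min(duration, max_seconds)


def prompt_in_recommended_range(duration: float) -> bool:
    """True if the prompt falls within the recommended 3-10 seconds."""
    return 3.0 <= duration <= 10.0


def estimated_speech_seconds(num_syllables: int,
                             syllables_per_second: float) -> float:
    """Rough duration of an utterance at a given speaking rate."""
    if syllables_per_second <= 0:
        raise ValueError("syllables_per_second must be positive")
    return num_syllables / syllables_per_second


# A sentence of about 14 syllables (counted by hand) at 2 syllables per second:
print(estimated_speech_seconds(14, 2.0))  # 7.0
print(effective_prompt_seconds(25.0))     # 20.0
print(len(clip_text("a" * 1500)))         # 1000
```

The duration estimate ignores pauses, so real output is likely to run somewhat longer.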

Output streaming

python voxtream/run.py \
    --prompt-audio assets/audio/english_male.wav \
    --text "In general, however, some method is then needed to evaluate each approximation." \
    --output "output_stream.wav"

Full streaming (slow speech, 2 syllables per second)

python voxtream/run.py \
    --prompt-audio assets/audio/english_female.wav \
    --text "Staff do not always do enough to prevent violence." \
    --output "full_stream_2sps.wav" \
    --full-stream \
    --spk-rate 2.0

  • Note: The initial run may take some time, because it downloads the model weights and warms up the model graph.
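
For batch jobs, the commands above can be assembled from Python. This is a sketch under stated assumptions: the helper names `build_command` and `synthesize` are hypothetical, only the flags shown in the two commands above are used, and the script is assumed to be run from the repository root where `voxtream/run.py` lives. Whether `--spk-rate` has any effect without `--full-stream` is not stated in this card, so the helper leaves that choice to the caller.

```python
import shlex
import subprocess
from typing import List, Optional


def build_command(prompt_audio: str, text: str, output: str,
                  full_stream: bool = False,
                  spk_rate: Optional[float] = None,
                  python: str = "python") -> List[str]:
    """Build the argument list for voxtream/run.py using the flags shown above."""
    cmd = [python, "voxtream/run.py",
           "--prompt-audio", prompt_audio,
           "--text", text,
           "--output", output]
    if full_stream:
        cmd.append("--full-stream")
    if spk_rate is not None:
        cmd += ["--spk-rate", str(spk_rate)]
    return cmd


def synthesize(**kwargs) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(**kwargs), check=True)


cmd = build_command(
    prompt_audio="assets/audio/english_female.wav",
    text="Staff do not always do enough to prevent violence.",
    output="full_stream_2sps.wav",
    full_stream=True,
    spk_rate=2.0,
)
# Print a copy-pastable shell command without running it.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.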

Training Data

The model was trained on the Emilia and HiFiTTS-2 datasets.

Out-of-Scope Use

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.