Commit 316f990 (verified), committed by voxtream2
1 Parent(s): 3fc654a

Initial commit

Files changed (4)
  1. README.md +60 -0
  2. config.json +16 -0
  3. model.safetensors +3 -0
  4. phoneme_to_token.json +125 -0
README.md ADDED
@@ -0,0 +1,60 @@
+ ---
+ datasets:
+ - amphion/Emilia-Dataset
+ - nvidia/hifitts-2
+ language:
+ - en
+ license: cc-by-4.0
+ pipeline_tag: text-to-speech
+ tags:
+ - text-to-speech
+ - zero-shot
+ - streaming
+ ---
+
+ # Model Card for VoXtream2
+
+ VoXtream2 is a zero-shot, full-stream TTS model with dynamic speaking-rate control that can be adjusted mid-utterance.
+
+ ### Key features
+
+ - **Dynamic speed control**: Distribution matching and classifier-free guidance allow fine-grained speaking-rate control, which can be adjusted while the model generates speech.
+ - **Streaming performance**: Runs **4x** faster than real time and achieves **74 ms** first-packet latency in full-stream mode on a consumer GPU.
+ - **Translingual capability**: Prompt text masking enables acoustic prompts in any language.
+
+ ## Get started
+
+ ### Usage
+
+ * Prompt audio: a file containing 3-10 seconds of the target voice. The maximum supported length is 20 seconds (longer audio will be trimmed).
+ * Text: what you want the model to say. The maximum supported length is 1000 characters (longer text will be trimmed).
+ * Speaking rate (optional): target speaking rate in syllables per second.
+
+ #### Output streaming
+ ```bash
+ python voxtream/run.py \
+     --prompt-audio assets/audio/english_male.wav \
+     --text "In general, however, some method is then needed to evaluate each approximation." \
+     --output "output_stream.wav"
+ ```
+
+ #### Full streaming (slow speech, 2 syllables per second)
+ ```bash
+ python voxtream/run.py \
+     --prompt-audio assets/audio/english_female.wav \
+     --text "Staff do not always do enough to prevent violence." \
+     --output "full_stream_2sps.wav" \
+     --full-stream \
+     --spk-rate 2.0
+ ```
+
+ * Note: The initial run may take some time to download the model weights and warm up the model graph.
+
+ ## Training Data
+
+ The model was trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets.
+
+
+ ### Out-of-Scope Use
+
+ Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
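The README gives the speaking rate in syllables per second and reports roughly 4x faster-than-real-time generation. A minimal sketch of the arithmetic these two figures imply follows. It assumes a hand-counted syllable total for the second example sentence, ignores pauses, and uses hypothetical helper names that are not part of the repository.

```python
def estimated_duration_s(num_syllables: int, rate_sps: float) -> float:
    """Approximate speech duration: syllables divided by syllables per second."""
    if rate_sps <= 0:
        raise ValueError("rate_sps must be positive")
    return num_syllables / rate_sps


def estimated_compute_s(duration_s: float, speedup: float = 4.0) -> float:
    """Approximate generation time for a given speed-up over real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return duration_s / speedup


# Assumption: 14 syllables, counted by hand for
# "Staff do not always do enough to prevent violence."
syllables = 14
duration = estimated_duration_s(syllables, rate_sps=2.0)
print(duration)                       # 7.0
print(estimated_compute_s(duration))  # 1.75
```

These are back-of-the-envelope estimates only; actual durations depend on pauses and on how closely the model tracks the requested rate.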
config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "phone_former": "phone_former",
+   "temp_former": "temp_former",
+   "dep_former": "dep_former_csm",
+   "phone_vocab_size": 125,
+   "audio_vocab_size": 2050,
+   "audio_pad_size": 0,
+   "embedding_dim": 1024,
+   "spk_embedding_dim": 192,
+   "num_codebooks": 16,
+   "num_phone_states": 6,
+   "amortization_divisor": 16,
+   "max_look_ahead": 5,
+   "audio_window_size": 625,
+   "phone_window_size": 625
+ }
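One way to read this file defensively is to load it into a typed container, so that a missing, extra, or mistyped key fails early. The sketch below assumes nothing about what each field means (the commit does not document them). `ModelConfig` and `load_config` are hypothetical names, not classes or functions from the repository.

```python
import json
from dataclasses import dataclass

# Contents of config.json as added in this commit.
CONFIG_TEXT = """
{
  "phone_former": "phone_former",
  "temp_former": "temp_former",
  "dep_former": "dep_former_csm",
  "phone_vocab_size": 125,
  "audio_vocab_size": 2050,
  "audio_pad_size": 0,
  "embedding_dim": 1024,
  "spk_embedding_dim": 192,
  "num_codebooks": 16,
  "num_phone_states": 6,
  "amortization_divisor": 16,
  "max_look_ahead": 5,
  "audio_window_size": 625,
  "phone_window_size": 625
}
"""


@dataclass(frozen=True)
class ModelConfig:  # hypothetical container, not part of the repository
    phone_former: str
    temp_former: str
    dep_former: str
    phone_vocab_size: int
    audio_vocab_size: int
    audio_pad_size: int
    embedding_dim: int
    spk_embedding_dim: int
    num_codebooks: int
    num_phone_states: int
    amortization_divisor: int
    max_look_ahead: int
    audio_window_size: int
    phone_window_size: int


def load_config(text: str) -> ModelConfig:
    """Parse JSON text and reject missing, unexpected, or mistyped keys."""
    cfg = ModelConfig(**json.loads(text))  # TypeError on missing or extra keys
    for name, expected in ModelConfig.__annotations__.items():
        value = getattr(cfg, name)
        if not isinstance(value, expected):
            raise TypeError(f"{name} should be {expected.__name__}, got {type(value).__name__}")
    return cfg


cfg = load_config(CONFIG_TEXT)
print(cfg.embedding_dim, cfg.num_codebooks)  # 1024 16
```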
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b0761a350f9908227dcdce4556328a5896d1bab9d609939869ea941f206febb5
+ size 1851507776
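The three lines above are a Git LFS pointer: the weights themselves live in LFS storage, and the pointer records their SHA-256 digest and size in bytes. A minimal sketch of checking a downloaded file against such a pointer follows. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local file path is left to the caller.

```python
import hashlib
from pathlib import Path

POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:b0761a350f9908227dcdce4556328a5896d1bab9d609939869ea941f206febb5
size 1851507776
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split the 'key value' lines of a pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and digest both match the pointer."""
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in chunks so large weight files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(round(pointer["size"] / 2**30, 2))  # 1.72 (GiB)
```

Calling `matches_pointer(Path("model.safetensors"), pointer)` on a local copy would then confirm that the download is complete and uncorrupted.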
phoneme_to_token.json ADDED
@@ -0,0 +1,125 @@
+ {
+   "a\u026a": 0,
+   "a\u026a\u0259": 1,
+   "a\u026a\u025a": 2,
+   "a\u028a": 3,
+   "b": 4,
+   "d": 5,
+   "d\u0292": 6,
+   "e\u026a": 7,
+   "f": 8,
+   "h": 9,
+   "i": 10,
+   "i\u0259": 11,
+   "i\u02d0": 12,
+   "j": 13,
+   "k": 14,
+   "l": 15,
+   "m": 16,
+   "n": 17,
+   "n\u0329": 18,
+   "o\u028a": 19,
+   "o\u02d0": 20,
+   "o\u02d0\u0279": 21,
+   "p": 22,
+   "r": 23,
+   "s": 24,
+   "t": 25,
+   "t\u0283": 26,
+   "u\u02d0": 27,
+   "v": 28,
+   "w": 29,
+   "x": 30,
+   "z": 31,
+   "\u00e6": 32,
+   "\u00f0": 33,
+   "\u014b": 34,
+   "\u0250": 35,
+   "\u0251\u02d0": 36,
+   "\u0251\u02d0\u0279": 37,
+   "\u0254": 38,
+   "\u0254\u026a": 39,
+   "\u0254\u02d0": 40,
+   "\u0254\u02d0\u0279": 41,
+   "\u0259": 42,
+   "\u0259l": 43,
+   "\u025a": 44,
+   "\u025b": 45,
+   "\u025b\u0279": 46,
+   "\u025c\u02d0": 47,
+   "\u0261": 48,
+   "\u026a": 49,
+   "\u026a\u0279": 50,
+   "\u0279": 51,
+   "\u027e": 52,
+   "\u0283": 53,
+   "\u028a": 54,
+   "\u028a\u0279": 55,
+   "\u028c": 56,
+   "\u0292": 57,
+   "\u0294": 58,
+   "\u02c8a\u026a": 59,
+   "\u02c8a\u026a\u0259": 60,
+   "\u02c8a\u026a\u025a": 61,
+   "\u02c8a\u028a": 62,
+   "\u02c8e\u026a": 63,
+   "\u02c8i\u0259": 64,
+   "\u02c8i\u02d0": 65,
+   "\u02c8o\u028a": 66,
+   "\u02c8o\u02d0": 67,
+   "\u02c8o\u02d0\u0279": 68,
+   "\u02c8u\u02d0": 69,
+   "\u02c8\u00e6": 70,
+   "\u02c8\u0251\u02d0": 71,
+   "\u02c8\u0251\u02d0\u0279": 72,
+   "\u02c8\u0254": 73,
+   "\u02c8\u0254\u026a": 74,
+   "\u02c8\u0254\u02d0": 75,
+   "\u02c8\u0254\u02d0\u0279": 76,
+   "\u02c8\u0259": 77,
+   "\u02c8\u025a": 78,
+   "\u02c8\u025b": 79,
+   "\u02c8\u025b\u0279": 80,
+   "\u02c8\u025b\u02d0": 81,
+   "\u02c8\u025c\u02d0": 82,
+   "\u02c8\u026a": 83,
+   "\u02c8\u026a\u0279": 84,
+   "\u02c8\u028a": 85,
+   "\u02c8\u028a\u0279": 86,
+   "\u02c8\u028c": 87,
+   "\u02cca\u026a": 88,
+   "\u02cca\u026a\u025a": 89,
+   "\u02cca\u028a": 90,
+   "\u02cce\u026a": 91,
+   "\u02cci\u0259": 92,
+   "\u02cci\u02d0": 93,
+   "\u02cco\u028a": 94,
+   "\u02cco\u02d0": 95,
+   "\u02cco\u02d0\u0279": 96,
+   "\u02ccu\u02d0": 97,
+   "\u02cc\u00e6": 98,
+   "\u02cc\u0250": 99,
+   "\u02cc\u0251\u02d0": 100,
+   "\u02cc\u0251\u02d0\u0279": 101,
+   "\u02cc\u0254": 102,
+   "\u02cc\u0254\u026a": 103,
+   "\u02cc\u0254\u02d0": 104,
+   "\u02cc\u0254\u02d0\u0279": 105,
+   "\u02cc\u0259": 106,
+   "\u02cc\u025b": 107,
+   "\u02cc\u025b\u0279": 108,
+   "\u02cc\u025c\u02d0": 109,
+   "\u02cc\u026a": 110,
+   "\u02cc\u026a\u0279": 111,
+   "\u02cc\u028a": 112,
+   "\u02cc\u028a\u0279": 113,
+   "\u02cc\u028c": 114,
+   "\u03b8": 115,
+   "\u1d7b": 116,
+   ".": 117,
+   ",": 118,
+   "?": 119,
+   "sil": 120,
+   "!": 121,
+   "unk": 122
+ }
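The table maps IPA phoneme strings (with optional primary `\u02c8` or secondary `\u02cc` stress marks), punctuation, `sil`, and `unk` to integer ids 0 through 122. A minimal sketch of how such a table might be applied follows. It uses only a handful of entries copied from the file and assumes that `unk` serves as the fallback for symbols missing from the table, which the commit does not state. The `encode` helper is hypothetical.

```python
import json

# A few entries copied from phoneme_to_token.json (the full file has 123).
# The raw string keeps the \u escapes intact for json.loads to decode.
PHONEME_TO_TOKEN = json.loads(r"""
{"h": 9, "l": 15, "\u0259": 42, "\u02c8o\u028a": 66, "sil": 120, "unk": 122}
""")

PHONE_VOCAB_SIZE = 125  # "phone_vocab_size" from config.json


def encode(phonemes: list[str], table: dict[str, int]) -> list[int]:
    """Map phoneme strings to ids, falling back to 'unk' (an assumption)."""
    unk_id = table["unk"]
    return [table.get(phoneme, unk_id) for phoneme in phonemes]


# "h", schwa, "l", stressed "oU" diphthong
print(encode(["h", "\u0259", "l", "\u02c8o\u028a"], PHONEME_TO_TOKEN))  # [9, 42, 15, 66]
print(encode(["h", "q"], PHONEME_TO_TOKEN))                             # [9, 122]

# Every id must fit inside the embedding table sized by phone_vocab_size.
assert max(PHONEME_TO_TOKEN.values()) < PHONE_VOCAB_SIZE
```

The highest id in the file is 122 while `phone_vocab_size` is 125, so two ids are left unassigned here; the commit does not say what they are used for.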