pltobing committed on
Commit
3717103
·
1 Parent(s): dfb0239

Add files, models, and assets

.gitattributes CHANGED
@@ -33,3 +33,16 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ qwen3-tts_onnx/codec_decoder_model.onnx filter=lfs diff=lfs merge=lfs -text
37
+ qwen3-tts_onnx/speaker_encoder_model.onnx filter=lfs diff=lfs merge=lfs -text
38
+ qwen3-tts_onnx/talker_codec_embed_model.onnx filter=lfs diff=lfs merge=lfs -text
39
+ qwen3-tts_onnx/talker_local_model.onnx filter=lfs diff=lfs merge=lfs -text
40
+ qwen3-tts_onnx/talker_model.onnx filter=lfs diff=lfs merge=lfs -text
41
+ qwen3-tts_onnx/text_embed_proj_model.onnx filter=lfs diff=lfs merge=lfs -text
42
+ audio_ref/female_shadowheart.flac filter=lfs diff=lfs merge=lfs -text
43
+ audio_ref/male_old_movie.flac filter=lfs diff=lfs merge=lfs -text
44
+ audio_ref/male_petergriffin.wav filter=lfs diff=lfs merge=lfs -text
45
+ audio_ref/male_stewie.mp3 filter=lfs diff=lfs merge=lfs -text
46
+ audio_ref/rick-sanchez.mp3 filter=lfs diff=lfs merge=lfs -text
47
+ audio_ref/david-attenborough.mp3 filter=lfs diff=lfs merge=lfs -text
48
+ audio_synth/output_1775946408.2838778.wav filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,2 @@
1
+ __pycache__
2
+ .*swp
README.md CHANGED
@@ -1,3 +1,249 @@
1
  ---
2
+ language:
3
+ - ru
4
+ - zh
5
+ - en
6
+ - de
7
+ - es
8
+ - fr
9
+ - ja
10
+ - it
11
+ - pt
12
+ - ko
13
+ tags:
14
+ - text-to-speech
15
+ - TTS
16
+ - ONNX
17
+ - qwen3-tts
18
+ - voice-clone
19
+ - streaming
20
+ - qwen3
21
+ - rvq
22
+ - multilingual
23
+ pipeline_tag: text-to-speech
24
  license: apache-2.0
25
+ base_model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
26
  ---
27
+
28
+ # Qwen3-TTS-Realtime ONNX Inference
29
+
30
+ Pure ONNX Runtime inference pipeline for [Qwen3-TTS-12Hz-0.6B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base), enabling **streaming text-to-speech** with no PyTorch dependency at runtime.
31
+
32
+ ## Overview
33
+
34
+ This repository provides:
35
+
36
+ - **`qwen3_tts_inferencer_onnx.py`** — Core streaming TTS engine that orchestrates six ONNX models (talker LLM, local talker transformer, codec decoder, speaker encoder, talker codec embedding, text embedding projection) using only NumPy and ONNX Runtime.
37
+ - **`test_qwen3-tts-streaming_onnx.py`** — End-to-end test script that simulates LLM streaming text and produces a WAV file.
38
+
39
+ ## Architecture
40
+
41
+ ```
42
+ Reference Audio ──► Speaker Encoder ──► Speaker Embedding Vector (voice clone context)
43
+
44
+
45
+ Text Deltas ──► Talker LLM (Qwen3-0.6B) ──► [Hidden States, VQ Token]
46
+
47
+
48
+ Local Transformer ──► 15-codebook RVQ Tokens
49
+
50
+
51
+ VQ + RVQ Tokens ──► Codec Decoder ──► 24 kHz Waveform
52
+ ```
53
+
54
+ | Component | ONNX Model | Description |
55
+ |-----------|------------|-------------|
56
+ | Talker LLM | `talker_model.onnx` | Qwen3-based talker LM mapping interleaved text+audio token embeddings to hidden states and the first VQ token. Maintains a growing KV-cache across the entire generation. |
57
+ | Local Talker | `talker_local_model.onnx` | Depth-wise decoder generating 15 RVQ codebook entries per frame from talker hidden states and VQ. Creates and discards a fresh KV-cache per frame. |
58
+ | Codec Decoder | `codec_decoder_model.onnx` | Decodes VQ+RVQ audio codes back to 24 kHz waveform. Maintains KV-caches and convolutional caches for streaming decode. |
59
+ | Speaker Encoder | `speaker_encoder_model.onnx` | ECAPA-TDNN-based speaker encoder. Produces a 1024-dim speaker embedding vector for voice identity cloning. |
60
+ | Talker Codec Embed | `talker_codec_embed_model.onnx` | VQ embedding table for the talker model, with a 2048-token vocabulary. |
61
+ | Text Embed Projection | `text_embed_proj_model.onnx` | Text embedding and projection for the talker model; the text embedding has a 151,936-token vocabulary. |
62
+
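+ All six components are loaded as plain ONNX Runtime sessions. The helper below is a minimal sketch (illustrative only; the actual session handling lives in `qwen3_tts_inferencer_onnx.py`) showing how they can be created on CPU with a bounded intra-op thread count:
+
+ ```python
+ import onnxruntime as ort
+
+ def load_session(path, num_threads=4):
+     # CPU session with a fixed intra-op thread pool (see --num_threads below).
+     opts = ort.SessionOptions()
+     opts.intra_op_num_threads = num_threads
+     return ort.InferenceSession(path, sess_options=opts,
+                                 providers=["CPUExecutionProvider"])
+
+ sessions = {
+     name: load_session(f"qwen3-tts_onnx/{name}_model.onnx")
+     for name in ("talker", "talker_local", "codec_decoder",
+                  "speaker_encoder", "talker_codec_embed", "text_embed_proj")
+ }
+ ```
+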
63
+ ## Requirements
64
+
65
+ ```
66
+ librosa
67
+ numpy
68
+ onnxruntime
69
+ python-box
70
+ soundfile
71
+ transformers==4.57.3
72
+ ```
73
+
74
+ Example installation with a conda environment:
75
+
76
+ ```bash
77
+ conda create --name qwen3-tts-streaming-onnx-1 python=3.12
78
+ conda activate qwen3-tts-streaming-onnx-1
79
+ pip install -r requirements.txt
80
+ ```
81
+
82
+ ## Directory Structure
83
+
84
+ ```
85
+ .
86
+ ├── test_qwen3-tts-streaming_onnx.py # End-to-end test script
87
+ ├── README.md
88
+ ├── requirements.txt
89
+ ├── qwen3-tts_onnx/ # FP32
90
+ │ ├── talker_model.onnx
91
+ │ ├── talker_local_model.onnx
92
+ │ ├── codec_decoder_model.onnx
93
+ │ ├── speaker_encoder_model.onnx
94
+ │ ├── talker_codec_embed_model.onnx
95
+ │ └── text_embed_proj_model.onnx
96
+ ├── configs/
97
+ │ ├── config.json # Talker, Local Talker, Speaker Encoder config
98
+ │ ├── speech_tokenizer_config.json # Codec config
99
+ │ ├── preprocessor_config.json # Text Processor configs
100
+ │ ├── tokenizer_config.json
101
+ │ ├── vocab.json
102
+ │ └── merges.txt
103
+ ├── src/
104
+ │ ├── core/
105
+ │ │ ├── configuration_qwen3_tts.py
106
+ │ │ └── processing_qwen3_tts.py
107
+ │ ├── inference/
108
+ │ │ └── qwen3_tts_inferencer_onnx.py # Core ONNX inference engine
109
+ │ └── utils/
110
+ │ └── audio_utils.py
111
+ ├── logs/
112
+ │ └── <log_synth>.txt
113
+ ├── audio_ref/
114
+ │ └── <reference_speaker>.[wav|mp3|flac]
115
+ └── audio_synth/
116
+ └── <synthesized_example>.wav
117
+ ```
118
+
119
+ ## Usage
120
+
121
+ ### Basic streaming TTS usage
122
+
123
+ ```bash
124
+ python -u test_qwen3-tts-streaming_onnx.py >& logs/log_test-streaming-onnx-1.txt
125
+ # Audio is saved automatically to audio_synth/ using the default parameters, text, and language.
126
+ ```
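+
+ The codec decoder emits 24 kHz mono audio; the result can be sanity-checked with `soundfile` (already in the requirements), e.g. on the bundled example output:
+
+ ```python
+ import soundfile as sf
+
+ # Bundled example synthesized with the default settings.
+ wav, sr = sf.read("audio_synth/output_1775946408.2838778.wav")
+ print(sr, wav.shape)  # expected: 24000 and a mono waveform
+ ```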
127
+
128
+ ### Usage with parameters
129
+
130
+ ```bash
131
+ python test_qwen3-tts-streaming_onnx.py \
132
+ --talker_model_path qwen3-tts_onnx/talker_model.onnx \
133
+ --talker_local_model_path qwen3-tts_onnx/talker_local_model.onnx \
134
+ --codec_decoder_model_path qwen3-tts_onnx/codec_decoder_model.onnx \
135
+ --speaker_encoder_model_path qwen3-tts_onnx/speaker_encoder_model.onnx \
136
+ --talker_codec_embed_model_path qwen3-tts_onnx/talker_codec_embed_model.onnx \
137
+ --text_embed_proj_model_path qwen3-tts_onnx/text_embed_proj_model.onnx \
138
+ --model_config_path configs/config.json \
139
+ --codec_config_path configs/speech_tokenizer_config.json \
140
+ --backbone_config_path configs/config_backbone.json \
141
+ --preprocessor_config_dir configs/ \
142
+ --temperature 0.85 \
143
+ --top_p 0.8 \
144
+ --top_k 50 \
145
+ --repetition_penalty 1.9 \
146
+ --repetition_window 50 \
147
+ --num_threads 4 \
148
+ --chunk_frames 4 \
149
+ --prompt_wav audio_ref/speaker.[wav|flac|mp3] \
150
+ --out_wav output.wav \
151
+ --text "Text to be synthesized" \
152
+ --language "english"
153
+ ```
154
+
155
+ ### Available Languages
156
+ ```
157
+ "chinese", "english", "german", "italian", "portuguese",
158
+ "spanish", "japanese", "korean", "french", "russian"
159
+ ```
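+
+ Each language name is mapped internally to a codec language token id, defined under `codec_language_id` in `configs/config.json`; the mapping can be inspected directly:
+
+ ```python
+ import json
+
+ with open("configs/config.json") as f:
+     cfg = json.load(f)
+
+ # e.g. {'chinese': 2055, 'english': 2050, ..., 'russian': 2069}
+ print(cfg["talker_config"]["codec_language_id"])
+ ```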
160
+
161
+ ### Programmatic Usage
162
+
163
+ ```python
164
+ from src.inference import Qwen3TTSInferencerONNX
165
+
166
+ # Create inferencer
167
+ inferencer = Qwen3TTSInferencerONNX(
168
+     talker_llm, talker_local, codec_decoder,
169
+     speaker_encoder, talker_codec_embed, text_embed_proj,
170
+     preprocessor_config_dir, model_config, codec_config,
171
+     audio_ref_path, language,
172
+ )
173
+ inferencer.reset_turn(reset_cache=True)
174
+
175
+ # Stream text and collect audio
176
+ for delta in your_llm_stream():
177
+     audio_frames = inferencer.push_text(delta)
178
+     ...
179
+     for audio_tokens in audio_frames:
180
+         ...
181
+         inferencer.push_tokens(audio_tokens)
182
+         for wav in inferencer.audio_chunks():
183
+             ...
184
+             yield wav
185
+ ```
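+
+ Each `wav` yielded above is assumed to be a NumPy chunk of 24 kHz audio; a minimal way to collect a full utterance and write it to disk (`stream_tts()` is a hypothetical generator wrapping the loop above):
+
+ ```python
+ import numpy as np
+ import soundfile as sf
+
+ chunks = list(stream_tts())  # assumes each chunk is a 1-D float array at 24 kHz
+ sf.write("audio_synth/output.wav", np.concatenate(chunks), 24000)
+ ```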
186
+
187
+ ### Command-Line Arguments
188
+
189
+ | Argument | Type | Default | Description |
190
+ |----------|------|---------|-------------|
191
+ | `--talker_model_path` | str | "qwen3-tts_onnx/talker_model.onnx" | Path to talker LLM model |
192
+ | `--talker_local_model_path` | str | "qwen3-tts_onnx/talker_local_model.onnx" | Path to local talker transformer model |
193
+ | `--codec_decoder_model_path` | str | "qwen3-tts_onnx/codec_decoder_model.onnx" | Path to codec decoder model |
194
+ | `--speaker_encoder_model_path` | str | "qwen3-tts_onnx/speaker_encoder_model.onnx" | Path to speaker encoder model |
195
+ | `--talker_codec_embed_model_path` | str | "qwen3-tts_onnx/talker_codec_embed_model.onnx" | Path to talker codec embedding |
196
+ | `--text_embed_proj_model_path` | str | "qwen3-tts_onnx/text_embed_proj_model.onnx" | Path to text embedding and projection |
197
+ | `--preprocessor_config_dir` | str | "configs/" | Directory containing the Qwen3 text tokenizer configuration files |
198
+ | `--model_config_path` | str | "configs/config.json" | Path to the original Qwen3-TTS-12Hz-0.6B-Base model configuration file |
199
+ | `--codec_config_path` | str | "configs/speech_tokenizer_config.json" | Path to the original codec configuration file of Qwen3-TTS-12Hz-0.6B-Base |
200
+ | `--temperature` | float | `0.85` | Sampling temperature |
201
+ | `--top_p` | float | `0.8` | Nucleus sampling threshold |
202
+ | `--top_k` | int | `50` | Top-k sampling cutoff |
203
+ | `--repetition_penalty` | float | `1.9` | Repetition penalty coefficient |
204
+ | `--repetition_window` | int | `50` | Window for repetition penalty |
205
+ | `--delta_chunk_chars` | int | `1` | Characters per simulated LLM delta |
206
+ | `--delta_delay_s` | float | `0.0` | Delay between simulated deltas (seconds) |
207
+ | `--num_threads` | int | `4` | Number of threads set as `intra_op_num_threads` in the ONNX Runtime session options |
208
+ | `--chunk_frames` | int | `4` | Number of frames passed to the codec decoder per forward call (the default of 4 frames is 0.32 s at the 12.5 Hz token rate) |
209
+ | `--prompt_wav` | str | "audio_ref/female_shadowheart.flac" | Reference speaker audio for voice cloning |
210
+ | `--out_wav` | str | `out_streaming.wav` | Output WAV file path |
211
+ | `--text` | str | *(Russian text)* | Text to synthesize |
212
+ | `--language` | str | "russian" | Language of the text to synthesize |
213
+
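+ The sampling flags above control how each codec token is drawn from the talker logits. The following is a simplified NumPy sketch of temperature, repetition-penalty, top-k, and top-p sampling; it is illustrative only and not the exact routine used in `qwen3_tts_inferencer_onnx.py`:
+
+ ```python
+ import numpy as np
+
+ def sample_codec_token(logits, recent, temperature=0.85, top_p=0.8, top_k=50,
+                        rep_penalty=1.9, rep_window=50):
+     """Sample one codec token id from a 1-D array of raw logits."""
+     logits = logits.astype(np.float64).copy()
+     # Penalize tokens generated within the last `rep_window` steps.
+     for tok in set(recent[-rep_window:]):
+         logits[tok] = logits[tok] / rep_penalty if logits[tok] > 0 else logits[tok] * rep_penalty
+     logits /= max(temperature, 1e-5)
+     # Top-k: keep only the k highest-scoring candidates.
+     cand = np.argsort(logits)[::-1][:top_k]
+     probs = np.exp(logits[cand] - logits[cand].max())
+     probs /= probs.sum()
+     # Top-p (nucleus): drop the sorted tail once cumulative mass exceeds top_p.
+     keep = np.cumsum(probs) <= top_p
+     keep[0] = True  # always keep the most likely candidate
+     probs = probs[keep] / probs[keep].sum()
+     return int(np.random.choice(cand[keep], p=probs))
+ ```
+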
214
+ #### By: [Patrick Lumbantobing](https://www.linkedin.com/in/patrick-lumban-tobing)
215
+
216
+ #### Copyright © [VertoX-AI](https://www.linkedin.com/company/vertoxai/)
217
+
218
+ ### Citation
219
+
220
+ If you use this system in your research, please cite:
221
+
222
+ ```bibtex
223
+ @misc{vertoxai2026streamingspeechtranslation,
224
+ title={Qwen3-TTS-Streaming-ONNX — VertoX-AI},
225
+ author={Tobing, P. L. and {VertoX-AI}},
226
+ year={2026},
227
+ publisher={HuggingFace},
228
+ }
229
+ ```
230
+
231
+ ## License
232
+
233
+ This project is licensed under Apache-2.0, the same license as the original Qwen3-TTS model.
234
+
235
+ ```
236
+ Created by: Patrick Lumbantobing, VertoX-AI
237
+ Copyright (c) 2026 VertoX-AI. All rights reserved.
238
+
239
+ This work is licensed under the Apache License, Version 2.0.
240
+ To view a copy of this license, visit https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
241
+ ```
242
+
243
+ ---
244
+
245
+ ## Acknowledgements
246
+
247
+ - [Qwen3-TTS-12Hz-0.6B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base) for the original Qwen3-TTS model.
248
+ - [Qwen3-TTS Technical Report](https://arxiv.org/abs/2601.15621) (Hu et al., 2026).
249
+ - [ONNX Runtime](https://onnxruntime.ai/) for high-performance cross-platform inference.
audio_ref/david-attenborough.mp3 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:582f8e1e3f3792d7b495e159c29a55bb95c4c46e90725c62807b4b12bf341603
3
+ size 322923
audio_ref/female_shadowheart.flac ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f85324f52b1f0cfb43b12123e8718d2841ed4a6f46f60c533fc89a7a4fb26ec9
3
+ size 1559549
audio_ref/male_old_movie.flac ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9e342504ccff3d8808ef194e0aca31e2d1d5e96bbf45a188fec9de9a8b756b7c
3
+ size 303882
audio_ref/male_petergriffin.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f0a8b708aee90c7dde4eed747ca0b453456b742650699c26fa6ee4e98c8cee0e
3
+ size 486882
audio_ref/male_stewie.mp3 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d4eb86807929133d186bd951143121a915726f636101b1860a589d06c7a95ab6
3
+ size 395191
audio_ref/rick-sanchez.mp3 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4bd9c6ffb765fda23297fae21725bc174a3092d9687c3606f11d00ae0df9fc1e
3
+ size 107943
audio_synth/output_1775946408.2838778.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:756e45aea6414dd5a76917f6507152af65d6d1574dcc6c70806feaec24c88e25
3
+ size 234284
configs/config.json ADDED
@@ -0,0 +1,167 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen3TTSForConditionalGeneration"
4
+ ],
5
+ "assistant_token_id": 77091,
6
+ "im_end_token_id": 151645,
7
+ "im_start_token_id": 151644,
8
+ "tts_bos_token_id": 151672,
9
+ "tts_eos_token_id": 151673,
10
+ "tts_pad_token_id": 151671,
11
+ "model_type": "qwen3_tts",
12
+ "tokenizer_type": "qwen3_tts_tokenizer_12hz",
13
+ "tts_model_size": "0b6",
14
+ "tts_model_type": "base",
15
+ "speaker_encoder_config": {
16
+ "enc_dim": 1024,
17
+ "sample_rate": 24000
18
+ },
19
+ "talker_config": {
20
+ "attention_bias": false,
21
+ "attention_dropout": 0,
22
+ "code_predictor_config": {
23
+ "_name_or_path": "",
24
+ "add_cross_attention": false,
25
+ "architectures": null,
26
+ "attention_bias": false,
27
+ "attention_dropout": 0,
28
+ "bad_words_ids": null,
29
+ "begin_suppress_tokens": null,
30
+ "bos_token_id": null,
31
+ "chunk_size_feed_forward": 0,
32
+ "cross_attention_hidden_size": null,
33
+ "decoder_start_token_id": null,
34
+ "diversity_penalty": 0.0,
35
+ "do_sample": false,
36
+ "early_stopping": false,
37
+ "encoder_no_repeat_ngram_size": 0,
38
+ "eos_token_id": null,
39
+ "exponential_decay_length_penalty": null,
40
+ "finetuning_task": null,
41
+ "forced_bos_token_id": null,
42
+ "forced_eos_token_id": null,
43
+ "head_dim": 128,
44
+ "hidden_act": "silu",
45
+ "hidden_size": 1024,
46
+ "id2label": {
47
+ "0": "LABEL_0",
48
+ "1": "LABEL_1"
49
+ },
50
+ "initializer_range": 0.02,
51
+ "intermediate_size": 3072,
52
+ "is_decoder": false,
53
+ "is_encoder_decoder": false,
54
+ "label2id": {
55
+ "LABEL_0": 0,
56
+ "LABEL_1": 1
57
+ },
58
+ "layer_types": [
59
+ "full_attention",
60
+ "full_attention",
61
+ "full_attention",
62
+ "full_attention",
63
+ "full_attention"
64
+ ],
65
+ "length_penalty": 1.0,
66
+ "max_length": 20,
67
+ "max_position_embeddings": 65536,
68
+ "max_window_layers": 28,
69
+ "min_length": 0,
70
+ "model_type": "qwen3_tts_talker_code_predictor",
71
+ "no_repeat_ngram_size": 0,
72
+ "num_attention_heads": 16,
73
+ "num_beam_groups": 1,
74
+ "num_beams": 1,
75
+ "num_code_groups": 16,
76
+ "num_hidden_layers": 5,
77
+ "num_key_value_heads": 8,
78
+ "num_return_sequences": 1,
79
+ "output_attentions": false,
80
+ "output_hidden_states": false,
81
+ "output_scores": false,
82
+ "pad_token_id": null,
83
+ "prefix": null,
84
+ "problem_type": null,
85
+ "pruned_heads": {},
86
+ "remove_invalid_values": false,
87
+ "repetition_penalty": 1.0,
88
+ "return_dict": true,
89
+ "return_dict_in_generate": false,
90
+ "rms_norm_eps": 1e-06,
91
+ "rope_scaling": null,
92
+ "rope_theta": 1000000,
93
+ "sep_token_id": null,
94
+ "sliding_window": null,
95
+ "suppress_tokens": null,
96
+ "task_specific_params": null,
97
+ "temperature": 1.0,
98
+ "tf_legacy_loss": false,
99
+ "tie_encoder_decoder": false,
100
+ "tie_word_embeddings": false,
101
+ "tokenizer_class": null,
102
+ "top_k": 50,
103
+ "top_p": 1.0,
104
+ "dtype": null,
105
+ "torchscript": false,
106
+ "typical_p": 1.0,
107
+ "use_bfloat16": false,
108
+ "use_cache": true,
109
+ "use_sliding_window": false,
110
+ "vocab_size": 2048
111
+ },
112
+ "codec_bos_id": 2149,
113
+ "codec_eos_token_id": 2150,
114
+ "codec_think_id": 2154,
115
+ "codec_language_id": {
116
+ "chinese": 2055,
117
+ "english": 2050,
118
+ "german": 2053,
119
+ "italian": 2070,
120
+ "portuguese": 2071,
121
+ "spanish": 2054,
122
+ "japanese": 2058,
123
+ "korean": 2064,
124
+ "french": 2061,
125
+ "russian": 2069
126
+ },
127
+ "codec_nothink_id": 2155,
128
+ "codec_pad_id": 2148,
129
+ "codec_think_bos_id": 2156,
130
+ "codec_think_eos_id": 2157,
131
+ "spk_id": {
132
+ },
133
+ "spk_is_dialect": {
134
+ },
135
+ "head_dim": 128,
136
+ "hidden_act": "silu",
137
+ "hidden_size": 1024,
138
+ "initializer_range": 0.02,
139
+ "intermediate_size": 3072,
140
+ "max_position_embeddings": 32768,
141
+ "model_type": "qwen3_tts_talker",
142
+ "num_attention_heads": 16,
143
+ "num_code_groups": 16,
144
+ "num_hidden_layers": 28,
145
+ "num_key_value_heads": 8,
146
+ "position_id_per_seconds": 13,
147
+ "rms_norm_eps": 1e-06,
148
+ "rope_scaling": {
149
+ "interleaved": true,
150
+ "mrope_section": [
151
+ 24,
152
+ 20,
153
+ 20
154
+ ],
155
+ "rope_type": "default",
156
+ "type": "default"
157
+ },
158
+ "rope_theta": 1000000,
159
+ "sliding_window": null,
160
+ "text_hidden_size": 2048,
161
+ "text_vocab_size": 151936,
162
+ "use_cache": true,
163
+ "use_sliding_window": false,
164
+ "vocab_size": 3072
165
+ },
166
+ "transformers_version": "4.57.3"
167
+ }
configs/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
configs/preprocessor_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "padding_side": "left",
3
+ "padding_value": 0.0,
4
+ "processor_class": "Qwen3TTSProcessor",
5
+ "return_attention_mask": true
6
+ }
configs/speech_tokenizer_config.json ADDED
@@ -0,0 +1,94 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen3TTSTokenizerV2Model"
4
+ ],
5
+ "model_type": "qwen3_tts_tokenizer_12hz",
6
+ "encoder_valid_num_quantizers": 16,
7
+ "input_sample_rate": 24000,
8
+ "output_sample_rate": 24000,
9
+ "decode_upsample_rate": 1920,
10
+ "encode_downsample_rate": 1920,
11
+ "decoder_config": {
12
+ "attention_bias": false,
13
+ "attention_dropout": 0.0,
14
+ "latent_dim": 1024,
15
+ "codebook_dim": 512,
16
+ "codebook_size": 2048,
17
+ "decoder_dim": 1536,
18
+ "hidden_act": "silu",
19
+ "hidden_size": 512,
20
+ "intermediate_size": 1024,
21
+ "layer_scale_initial_scale": 0.01,
22
+ "max_position_embeddings": 8000,
23
+ "head_dim": 64,
24
+ "num_attention_heads": 16,
25
+ "num_hidden_layers": 8,
26
+ "num_key_value_heads": 16,
27
+ "num_quantizers": 16,
28
+ "num_semantic_quantizers": 1,
29
+ "rms_norm_eps": 1e-05,
30
+ "rope_theta": 10000,
31
+ "semantic_codebook_size": 4096,
32
+ "sliding_window": 72,
33
+ "upsample_rates": [
34
+ 8,
35
+ 5,
36
+ 4,
37
+ 3
38
+ ],
39
+ "upsampling_ratios": [
40
+ 2,
41
+ 2
42
+ ],
43
+ "vector_quantization_hidden_dimension": 512
44
+ },
45
+ "encoder_config": {
46
+ "_frame_rate": 12.5,
47
+ "attention_bias": false,
48
+ "attention_dropout": 0.0,
49
+ "audio_channels": 1,
50
+ "codebook_dim": 256,
51
+ "codebook_size": 2048,
52
+ "compress": 2,
53
+ "dilation_growth_rate": 2,
54
+ "dtype": "float32",
55
+ "head_dim": 64,
56
+ "hidden_act": "gelu",
57
+ "hidden_size": 512,
58
+ "initializer_range": 0.02,
59
+ "intermediate_size": 2048,
60
+ "kernel_size": 7,
61
+ "last_kernel_size": 3,
62
+ "layer_scale_initial_scale": 0.01,
63
+ "max_position_embeddings": 8000,
64
+ "norm_eps": 1e-05,
65
+ "normalize": false,
66
+ "num_attention_heads": 8,
67
+ "num_filters": 64,
68
+ "num_hidden_layers": 8,
69
+ "num_key_value_heads": 8,
70
+ "num_quantizers": 32,
71
+ "num_residual_layers": 1,
72
+ "num_semantic_quantizers": 1,
73
+ "pad_mode": "constant",
74
+ "residual_kernel_size": 3,
75
+ "rope_theta": 10000.0,
76
+ "sampling_rate": 24000,
77
+ "sliding_window": 250,
78
+ "transformers_version": "4.57.0.dev0",
79
+ "trim_right_ratio": 1.0,
80
+ "upsample_groups": 512,
81
+ "upsampling_ratios": [
82
+ 8,
83
+ 6,
84
+ 5,
85
+ 4
86
+ ],
87
+ "use_cache": false,
88
+ "use_causal_conv": true,
89
+ "use_conv_shortcut": false,
90
+ "use_streaming": false,
91
+ "vector_quantization_hidden_dimension": 256
92
+ },
93
+ "transformers_version": "4.57.3"
94
+ }
configs/tokenizer_config.json ADDED
@@ -0,0 +1,316 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<tool_response>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "151666": {
190
+ "content": "</tool_response>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": false
196
+ },
197
+ "151667": {
198
+ "content": "<think>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": false
204
+ },
205
+ "151668": {
206
+ "content": "</think>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": false
212
+ },
213
+ "151669": {
214
+ "content": "<|audio_start|>",
215
+ "lstrip": false,
216
+ "normalized": false,
217
+ "rstrip": false,
218
+ "single_word": false,
219
+ "special": true
220
+ },
221
+ "151670": {
222
+ "content": "<|audio_end|>",
223
+ "lstrip": false,
224
+ "normalized": false,
225
+ "rstrip": false,
226
+ "single_word": false,
227
+ "special": true
228
+ },
229
+ "151671": {
230
+ "content": "<tts_pad>",
231
+ "lstrip": false,
232
+ "normalized": false,
233
+ "rstrip": false,
234
+ "single_word": false,
235
+ "special": true
236
+ },
237
+ "151672": {
238
+ "content": "<tts_text_bos>",
239
+ "lstrip": false,
240
+ "normalized": false,
241
+ "rstrip": false,
242
+ "single_word": false,
243
+ "special": true
244
+ },
245
+ "151673": {
246
+ "content": "<tts_text_eod>",
247
+ "lstrip": false,
248
+ "normalized": false,
249
+ "rstrip": false,
250
+ "single_word": false,
251
+ "special": true
252
+ },
253
+ "151674": {
254
+ "content": "<tts_text_bos_single>",
255
+ "lstrip": false,
256
+ "normalized": false,
257
+ "rstrip": false,
258
+ "single_word": false,
259
+ "special": true
260
+ },
261
+ "151675": {
262
+ "content": "<|audio_pad|>",
263
+ "lstrip": false,
264
+ "normalized": false,
265
+ "rstrip": false,
266
+ "single_word": false,
267
+ "special": true
268
+ }
269
+ },
270
+ "additional_special_tokens": [
271
+ "<|im_start|>",
272
+ "<|im_end|>",
273
+ "<|object_ref_start|>",
274
+ "<|object_ref_end|>",
275
+ "<|box_start|>",
276
+ "<|box_end|>",
277
+ "<|quad_start|>",
278
+ "<|quad_end|>",
279
+ "<|vision_start|>",
280
+ "<|vision_end|>",
281
+ "<|vision_pad|>",
282
+ "<|image_pad|>",
283
+ "<|video_pad|>",
284
+ "<|audio_start|>",
285
+ "<|audio_end|>",
286
+ "<tts_pad>",
287
+ "<tts_text_bos>",
288
+ "<tts_text_bos_single>",
289
+ "<|audio_pad|>"
290
+ ],
291
+ "extra_special_tokens": {
292
+ "image_token": "<|image_pad|>",
293
+ "audio_token": "<|audio_pad|>",
294
+ "video_token": "<|video_pad|>",
295
+ "vision_bos_token": "<|vision_start|>",
296
+ "vision_eos_token": "<|vision_end|>",
297
+ "audio_bos_token": "<|audio_start|>",
298
+ "audio_eos_token": "<|audio_end|>"
299
+ },
300
+ "bos_token": null,
301
+ "clean_up_tokenization_spaces": false,
302
+ "eos_token": "<|im_end|>",
303
+ "errors": "replace",
304
+ "model_max_length": 131072,
305
+ "pad_token": "<|endoftext|>",
306
+ "split_special_tokens": false,
307
+ "tokenizer_class": "Qwen2Tokenizer",
308
+ "unk_token": null,
309
+ "image_token": "<|image_pad|>",
310
+ "audio_token": "<|audio_pad|>",
311
+ "video_token": "<|video_pad|>",
312
+ "vision_bos_token": "<|vision_start|>",
313
+ "vision_eos_token": "<|vision_end|>",
314
+ "audio_bos_token": "<|audio_start|>",
315
+ "audio_eos_token": "<|audio_end|>"
316
+ }
configs/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
logs/log_test-streaming-onnx-1.txt ADDED
The diff for this file is too large to render. See raw diff
 
qwen3-tts_onnx/codec_decoder_model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ee45ad90b2eb510038fc070127bd91f5b6fc6f0eb46166ecbb5a3d810a5e4527
3
+ size 460939919
qwen3-tts_onnx/speaker_encoder_model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f19706d97a2196652ae68dac6930400a8c5e287be1d37aede58fb35484e25506
3
+ size 35628286
qwen3-tts_onnx/talker_codec_embed_model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1b77e5e356e334771257fa18677ea9e731f02c654dc1d0cc46cd6da60166829b
3
+ size 12583165
qwen3-tts_onnx/talker_local_model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:38d8d4799ab48759400a55f046a8c9398dfaedf1d1eeabd1dcf38e27a544c368
3
+ size 561644701
qwen3-tts_onnx/talker_model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ce63eadcd5a9dd103c9b38341ac391bf97e4b870aff3972d6fa53d677589d305
3
+ size 1793942592
qwen3-tts_onnx/text_embed_proj_model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8aec6c7a66b85b06405974d3f20d7daa872326e2b3b345770452b13e114e0aca
3
+ size 1269839153
requirements.txt ADDED
@@ -0,0 +1,6 @@
1
+ librosa
2
+ numpy
3
+ onnxruntime
4
+ python-box
5
+ soundfile
6
+ transformers==4.57.3
src/core/__init__.py ADDED
@@ -0,0 +1,17 @@
1
+ # coding=utf-8
2
+ # Copyright 2026 The Alibaba Qwen team.
3
+ # SPDX-License-Identifier: Apache-2.0
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ from .configuration_qwen3_tts import Qwen3TTSConfig
17
+ from .processing_qwen3_tts import Qwen3TTSProcessor
src/core/configuration_qwen3_tts.py ADDED
@@ -0,0 +1,506 @@
1
+ # coding=utf-8
2
+ # Copyright 2026 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ from transformers.configuration_utils import (PretrainedConfig,
16
+ layer_type_validation)
17
+ from transformers.modeling_rope_utils import rope_config_validation
18
+ from transformers.utils import logging
19
+
20
+ logger = logging.get_logger(__name__)
21
+
22
+
23
+ class Qwen3TTSSpeakerEncoderConfig(PretrainedConfig):
24
+ r"""
25
+ This is the configuration class to store the configuration of a [`Qwen3TTSSpeakerEncoder`].
26
+ It is used to instantiate a Qwen3TTS speaker encoder model according to the specified arguments, defining the model
27
+ architecture. The architecture is based on the ECAPA-TDNN model.
28
+
29
+ Args:
30
+ mel_dim (`int`, *optional*, defaults to 128):
31
+ The dimension of the input mel-spectrogram.
32
+ enc_dim (`int`, *optional*, defaults to 192):
33
+ The dimension of the final speaker embedding.
34
+ enc_channels (`list[int]`, *optional*, defaults to `[512, 512, 512, 512, 1536]`):
35
+ A list of output channels for each TDNN/SERes2Net layer in the encoder. The first channel size is for the initial TDNN layer,
36
+ the intermediate ones for the `SqueezeExcitationRes2NetBlock` layers, and the last one for the multi-layer feature aggregation.
37
+ enc_kernel_sizes (`list[int]`, *optional*, defaults to `[5, 3, 3, 3, 1]`):
38
+ A list of kernel sizes for each layer in the encoder, corresponding to `enc_channels`.
39
+ enc_dilations (`list[int]`, *optional*, defaults to `[1, 2, 3, 4, 1]`):
40
+ A list of dilations for each layer in the encoder, corresponding to `enc_channels`.
41
+ enc_attention_channels (`int`, *optional*, defaults to 128):
42
+ The number of attention channels in the `AttentiveStatisticsPooling` layer.
43
+ enc_res2net_scale (`int`, *optional*,defaults to 8):
44
+ The scale of the `Res2NetBlock` in the encoder.
45
+ enc_se_channels (`int`, *optional*, defaults to 128):
46
+ The number of channels in the squeeze part of the `SqueezeExcitationBlock`.
47
+ """
48
+
49
+ def __init__(
50
+ self,
51
+ mel_dim=128,
52
+ enc_dim=1024,
53
+ enc_channels=[512, 512, 512, 512, 1536],
54
+ enc_kernel_sizes=[5, 3, 3, 3, 1],
55
+ enc_dilations=[1, 2, 3, 4, 1],
56
+ enc_attention_channels=128,
57
+ enc_res2net_scale=8,
58
+ enc_se_channels=128,
59
+ sample_rate=24000,
60
+ ):
61
+ self.mel_dim = mel_dim
62
+ self.enc_dim = enc_dim
63
+ self.enc_channels = enc_channels
64
+ self.enc_kernel_sizes = enc_kernel_sizes
65
+ self.enc_dilations = enc_dilations
66
+ self.enc_attention_channels = enc_attention_channels
67
+ self.enc_res2net_scale = enc_res2net_scale
68
+ self.enc_se_channels = enc_se_channels
69
+ self.sample_rate = sample_rate
70
+
71
+
72
+ class Qwen3TTSTalkerCodePredictorConfig(PretrainedConfig):
73
+ r"""
74
+ This is the configuration class to store the configuration of a [`Qwen3TTSTalkerCodePredictorModel`]. It is used to instantiate a
75
+ Qwen3TTSTalkerCodePredictor model according to the specified arguments, defining the model architecture.
76
+
77
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
78
+ documentation from [`PretrainedConfig`] for more information.
79
+
80
+
81
+ Args:
82
+ vocab_size (`int`, *optional*, defaults to 151936):
83
+ Vocabulary size of the Qwen3TTSTalkerCodePredictor model. Defines the number of different tokens that can be represented by the
84
+ `inputs_ids` passed when calling [`Qwen3TTSTalkerCodePredictorModel`]
85
+ hidden_size (`int`, *optional*, defaults to 4096):
86
+ Dimension of the hidden representations.
87
+ intermediate_size (`int`, *optional*, defaults to 22016):
88
+ Dimension of the MLP representations.
89
+ num_hidden_layers (`int`, *optional*, defaults to 32):
90
+ Number of hidden layers in the Transformer encoder.
91
+ num_attention_heads (`int`, *optional*, defaults to 32):
92
+ Number of attention heads for each attention layer in the Transformer encoder.
93
+ num_key_value_heads (`int`, *optional*, defaults to 32):
94
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
95
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
96
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
97
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
98
+ by meanpooling all the original heads within that group. For more details, check out [this
99
+ paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to `32`.
100
+ head_dim (`int`, *optional*, defaults to 128):
101
+ The attention head dimension.
102
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
103
+ The non-linear activation function (function or string) in the decoder.
104
+ max_position_embeddings (`int`, *optional*, defaults to 32768):
105
+ The maximum sequence length that this model might ever be used with.
106
+ initializer_range (`float`, *optional*, defaults to 0.02):
107
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
108
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
109
+ The epsilon used by the rms normalization layers.
110
+ use_cache (`bool`, *optional*, defaults to `True`):
111
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
112
+ relevant if `config.is_decoder=True`.
113
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
114
+ Whether the model's input and output word embeddings should be tied.
115
+ rope_theta (`float`, *optional*, defaults to 10000.0):
116
+ The base period of the RoPE embeddings.
117
+ rope_scaling (`Dict`, *optional*):
118
+ Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
119
+ and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
120
+ accordingly.
121
+ Expected contents:
122
+ `rope_type` (`str`):
123
+ The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
124
+ 'llama3'], with 'default' being the original RoPE implementation.
125
+ `factor` (`float`, *optional*):
126
+ Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
127
+ most scaling types, a `factor` of x will enable the model to handle sequences of length x *
128
+ original maximum pre-trained length.
129
+ `original_max_position_embeddings` (`int`, *optional*):
130
+ Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
131
+ pretraining.
132
+ `attention_factor` (`float`, *optional*):
133
+ Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
134
+ computation. If unspecified, it defaults to value recommended by the implementation, using the
135
+ `factor` field to infer the suggested value.
136
+ `beta_fast` (`float`, *optional*):
137
+ Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
138
+ ramp function. If unspecified, it defaults to 32.
139
+ `beta_slow` (`float`, *optional*):
140
+ Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
141
+ ramp function. If unspecified, it defaults to 1.
142
+ `short_factor` (`list[float]`, *optional*):
143
+ Only used with 'longrope'. The scaling factor to be applied to short contexts (<
144
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
145
+ size divided by the number of attention heads divided by 2
146
+ `long_factor` (`list[float]`, *optional*):
147
+ Only used with 'longrope'. The scaling factor to be applied to long contexts (<
148
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
149
+ size divided by the number of attention heads divided by 2
150
+ `low_freq_factor` (`float`, *optional*):
151
+ Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
152
+ `high_freq_factor` (`float`, *optional*):
153
+ Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
154
+ attention_bias (`bool`, *optional*, defaults to `False`):
155
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
156
+ use_sliding_window (`bool`, *optional*, defaults to `False`):
157
+ Whether to use sliding window attention.
158
+ sliding_window (`int`, *optional*, defaults to 4096):
159
+ Sliding window attention (SWA) window size. If not specified, will default to `4096`.
160
+ max_window_layers (`int`, *optional*, defaults to 28):
161
+ The number of layers using full attention. The first `max_window_layers` layers will use full attention, while any
162
+ additional layer afterwards will use SWA (Sliding Window Attention).
163
+ layer_types (`list`, *optional*):
164
+ Attention pattern for each layer.
165
+ attention_dropout (`float`, *optional*, defaults to 0.0):
166
+ The dropout ratio for the attention probabilities.
167
+
168
+ """
169
+
170
+ model_type = "qwen3_tts_talker_code_predictor"
171
+ keys_to_ignore_at_inference = ["past_key_values"]
172
+
173
+ # Default tensor parallel plan for base model `Qwen3TTSTalkerCodePredictor`
174
+ base_model_tp_plan = {
175
+ "layers.*.self_attn.q_proj": "colwise",
176
+ "layers.*.self_attn.k_proj": "colwise",
177
+ "layers.*.self_attn.v_proj": "colwise",
178
+ "layers.*.self_attn.o_proj": "rowwise",
179
+ "layers.*.mlp.gate_proj": "colwise",
180
+ "layers.*.mlp.up_proj": "colwise",
181
+ "layers.*.mlp.down_proj": "rowwise",
182
+ }
183
+ base_model_pp_plan = {
184
+ "embed_tokens": (["input_ids"], ["inputs_embeds"]),
185
+ "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
186
+ "norm": (["hidden_states"], ["hidden_states"]),
187
+ }
188
+
189
+ def __init__(
190
+ self,
191
+ vocab_size=2048,
192
+ hidden_size=1024,
193
+ intermediate_size=3072,
194
+ num_hidden_layers=5,
195
+ num_attention_heads=16,
196
+ num_key_value_heads=8,
197
+ head_dim=128,
198
+ hidden_act="silu",
199
+ max_position_embeddings=32768,
200
+ initializer_range=0.02,
201
+ rms_norm_eps=0.000001,
202
+ use_cache=True,
203
+ tie_word_embeddings=False,
204
+ rope_theta=10000,
205
+ rope_scaling=None,
206
+ attention_bias=False,
207
+ use_sliding_window=False,
208
+ sliding_window=4096,
209
+ max_window_layers=28,
210
+ layer_types=None,
211
+ attention_dropout=0,
212
+ num_code_groups=32,
213
+ **kwargs,
214
+ ):
215
+ super().__init__(
216
+ tie_word_embeddings=tie_word_embeddings,
217
+ **kwargs,
218
+ )
219
+ self.vocab_size = vocab_size
220
+ self.max_position_embeddings = max_position_embeddings
221
+ self.hidden_size = hidden_size
222
+ self.intermediate_size = intermediate_size
223
+ self.num_hidden_layers = num_hidden_layers
224
+ self.num_attention_heads = num_attention_heads
225
+ self.use_sliding_window = use_sliding_window
226
+ self.sliding_window = sliding_window if self.use_sliding_window else None
227
+ self.max_window_layers = max_window_layers
228
+
229
+ # for backward compatibility
230
+ if num_key_value_heads is None:
231
+ num_key_value_heads = num_attention_heads
232
+
233
+ self.num_key_value_heads = num_key_value_heads
234
+ self.head_dim = head_dim
235
+ self.hidden_act = hidden_act
236
+ self.initializer_range = initializer_range
237
+ self.rms_norm_eps = rms_norm_eps
238
+ self.use_cache = use_cache
239
+ self.rope_theta = rope_theta
240
+ self.rope_scaling = rope_scaling
241
+ self.attention_bias = attention_bias
242
+ self.attention_dropout = attention_dropout
243
+ # Validate the correctness of rotary position embeddings parameters
244
+ # BC: if there is a 'type' field, move it to 'rope_type'.
245
+ if self.rope_scaling is not None and "type" in self.rope_scaling:
246
+ self.rope_scaling["rope_type"] = self.rope_scaling["type"]
247
+ rope_config_validation(self)
248
+
249
+ self.layer_types = layer_types
250
+ if self.layer_types is None:
251
+ self.layer_types = [
252
+ (
253
+ "sliding_attention"
254
+ if self.sliding_window is not None and i >= self.max_window_layers
255
+ else "full_attention"
256
+ )
257
+ for i in range(self.num_hidden_layers)
258
+ ]
259
+ layer_type_validation(self.layer_types)
260
+ self.num_code_groups = num_code_groups
261
+
262
+
263
+ class Qwen3TTSTalkerConfig(PretrainedConfig):
264
+ r"""
265
+ This is the configuration class to store the configuration of a [`Qwen3TTSTalkerModel`]. It is used to instantiate a
266
+ Qwen3TTSTalker model according to the specified arguments, defining the model architecture.
267
+
268
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
269
+ documentation from [`PretrainedConfig`] for more information.
270
+
271
+
272
+ Args:
273
+ vocab_size (`int`, *optional*, defaults to 151936):
274
+ Vocabulary size of the Qwen3TTSTalker model. Defines the number of different tokens that can be represented by the
275
+ `inputs_ids` passed when calling [`Qwen3TTSTalkerModel`]
276
+ hidden_size (`int`, *optional*, defaults to 2048):
277
+ Dimension of the hidden representations.
278
+ intermediate_size (`int`, *optional*, defaults to 6144):
279
+ Dimension of the MLP representations.
280
+ num_hidden_layers (`int`, *optional*, defaults to 24):
281
+ Number of hidden layers in the Transformer encoder.
282
+ num_attention_heads (`int`, *optional*, defaults to 32):
283
+ Number of attention heads for each attention layer in the Transformer encoder.
284
+ num_key_value_heads (`int`, *optional*, defaults to 4):
285
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
286
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
287
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
288
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
289
+ by meanpooling all the original heads within that group. For more details, check out [this
290
+ paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to `32`.
291
+
292
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
293
+ The non-linear activation function (function or string) in the decoder.
294
+ max_position_embeddings (`int`, *optional*, defaults to 32768):
295
+ The maximum sequence length that this model might ever be used with.
296
+ initializer_range (`float`, *optional*, defaults to 0.02):
297
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
298
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
299
+ The epsilon used by the rms normalization layers.
300
+ use_cache (`bool`, *optional*, defaults to `True`):
301
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
302
+ relevant if `config.is_decoder=True`.
303
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
304
+ Whether the model's input and output word embeddings should be tied.
305
+ rope_theta (`float`, *optional*, defaults to 10000.0):
306
+ The base period of the RoPE embeddings.
307
+ rope_scaling (`Dict`, *optional*):
308
+ Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
309
+ and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
310
+ accordingly.
311
+ Expected contents:
312
+ `rope_type` (`str`):
313
+ The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
314
+ 'llama3'], with 'default' being the original RoPE implementation.
315
+ `factor` (`float`, *optional*):
316
+ Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
317
+ most scaling types, a `factor` of x will enable the model to handle sequences of length x *
318
+ original maximum pre-trained length.
319
+ `original_max_position_embeddings` (`int`, *optional*):
320
+ Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
321
+ pretraining.
322
+ `attention_factor` (`float`, *optional*):
323
+ Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
324
+ computation. If unspecified, it defaults to value recommended by the implementation, using the
325
+ `factor` field to infer the suggested value.
326
+ `beta_fast` (`float`, *optional*):
327
+ Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
328
+ ramp function. If unspecified, it defaults to 32.
329
+ `beta_slow` (`float`, *optional*):
330
+ Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
331
+ ramp function. If unspecified, it defaults to 1.
332
+ `short_factor` (`list[float]`, *optional*):
333
+ Only used with 'longrope'. The scaling factor to be applied to short contexts (<
334
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
335
+ size divided by the number of attention heads divided by 2
336
+ `long_factor` (`list[float]`, *optional*):
337
+ Only used with 'longrope'. The scaling factor to be applied to long contexts (<
338
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
339
+ size divided by the number of attention heads divided by 2
340
+ `low_freq_factor` (`float`, *optional*):
341
+ Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
342
+ `high_freq_factor` (`float`, *optional*):
343
+ Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
344
+ attention_bias (`bool`, *optional*, defaults to `False`):
345
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
346
+ use_sliding_window (`bool`, *optional*, defaults to `False`):
347
+ Whether to use sliding window attention.
348
+ sliding_window (`int`, *optional*, defaults to 4096):
349
+ Sliding window attention (SWA) window size. If not specified, will default to `4096`.
350
+ attention_dropout (`float`, *optional*, defaults to 0.0):
351
+ The dropout ratio for the attention probabilities.
352
+ """
353
+
354
+ model_type = "qwen3_tts_talker"
355
+ keys_to_ignore_at_inference = ["past_key_values"]
356
+
357
+ # Default tensor parallel plan for base model `Qwen3TTSTalker`
358
+ base_model_tp_plan = {
359
+ "layers.*.self_attn.q_proj": "colwise",
360
+ "layers.*.self_attn.k_proj": "colwise",
361
+ "layers.*.self_attn.v_proj": "colwise",
362
+ "layers.*.self_attn.o_proj": "rowwise",
363
+ "layers.*.mlp.gate_proj": "colwise",
364
+ "layers.*.mlp.up_proj": "colwise",
365
+ "layers.*.mlp.down_proj": "rowwise",
366
+ }
367
+ base_model_pp_plan = {
368
+ "embed_tokens": (["input_ids"], ["inputs_embeds"]),
369
+ "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
370
+ "norm": (["hidden_states"], ["hidden_states"]),
371
+ }
372
+ sub_configs = {"code_predictor_config": Qwen3TTSTalkerCodePredictorConfig}
373
+
374
+ def __init__(
375
+ self,
376
+ code_predictor_config=None,
377
+ vocab_size=3072,
378
+ hidden_size=1024,
379
+ intermediate_size=2048,
380
+ num_hidden_layers=20,
381
+ num_attention_heads=16,
382
+ num_key_value_heads=2,
383
+ hidden_act="silu",
384
+ max_position_embeddings=32768,
385
+ initializer_range=0.02,
386
+ rms_norm_eps=0.000001,
387
+ use_cache=True,
388
+ tie_word_embeddings=False,
389
+ rope_theta=10000,
390
+ rope_scaling=None,
391
+ attention_bias=False,
392
+ use_sliding_window=False,
393
+ sliding_window=4096,
394
+ attention_dropout=0,
395
+ num_code_groups=32,
396
+ text_hidden_size=2048,
397
+ codec_eos_token_id=4198,
398
+ codec_think_id=4202,
399
+ codec_nothink_id=4203,
400
+ codec_think_bos_id=4204,
401
+ codec_think_eos_id=4205,
402
+ codec_pad_id=4196,
403
+ codec_bos_id=4197,
404
+ spk_id=None,
405
+ spk_is_dialect=None,
406
+ codec_language_id=None,
407
+ **kwargs,
408
+ ):
409
+ super().__init__(
410
+ tie_word_embeddings=tie_word_embeddings,
411
+ **kwargs,
412
+ )
413
+ self.vocab_size = vocab_size
414
+ self.max_position_embeddings = max_position_embeddings
415
+ self.hidden_size = hidden_size
416
+ self.intermediate_size = intermediate_size
417
+ self.num_hidden_layers = num_hidden_layers
418
+ self.num_attention_heads = num_attention_heads
419
+ self.use_sliding_window = use_sliding_window
420
+ self.sliding_window = sliding_window if use_sliding_window else None
421
+
422
+ self.num_key_value_heads = num_key_value_heads
423
+ self.hidden_act = hidden_act
424
+ self.initializer_range = initializer_range
425
+ self.rms_norm_eps = rms_norm_eps
426
+ self.use_cache = use_cache
427
+ self.rope_theta = rope_theta
428
+ self.rope_scaling = rope_scaling
429
+ self.attention_bias = attention_bias
430
+ self.attention_dropout = attention_dropout
431
+ # Validate the correctness of rotary position embeddings parameters
432
+ # BC: if there is a 'type' field, move it to 'rope_type'.
433
+ if self.rope_scaling is not None and "type" in self.rope_scaling:
434
+ self.rope_scaling["rope_type"] = self.rope_scaling["type"]
435
+
436
+ if code_predictor_config is None:
437
+ code_predictor_config = {}
438
+ self.code_predictor_config = Qwen3TTSTalkerCodePredictorConfig()
439
+ logger.info("code_predictor_config is None. Initializing code_predictor model with default values")
440
+ elif isinstance(code_predictor_config, Qwen3TTSTalkerCodePredictorConfig):
441
+ self.code_predictor_config = code_predictor_config
442
+ else:
443
+ self.code_predictor_config = Qwen3TTSTalkerCodePredictorConfig(**code_predictor_config)
444
+ self.num_code_groups = num_code_groups
445
+ self.text_hidden_size = text_hidden_size
446
+ self.codec_eos_token_id = codec_eos_token_id
447
+ self.codec_think_id = codec_think_id
448
+ self.codec_language_id = codec_language_id
449
+ self.codec_nothink_id = codec_nothink_id
450
+ self.codec_think_bos_id = codec_think_bos_id
451
+ self.codec_think_eos_id = codec_think_eos_id
452
+ self.codec_pad_id = codec_pad_id
453
+ self.codec_bos_id = codec_bos_id
454
+ self.spk_id = spk_id
455
+ self.spk_is_dialect = spk_is_dialect
456
+
457
+
458
+ class Qwen3TTSConfig(PretrainedConfig):
459
+ """
460
+ This is the configuration class to store the configuration of a [`Qwen3TTSForConditionalGeneration`].
461
+ """
462
+
463
+ model_type = "qwen3_tts"
464
+ sub_configs = {
465
+ "talker_config": Qwen3TTSTalkerConfig,
466
+ "speaker_encoder_config": Qwen3TTSSpeakerEncoderConfig,
467
+ }
468
+
469
+ def __init__(
470
+ self,
471
+ talker_config=None,
472
+ speaker_encoder_config=None,
473
+ tokenizer_type=None,
474
+ tts_model_size=None,
475
+ tts_model_type=None,
476
+ im_start_token_id=151644,
477
+ im_end_token_id=151645,
478
+ tts_pad_token_id=151671,
479
+ tts_bos_token_id=151672,
480
+ tts_eos_token_id=151673,
481
+ **kwargs,
482
+ ):
483
+ super().__init__(**kwargs)
484
+
485
+ if talker_config is None:
486
+ talker_config = {}
487
+ logger.info("talker_config is None. Initializing talker model with default values")
488
+ if speaker_encoder_config is None:
489
+ speaker_encoder_config = {}
490
+ logger.info("speaker_encoder_config is None. Initializing talker model with default values")
491
+
492
+ self.talker_config = Qwen3TTSTalkerConfig(**talker_config)
493
+ self.speaker_encoder_config = Qwen3TTSSpeakerEncoderConfig(**speaker_encoder_config)
494
+
495
+ self.tokenizer_type = tokenizer_type
496
+ self.tts_model_size = tts_model_size
497
+ self.tts_model_type = tts_model_type
498
+
499
+ self.im_start_token_id = im_start_token_id
500
+ self.im_end_token_id = im_end_token_id
501
+ self.tts_pad_token_id = tts_pad_token_id
502
+ self.tts_bos_token_id = tts_bos_token_id
503
+ self.tts_eos_token_id = tts_eos_token_id
504
+
505
+
506
+ __all__ = ["Qwen3TTSConfig", "Qwen3TTSTalkerConfig", "Qwen3TTSSpeakerEncoderConfig"]
src/core/processing_qwen3_tts.py ADDED
@@ -0,0 +1,104 @@
1
+ # coding=utf-8
2
+ # Copyright 2026 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ from transformers.feature_extraction_utils import BatchFeature
16
+ from transformers.processing_utils import ProcessingKwargs, ProcessorMixin
17
+
18
+
19
+ class Qwen3TTSProcessorKwargs(ProcessingKwargs, total=False):
20
+ _defaults = {
21
+ "text_kwargs": {
22
+ "padding": False,
23
+ "padding_side": "left",
24
+ }
25
+ }
26
+
27
+
28
+ class Qwen3TTSProcessor(ProcessorMixin):
29
+ r"""
30
+ Constructs a Qwen3TTS processor.
31
+
32
+ Args:
33
+ tokenizer ([`Qwen2TokenizerFast`], *optional*):
34
+ The text tokenizer.
35
+ chat_template (`Optional[str]`, *optional*):
36
+ The Jinja template to use for formatting the conversation. If not provided, the default chat template is used.
37
+ """
38
+
39
+ attributes = ["tokenizer"]
40
+ tokenizer_class = ("Qwen2Tokenizer", "Qwen2TokenizerFast")
41
+
42
+ def __init__(self, tokenizer=None, chat_template=None):
43
+ super().__init__(tokenizer, chat_template=chat_template)
44
+
45
+ def __call__(self, text=None, **kwargs) -> BatchFeature:
46
+ """
47
+ Main method to prepare for the model one or several sequences(s) and audio(s). This method forwards the `text`
48
+ and `kwargs` arguments to Qwen2TokenizerFast's [`~Qwen2TokenizerFast.__call__`] if `text` is not `None` to encode
49
+ the text.
50
+
51
+ Args:
52
+ text (`str`, `List[str]`, `List[List[str]]`):
53
+ The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
54
+ (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
55
+ `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
56
+ """
57
+
58
+ if text is None:
59
+ raise ValueError("You need to specify either a `text` input to process.")
60
+
61
+ output_kwargs = self._merge_kwargs(
62
+ Qwen3TTSProcessorKwargs,
63
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
64
+ **kwargs,
65
+ )
66
+ if not isinstance(text, list):
67
+ text = [text]
68
+
69
+ print(f"Qwen3TTSProcessor __call__ text {text}")
70
+ print(f"Qwen3TTSProcessor __call__ output_kwargs[text_kwargs] {output_kwargs['text_kwargs']}")
71
+ texts_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
72
+ print(f"Qwen3TTSProcessor __call__ texts_inputs {texts_inputs}")
73
+
74
+ return BatchFeature(
75
+ data={**texts_inputs},
76
+ tensor_type=kwargs.get("return_tensors"),
77
+ )
78
+
79
+ def batch_decode(self, *args, **kwargs):
80
+ """
81
+ This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
82
+ refer to the docstring of this method for more information.
83
+ """
84
+ return self.tokenizer.batch_decode(*args, **kwargs)
85
+
86
+ def decode(self, *args, **kwargs):
87
+ """
88
+ This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
89
+ the docstring of this method for more information.
90
+ """
91
+ return self.tokenizer.decode(*args, **kwargs)
92
+
93
+ def apply_chat_template(self, conversations, chat_template=None, **kwargs):
94
+ if isinstance(conversations[0], dict):
95
+ conversations = [conversations]
96
+ return super().apply_chat_template(conversations, chat_template, **kwargs)
97
+
98
+ @property
99
+ def model_input_names(self):
100
+ tokenizer_input_names = self.tokenizer.model_input_names
101
+ return list(dict.fromkeys(tokenizer_input_names))
102
+
103
+
104
+ __all__ = ["Qwen3TTSProcessor"]
src/inference/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .qwen3_tts_inferencer_onnx import Qwen3TTSInferencerONNX
src/inference/qwen3_tts_inferencer_onnx.py ADDED
@@ -0,0 +1,1112 @@
1
+ # Copyright 2026 Patrick Lumbantobing, Vertox-AI
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ """
16
+ ONNX-based inference engine for Qwen3-TTS-Realtime streaming text-to-speech.
17
+ This module provides a pure NumPy / ONNX Runtime inference pipeline for the
18
+ Qwen3-TTS-Realtime model, enabling streaming TTS without PyTorch dependency
19
+ at runtime. The pipeline follows the same architectural flow as the original
20
+ PyTorch implementation:
21
+ 1. **ECAPA-TDNN speaker encoder** -- encodes a reference waveform into
22
+ a 1024-dim speaker embedding vector for voice cloning.
23
+ 2. **Talker Backbone LLM** -- autoregressively maps interleaved
24
+ text and audio token embeddings to hidden states and a VQ audio semantic
25
+ token via a causal language model with KV-cache. Sampling is also
26
+ performed within the ONNX graph; sampling parameters can be adjusted.
27
+ 3. **Local Talker Transformer** -- depth-wise decoder that converts each
28
+ talker hidden state and VQ into a 15-codebook audio frame using its
29
+ own ephemeral KV-cache (reset per frame). Sampling is likewise performed
30
+ within the ONNX graph, and sampling parameters can be adjusted.
31
+ 4. **Codec decoder** -- converts batches of RVQ audio codes (1+15=16 codes
32
+ per frame) back into 24 kHz waveform samples with streaming KV-cache
33
+ and convolution cache support.
34
+ Typical usage::
35
+ inferencer = Qwen3TTSInferencerONNX(
36
+ talker_llm, talker_local, codec_decoder,
37
+ speaker_encoder, talker_codec_embed, text_embed_proj,
38
+ preprocessor_config_dir, model_config, codec_config,
39
+ audio_ref_path, language,
40
+ )
41
+ inferencer.reset_turn(reset_cache=True)
42
+ for delta in llm_stream:
43
+ audio_frames = inferencer.push_text(delta)
44
+ ...
45
+ for audio_tokens in audio_frames:
46
+ ...
47
+ inferencer.push_tokens(audio_tokens)
48
+ for wav in inferencer.audio_chunks():
49
+ ...
50
+ yield wav
51
+ """
52
+
53
+ import base64
54
+ import io
55
+ import json
56
+ import logging
57
+ import re
58
+ import urllib.request
59
+ from typing import Iterable, List, Optional, Tuple, Union
60
+ from urllib.parse import urlparse
61
+
62
+ import librosa
63
+ import numpy as np
64
+ import numpy.typing as npt
65
+ import onnxruntime as ort
66
+ import soundfile as sf
67
+ from box import Box
68
+ from transformers import AutoProcessor
69
+
70
+ from src.core import Qwen3TTSConfig, Qwen3TTSProcessor
71
+ from src.utils import mel_spectrogram_numpy
72
+
73
+ log = logging.getLogger(__name__)
74
+ NDArrayInt = npt.NDArray[np.int64]
75
+ """Typed alias for ``int64`` NumPy arrays used for token sequences."""
76
+
77
+ NDArrayFloat = npt.NDArray[np.floating]
78
+ """Typed alias for floating-point NumPy arrays used for audio waveforms."""
79
+
80
+ AudioLike = Union[
81
+ str, # wav path, URL, base64
82
+ np.ndarray, # waveform (requires sr)
83
+ Tuple[np.ndarray, int], # (waveform, sr)
84
+ ]
85
+
86
+
87
+ class Qwen3TTSInferencerONNX:
88
+ """
89
+ Streaming TTS inference engine backed by six ONNX Runtime sessions.
90
+ This class orchestrates the full Qwen3-TTS-Realtime pipeline using only
91
+ NumPy arrays and ONNX Runtime ``InferenceSession`` objects, with no
92
+ dependency on PyTorch at inference time.
93
+ Architecture overview::
94
+ text deltas --> push_text() --> talker LLM (Qwen3)
95
+ |
96
+ v
97
+ local talker transformer
98
+ |
99
+ audio tokens
100
+ |
101
+ v
102
+ codec decoder --> waveform
103
+ The backbone LLM maintains a growing KV-cache across the entire
104
+ generation. The local transformer creates a fresh KV-cache per audio
105
+ frame (15 autoregressive steps for 15 codebooks) and discards it. The
106
+ codec decoder maintains KV-caches that grow with the decoded audio length
107
+ up to a 72-frame sliding window, plus a 2-frame pre-convolution cache
108
+ and a 25-frame convolution upsampling cache.
109
+ Parameters
110
+ ----------
111
+ talker_model_path : str
112
+ Path to ONNX file for talker model.
113
+ talker_local_model_path : str
114
+ Path to ONNX file for local talker model.
115
+ codec_decoder_model_path : str
116
+ Path to ONNX file for codec decoder model.
117
+ speaker_encoder_model_path : str
118
+ Path to ONNX file for speaker encoder model.
119
+ talker_codec_embed_model_path : str
120
+ Path to ONNX file for talker codec embedding.
121
+ text_embed_proj_model_path : str
122
+ Path to ONNX file for text embedding and projection.
123
+ preprocessor_config_dir : str
124
+ Path to the directory with preprocessor config files, in the same format as the original Qwen3-TTS.
125
+ model_config_path : str
126
+ Path to the model configuration file, in the same format as the original Qwen3-TTS.
127
+ codec_config_path : str
128
+ Path to the codec configuration file, in the same format as the original Qwen3-TTS.
129
+ audio_ref_path : str
130
+ Path to the reference audio used for the voice-cloning identity.
131
+ language : str
132
+ Language of the synthesized audio.
133
+ num_threads : int, optional
134
+ Number of threads used in sess.intra_op_num_threads (default ``4``).
135
+ chunk_frames : int, optional
136
+ Number of codec frames per codec decoder forward pass (default ``4``, i.e. 0.32 s at 12.5 Hz).
137
+ temperature : float, optional
138
+ Sampling temperature for the talker and local transformer (default ``0.725``).
139
+ top_p : float, optional
140
+ Nucleus sampling threshold (default ``0.6``).
141
+ top_k : int, optional
142
+ Top-k sampling cutoff (default ``34``).
143
+ repetition_penalty : float, optional
144
+ Repetition penalty coefficient (default ``1.9``).
145
+ repetition_window : int, optional
146
+ Number of recent tokens considered for repetition penalty
147
+ (default ``50``).
148
+ """
149
+
150
+ _split_pattern = re.compile(
151
+ r"[。!?!?\.\u2026]\s*" # sentence boundaries: 。!? ! ? . …
152
+ r"|[,,;;::\u2014\u2013\-]\s*" # short pauses: , , ; ; : : — – -
153
+ r"|\)\s*|\]\s*" # closing brackets: ) ]
154
+ r"|\n"
155
+ )
156
+
157
+ def __init__(
158
+ self,
159
+ talker_model_path: str,
160
+ talker_local_model_path: str,
161
+ codec_decoder_model_path: str,
162
+ speaker_encoder_model_path: str,
163
+ talker_codec_embed_model_path: str,
164
+ text_embed_proj_model_path: str,
165
+ preprocessor_config_dir: str,
166
+ model_config_path: str,
167
+ codec_config_path: str,
168
+ audio_ref_path: str,
169
+ language: str,
170
+ num_threads: int = 4,
171
+ chunk_frames: int = 4,
172
+ temperature=0.725,
173
+ top_p=0.6,
174
+ top_k=34,
175
+ repetition_penalty=1.9,
176
+ repetition_window=50,
177
+ ) -> None:
178
+
179
+ opts = ort.SessionOptions()
180
+ opts.intra_op_num_threads = num_threads
181
+ opts.inter_op_num_threads = 1
182
+ opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
183
+ opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
184
+ opts.enable_cpu_mem_arena = True
185
+ opts.enable_mem_pattern = True
186
+ providers = ["CPUExecutionProvider"]
187
+
188
+ logging.info("Loading ONNX sessions...")
189
+ logging.info(f" from {talker_model_path}...")
190
+ self._talker = ort.InferenceSession(talker_model_path, sess_options=opts, providers=providers)
191
+ logging.info(f" from {talker_local_model_path}...")
192
+ self._talker_local = ort.InferenceSession(talker_local_model_path, sess_options=opts, providers=providers)
193
+ logging.info(f" from {codec_decoder_model_path}...")
194
+ self._codec_decoder = ort.InferenceSession(codec_decoder_model_path, sess_options=opts, providers=providers)
195
+ logging.info(f" from {speaker_encoder_model_path}...")
196
+ self._speaker_encoder = ort.InferenceSession(
197
+ speaker_encoder_model_path, sess_options=opts, providers=providers
198
+ )
199
+ logging.info(f" from {talker_codec_embed_model_path}...")
200
+ self._talker_codec_embed = ort.InferenceSession(
201
+ talker_codec_embed_model_path, sess_options=opts, providers=providers
202
+ )
203
+ logging.info(f" from {text_embed_proj_model_path}...")
204
+ self._text_embed_proj = ort.InferenceSession(
205
+ text_embed_proj_model_path, sess_options=opts, providers=providers
206
+ )
207
+ logging.info("[OK] All ONNX sessions loaded.")
208
+
209
+ AutoProcessor.register(Qwen3TTSConfig, Qwen3TTSProcessor)
210
+ self._processor = AutoProcessor.from_pretrained(
211
+ preprocessor_config_dir,
212
+ fix_mistral_regex=True,
213
+ )
214
+
215
+ self._audio_ref_path = audio_ref_path
216
+
217
+ with open(codec_config_path, "r") as f:
218
+ self._speech_tokenizer_config = Box(json.load(f))
219
+
220
+ self._speech_tokenizer_decoder_config = self._speech_tokenizer_config.decoder_config
221
+ self._speech_tokenizer_latent_dim = self._speech_tokenizer_decoder_config.latent_dim
222
+ self._speech_tokenizer_codebook_dim = self._speech_tokenizer_decoder_config.codebook_dim
223
+ self._speech_tokenizer_head_dim = self._speech_tokenizer_decoder_config.head_dim
224
+ self._speech_tokenizer_num_attention_heads = self._speech_tokenizer_decoder_config.num_attention_heads
225
+ self._speech_tokenizer_num_hidden_layers = self._speech_tokenizer_decoder_config.num_hidden_layers
226
+ self._speech_tokenizer_num_key_value_heads = self._speech_tokenizer_decoder_config.num_key_value_heads
227
+ self._speech_tokenizer_sliding_window = self._speech_tokenizer_decoder_config.sliding_window
228
+
229
+ self._speech_tokenizer_decoder_left_context_size = 25
230
+ self._speech_tokenizer_decoder_total_upsample = 1920
231
+
232
+ self.output_sample_rate = self._speech_tokenizer_config.output_sample_rate
233
+
234
+ with open(model_config_path, "r") as f:
235
+ self._config = Box(json.load(f))
236
+
237
+ self._talker_config = self._config.talker_config
238
+ self._code_predictor_config = self._talker_config.code_predictor_config
239
+ self._speaker_encoder_config = self._config.speaker_encoder_config
240
+
241
+ self._speaker_encoder_enc_dim = self._speaker_encoder_config.enc_dim # 1024
242
+ self._speaker_encoder_sample_rate = self._speaker_encoder_config.sample_rate # 24000
243
+
244
+ self._head_dim = self._talker_config.head_dim # 128
245
+ self._hidden_size = self._talker_config.hidden_size # 1024
246
+ self._max_position_embeddings = self._talker_config.max_position_embeddings # 32768
247
+ self._num_attention_heads = self._talker_config.num_attention_heads # 16
248
+ self._num_code_groups = self._talker_config.num_code_groups # 16
249
+ self._num_hidden_layers = self._talker_config.num_hidden_layers # 28
250
+ self._num_key_value_heads = self._talker_config.num_key_value_heads # 8
251
+ self._text_hidden_size = self._talker_config.text_hidden_size # 2048
252
+ self._text_vocab_size = self._talker_config.text_vocab_size # 151936
253
+ self._vocab_size = self._talker_config.vocab_size # 3072
254
+
255
+ self._local_head_dim = self._code_predictor_config.head_dim # 128
256
+ self._local_hidden_size = self._code_predictor_config.hidden_size # 1024
257
+ self._local_max_position_embeddings = self._code_predictor_config.max_position_embeddings # 65536
258
+ self._local_num_attention_heads = self._code_predictor_config.num_attention_heads # 16
259
+ self._local_num_code_groups = self._code_predictor_config.num_code_groups # 16
260
+ self._local_num_hidden_layers = self._code_predictor_config.num_hidden_layers # 5
261
+ self._local_num_key_value_heads = self._code_predictor_config.num_key_value_heads # 8
262
+ self._local_vocab_size = self._code_predictor_config.vocab_size # 2048
263
+
264
+ self._assistant_token_id = self._config.assistant_token_id # 77091
265
+ self._im_end_token_id = self._config.im_end_token_id # 151645
266
+ self._im_start_token_id = self._config.im_start_token_id # 151644
267
+ self._tts_bos_token_id = self._config.tts_bos_token_id # 151672
268
+ self._tts_eos_token_id = self._config.tts_eos_token_id # 151673
269
+ self._tts_pad_token_id = self._config.tts_pad_token_id # 151671
270
+
271
+ self._codec_bos_id = self._talker_config.codec_bos_id # 2149
272
+ self._codec_eos_token_id = self._talker_config.codec_eos_token_id # 2150
273
+ self._codec_think_id = self._talker_config.codec_think_id # 2154
274
+
275
+ # 2048 -> 3072, except 2150 not used in first codebook
276
+ self._suppress_tokens = [i for i in range(self._local_vocab_size, self._vocab_size)]
277
+ if self._codec_eos_token_id in self._suppress_tokens:
278
+ idx_eos = self._suppress_tokens.index(self._codec_eos_token_id)
279
+ del self._suppress_tokens[idx_eos]
280
+ self._suppress_tokens = np.array(self._suppress_tokens, dtype=np.int64)
281
+
282
+ # "chinese": 2055, "english": 2050, "german": 2053, "italian": 2070, "portuguese": 2071,
283
+ # "spanish": 2054, "japanese": 2058, "korean": 2064, "french": 2061, "russian": 2069
284
+ self._codec_language_id = self._talker_config.codec_language_id
285
+ assert (
286
+ language in self.get_supported_languages()
287
+ ), f"language {language} not in {self.get_supported_languages()}"
288
+ self._language = language
289
+
290
+ self._codec_nothink_id = self._talker_config.codec_nothink_id # 2155
291
+ self._codec_pad_id = self._talker_config.codec_pad_id # 2148
292
+ self._codec_think_bos_id = self._talker_config.codec_think_bos_id # 2156
293
+ self._codec_think_eos_id = self._talker_config.codec_think_eos_id # 2157
294
+
295
+ self._temperature = np.array([temperature], dtype=np.float32)
296
+ self._top_p = np.array([top_p], dtype=np.float32)
297
+ self._top_k = np.array([top_k], dtype=np.int64)
298
+ self._repetition_penalty = np.array([repetition_penalty], dtype=np.float32)
299
+ self._repetition_window = np.array([repetition_window], dtype=np.int64)
300
+
301
+ self.text_buffer_size = 32
302
+ self.min_text_chunk_chars = 8
303
+
304
+ self.chunk_frames = chunk_frames
305
+ self.overlap_frames = 0
306
+
307
+ self._max_steps = 2048
308
+
309
+ self._prev_tail: Optional[np.ndarray] = None
310
+ self._buffer: list[np.ndarray] = []
311
+ self._buffer_len = 0
312
+
313
+ self._prefill_key_values_llm = None
314
+ # For talker kv cache, zero length in time-dim
315
+ self._past_key_values_llm = []
316
+ for _ in range(self._num_hidden_layers):
317
+ self._past_key_values_llm.append(
318
+ np.zeros((1, self._num_key_value_heads, 0, self._head_dim), dtype=np.float32)
319
+ ) # key
320
+ self._past_key_values_llm.append(
321
+ np.zeros((1, self._num_key_value_heads, 0, self._head_dim), dtype=np.float32)
322
+ ) # value
323
+ logging.info(
324
+ f"inference init self._past_key_values_llm {self._past_key_values_llm} {self._past_key_values_llm[0].shape}"
325
+ )
326
+
327
+ # For speech tokenizer hidden state cache and kv cache, zero length in time-dim
328
+ self._past_key_values_speech_tokenizer = []
329
+ for _ in range(self._speech_tokenizer_num_hidden_layers):
330
+ self._past_key_values_speech_tokenizer.append(
331
+ np.zeros(
332
+ (
333
+ 1,
334
+ self._speech_tokenizer_num_key_value_heads,
335
+ 0,
336
+ self._speech_tokenizer_head_dim,
337
+ ),
338
+ dtype=np.float32,
339
+ )
340
+ ) # key
341
+ self._past_key_values_speech_tokenizer.append(
342
+ np.zeros(
343
+ (
344
+ 1,
345
+ self._speech_tokenizer_num_key_value_heads,
346
+ 0,
347
+ self._speech_tokenizer_head_dim,
348
+ ),
349
+ dtype=np.float32,
350
+ )
351
+ ) # value
352
+ logging.info(
353
+ f"inference init self._past_key_values_speech_tokenizer {self._past_key_values_speech_tokenizer} {self._past_key_values_speech_tokenizer[0].shape}"
354
+ )
355
+
356
+ self._pre_conv_hidden_state_cache_speech_tokenizer = np.zeros(
357
+ (1, self._speech_tokenizer_codebook_dim, 2), dtype=np.float32
358
+ )
359
+ logging.info(
360
+ f"inference init self._pre_conv_hidden_state_cache_speech_tokenizer {self._pre_conv_hidden_state_cache_speech_tokenizer} {self._pre_conv_hidden_state_cache_speech_tokenizer.shape}"
361
+ )
362
+ self._hidden_state_cache_speech_tokenizer = np.zeros(
363
+ (1, self._speech_tokenizer_latent_dim, 0), dtype=np.float32
364
+ )
365
+ logging.info(
366
+ f"inference init self._hidden_state_cache_speech_tokenizer {self._hidden_state_cache_speech_tokenizer} {self._hidden_state_cache_speech_tokenizer.shape}"
367
+ )
368
+
369
+ # [1, 0, 16]
370
+ self._generated_tokens = np.zeros((1, 0, self._num_code_groups), dtype=np.int64)
371
+ logging.info(f"inference init self._generated_tokens {self._generated_tokens} {self._generated_tokens.shape}")
372
+
373
+ self._is_stopping = False
374
+ self._last_audio_tokens = None
375
+ self._last_first_token = None
376
+ self._last_first_token_embed = None
377
+ self._last_local_tokens_embed = None
378
+ self._last_hidden_states = None
379
+ self._step_idx = 0
380
+
381
+ self._turn_input_ids = None
382
+ self._turn_idx = 0
383
+
384
+ self._text_cache = ""
385
+ self._pending_tokens: list[int] = []
386
+ self._prefilled = False
387
+ self._text_ended = False
388
+
389
+ @property
390
+ def is_finished(self) -> bool:
391
+ return self._is_stopping or self._step_idx >= self._max_steps
392
+
393
+ # ------------------------ HELPERS FOR MAIN FUNCTIONS ------------------------
394
+
395
+ def get_supported_languages(self) -> Optional[List[str]]:
396
+ """
397
+ List supported language names for the current model.
398
+
399
+ The supported languages are read from ``talker_config.codec_language_id`` in the model configuration.
400
+ If the configuration does not define language constraints (the mapping is None),
401
+ this method also returns None.
402
+
403
+ Returns:
404
+ Optional[List[str]]:
405
+ - A sorted list of supported language names (lowercased), if available.
406
+ - None if the model does not provide supported languages.
407
+ """
408
+ language_map = self._config.talker_config.codec_language_id
409
+ if language_map is None:
410
+ return None
411
+ return sorted(set(str(lang).lower() for lang in language_map.keys()))
412
+
413
+ def _is_url(self, s: str) -> bool:
414
+ try:
415
+ u = urlparse(s)
416
+ return u.scheme in ("http", "https") and bool(u.netloc)
417
+ except Exception:
418
+ return False
419
+
420
+ def _is_probably_base64(self, s: str) -> bool:
421
+ if s.startswith("data:audio"):
422
+ return True
423
+ if ("/" not in s and "\\" not in s) and len(s) > 256:
424
+ return True
425
+ return False
426
+
427
+ def _decode_base64_to_wav_bytes(self, b64: str) -> bytes:
428
+ if "," in b64 and b64.strip().startswith("data:"):
429
+ b64 = b64.split(",", 1)[1]
430
+ return base64.b64decode(b64)
431
+
432
+ def _load_audio_to_np(self, x: str) -> Tuple[np.ndarray, int]:
433
+ if self._is_url(x):
434
+ with urllib.request.urlopen(x) as resp:
435
+ audio_bytes = resp.read()
436
+ with io.BytesIO(audio_bytes) as f:
437
+ audio, sr = sf.read(f, dtype="float32", always_2d=False)
438
+ elif self._is_probably_base64(x):
439
+ wav_bytes = self._decode_base64_to_wav_bytes(x)
440
+ with io.BytesIO(wav_bytes) as f:
441
+ audio, sr = sf.read(f, dtype="float32", always_2d=False)
442
+ else:
443
+ audio, sr = librosa.load(x, sr=None, mono=True)
444
+
445
+ if audio.ndim > 1:
446
+ audio = np.mean(audio, axis=-1)
447
+
448
+ return audio.astype(np.float32), int(sr)
449
+
450
+ def _normalize_audio_inputs(self, audio: AudioLike) -> Tuple[NDArrayFloat, int]:
451
+ """
452
+ Normalize a single audio input into a (waveform, sr) tuple.
453
+
454
+ Supported forms:
455
+ - str: wav path / URL / base64 audio string
456
+ - (np.ndarray, sr): waveform + sampling rate
457
+
458
+
459
+ Args:
460
+ audio:
461
+ Audio input(s).
462
+
463
+ Returns:
464
+ Tuple[np.ndarray, int]:
465
+ (float32 waveform, original sr).
466
+
467
+ Raises:
468
+ ValueError: If a numpy waveform is provided without sr.
469
+ """
470
+ if isinstance(audio, str):
471
+ audio = self._load_audio_to_np(audio)
472
+ elif isinstance(audio, tuple) and len(audio) == 2 and isinstance(audio[0], np.ndarray):
473
+ audio = (audio[0].astype(np.float32), int(audio[1]))
474
+ elif isinstance(audio, np.ndarray):
475
+ raise ValueError("For numpy waveform input, pass a tuple (audio, sr).")
476
+ else:
477
+ raise TypeError(f"Unsupported audio input type: {type(audio)}")
478
+ if audio[0].ndim > 1:
479
+ audio = (np.mean(audio[0], axis=-1).astype(np.float32), audio[1])
481
+ return audio
482
+
483
+ def _build_assistant_text(self) -> str:
484
+ return "<|im_start|>assistant\n"
485
+
486
+ # ------------------------ MAIN FUNCTIONS ------------------------
487
+
488
+ def _tokenize_texts(self, text: Union[str, List[str]]) -> List[int]:
489
+ logging.info(f"_tokenize_texts text {text} {len(text[0]) if isinstance(text, list) else len(text)}")
490
+ input_ids = self._processor(text=text, return_tensors="np", padding=True)
491
+ logging.info(f"_tokenize_texts input_ids_dict {input_ids}")
492
+ input_ids = input_ids["input_ids"]
493
+ logging.info(f"_tokenize_texts input_ids {input_ids} {input_ids.shape}")
494
+ input_ids = np.expand_dims(input_ids, axis=0) if input_ids.ndim == 1 else input_ids
495
+ logging.info(f"_tokenize_texts input_ids_ {input_ids} {input_ids.shape}")
496
+ return list(input_ids[0]) # [B, T] -> [T]
497
+
498
+ def _prefill_embeds(
499
+ self,
500
+ audio_ref_path: str,
501
+ language: Optional[str] = None,
502
+ ) -> NDArrayFloat:
503
+ language_id = None
504
+ if language:
505
+ if language.lower() != "auto":
506
+ if language.lower() not in self._codec_language_id:
507
+ raise NotImplementedError(f"Language {language} not implemented")
508
+ else:
509
+ language_id = self._codec_language_id[language.lower()]
510
+ logging.info(f"_prefill_embeds language_id {language_id}")
511
+ speaker_embed = self.create_voice_clone_spkemb(audio_ref_path) # [B, 1, 512]
512
+ logging.info(f"_prefill_embeds speaker_embed {speaker_embed} {speaker_embed.shape}")
513
+
514
+ # For prefill
515
+ if language_id is not None:
516
+ codec_prefill_list = np.array(
517
+ [
518
+ [
519
+ self._codec_think_id,
520
+ self._codec_think_bos_id,
521
+ language_id,
522
+ self._codec_think_eos_id,
523
+ ]
524
+ ],
525
+ dtype=np.int64,
526
+ )
527
+ else:
528
+ codec_prefill_list = np.array(
529
+ [
530
+ [
531
+ self._codec_nothink_id,
532
+ self._codec_think_bos_id,
533
+ self._codec_think_eos_id,
534
+ ]
535
+ ],
536
+ dtype=np.int64,
537
+ )
538
+ logging.info(f"generate codec_prefill_list {codec_prefill_list}")
539
+ outputs = self._talker_codec_embed.run(["codec_emb"], {"codec_ids": codec_prefill_list})
540
+ codec_input_embedding_0 = outputs[0]
541
+ logging.info(f"generate codec_input_embedding_0 {codec_input_embedding_0} {codec_input_embedding_0.shape}")
542
+ outputs = self._talker_codec_embed.run(
543
+ ["codec_emb"], {"codec_ids": np.array([[self._codec_pad_id]], dtype=np.int64)}
544
+ )
545
+ codec_input_embedding_1 = outputs[0]
546
+ # self_codec_bos_id,
547
+ logging.info(f"generate codec_input_embedding_1 {codec_input_embedding_1} {codec_input_embedding_1.shape}")
548
+ codec_input_embedding = np.concatenate(
549
+ [codec_input_embedding_0, speaker_embed, codec_input_embedding_1], axis=1
550
+ )
551
+ logging.info(f"generate codec_input_embedding {codec_input_embedding} {codec_input_embedding.shape}")
552
+
553
+ # <|im_start|>assistant\n
554
+ prefix_tokens = np.expand_dims(
555
+ np.array(self._tokenize_texts([self._build_assistant_text()]), dtype=np.int64), axis=0
556
+ )
557
+ outputs = self._text_embed_proj.run(["text_emb_out"], {"text_ids": prefix_tokens}) # 3
558
+ _talker_input_embed_role = outputs[0]
559
+ logging.info(f"generate _talker_input_embed_role {_talker_input_embed_role} {_talker_input_embed_role.shape}")
560
+
561
+ outputs = self._text_embed_proj.run(
562
+ ["text_emb_out"],
563
+ {"text_ids": np.array([[self._tts_bos_token_id, self._tts_pad_token_id]], dtype=np.int64)},
564
+ )
565
+ embeds = outputs[0]
566
+ tts_bos_embed, tts_pad_embed = embeds[:, :1], embeds[:, 1:] # 2 * [1 1 d]
567
+ logging.info(f"generate tts_bos_embed {tts_bos_embed} {tts_bos_embed.shape}")
568
+ logging.info(f"generate tts_pad_embed {tts_pad_embed} {tts_pad_embed.shape}")
569
+
570
+ # tts_pad * (4 or 5) + tts_bos; codec_input_embedding_0 (+ speaker_embed) + codec_pad_id --> 5 or 6
571
+ _talker_input_embed = (
572
+ np.concatenate(
573
+ (
574
+ np.broadcast_to(
575
+ tts_pad_embed,
576
+ (tts_pad_embed.shape[0], codec_input_embedding.shape[1] - 1, tts_pad_embed.shape[2]),
577
+ ), # 4 or 5
578
+ tts_bos_embed, # 1
579
+ ),
580
+ axis=1,
581
+ )
582
+ + codec_input_embedding # 5 or 6
583
+ )
584
+ logging.info(f"generate _talker_input_embed {_talker_input_embed} {_talker_input_embed.shape}")
585
+
586
+ talker_input_embed = np.concatenate((_talker_input_embed_role, _talker_input_embed), axis=1) # 3 + 5/6
587
+ logging.info(f"generate talker_input_embed {talker_input_embed} {talker_input_embed.shape}")
588
+
589
+ return talker_input_embed
590
+
591
+ def create_voice_clone_spkemb(
592
+ self,
593
+ ref_audio: AudioLike,
594
+ ) -> NDArrayFloat:
595
+ normalized = self._normalize_audio_inputs(ref_audio)
596
+ logging.info(f"create_voice_clone_prompt normalized {normalized} {normalized[0][0].shape} {normalized[0][1]}")
597
+
598
+ wav, sr = normalized
599
+ wav_resample = wav
600
+ if sr != self._speaker_encoder_sample_rate:
601
+ wav_resample = librosa.resample(
602
+ y=wav_resample.astype(np.float32),
603
+ orig_sr=int(sr),
604
+ target_sr=self._speaker_encoder_sample_rate,
605
+ )
606
+
607
+ logging.info(
608
+ f"create_voice_clone_spkemb wav_resample {wav_resample} {wav_resample.shape} {wav_resample.dtype}"
609
+ )
610
+ mels = mel_spectrogram_numpy(
611
+ wav_resample,
612
+ n_fft=1024,
613
+ num_mels=128,
614
+ sampling_rate=24000,
615
+ hop_size=256,
616
+ win_size=1024,
617
+ fmin=0,
618
+ fmax=12000,
619
+ )
620
+ logging.info(f"create_voice_clone_spkemb mels {mels} {mels.shape} {mels.dtype}")
621
+
622
+ outputs = self._speaker_encoder.run(["speaker_embedding"], {"mel_spec": mels})
623
+ spk_emb = outputs[0]
624
+ logging.info(f"create_voice_clone_prompt spk_emb {spk_emb} {spk_emb.shape}")
625
+
626
+ return spk_emb
627
+
628
+ def generate_local_transformer(self) -> None:
629
+ feed = {
630
+ "past_hidden": self._last_hidden_states,
631
+ "past_id_hidden": self._last_first_token_embed,
632
+ "generated_tokens": self._generated_tokens[..., 1:],
633
+ "temperature": self._temperature,
634
+ "top_p": self._top_p,
635
+ "top_k": self._top_k,
636
+ "repetition_penalty": self._repetition_penalty,
637
+ "repetition_window": self._repetition_window,
638
+ }
639
+ output_names = ["outputs_tokens", "outputs_embeds"]
640
+ logging.info(
641
+ f"generate_local_transformer self._last_hidden_states {self._last_hidden_states} {self._last_hidden_states.shape} {self._last_hidden_states.dtype}"
642
+ )
643
+ logging.info(
644
+ f"generate_local_transformer self._last_first_token_embed {self._last_first_token_embed} {self._last_first_token_embed.shape} {self._last_first_token_embed.dtype}"
645
+ )
646
+ outputs = self._talker_local.run(output_names, feed)
647
+ local_tokens, self._last_local_tokens_embed = outputs[0], outputs[1]
648
+ logging.info(
649
+ f"generate_local_transformer local_tokens {local_tokens} {local_tokens.shape} {local_tokens.dtype}"
650
+ )
651
+ logging.info(
652
+ f"generate_local_transformer self._last_local_tokens_embed {self._last_local_tokens_embed} {self._last_local_tokens_embed.shape} {self._last_local_tokens_embed.dtype}"
653
+ )
654
+ self._last_audio_tokens = np.concatenate(
655
+ (np.expand_dims(self._last_first_token, axis=-1), local_tokens), axis=1
656
+ )[None, :, :]
657
+ logging.info(
658
+ f"generate_local_transformer self._last_audio_tokens {self._last_audio_tokens} {self._last_audio_tokens.shape} {self._last_audio_tokens.dtype}"
659
+ )
660
+ self._generated_tokens = np.concatenate((self._generated_tokens, self._last_audio_tokens), axis=1)
661
+ logging.info(
662
+ f"generate_local_transformer self._generated_tokens {self._generated_tokens.shape} {self._generated_tokens.dtype}"
663
+ )
664
+ return
665
+
666
+ def _set_talker_zero_kv_cache(self, batch_size=1):
667
+ """Set talker zero KV cache for all layers."""
668
+ kv = {}
669
+ for i in range(self._num_hidden_layers):
670
+ kv[f"past_key_{i}"] = np.zeros(
671
+ (batch_size, self._num_key_value_heads, 0, self._head_dim), dtype=np.float32
672
+ )
673
+ kv[f"past_value_{i}"] = np.zeros(
674
+ (batch_size, self._num_key_value_heads, 0, self._head_dim), dtype=np.float32
675
+ )
676
+ return kv
677
+
678
+ def _set_talker_kv_cache(self):
679
+ """Set talker KV cache for all layers."""
680
+ kv = {}
681
+ for i in range(self._num_hidden_layers):
682
+ kv[f"past_key_{i}"] = self._past_key_values_llm[2 * i]
683
+ kv[f"past_value_{i}"] = self._past_key_values_llm[2 * i + 1]
684
+ return kv
685
+
686
+ def _set_codec_decoder_kv_cache(self, past_key_values):
687
+ """Set talker KV cache for all layers."""
688
+ kv = {}
689
+ for i in range(self._speech_tokenizer_num_hidden_layers):
690
+ kv[f"past_key_{i}"] = past_key_values[2 * i]
691
+ kv[f"past_value_{i}"] = past_key_values[2 * i + 1]
692
+ return kv
693
+
694
+ def prefill(self) -> None:
695
+ inputs_embeds = self._prefill_embeds(self._audio_ref_path, self._language)
696
+ logging.info(f"prefill inputs_embeds {inputs_embeds} {inputs_embeds.shape}")
697
+
698
+ kv_cache = self._set_talker_zero_kv_cache(batch_size=1)
699
+ feed = {
700
+ "inputs_embeds": inputs_embeds,
701
+ "generated_tokens": self._generated_tokens[..., 0],
702
+ "temperature": self._temperature,
703
+ "top_p": self._top_p,
704
+ "top_k": self._top_k,
705
+ "repetition_penalty": self._repetition_penalty,
706
+ "repetition_window": self._repetition_window,
707
+ }
708
+ feed.update(kv_cache)
709
+
710
+ output_names = ["logits", "token", "token_embed", "hidden_states"]
711
+ for i in range(self._num_hidden_layers):
712
+ output_names.extend([f"present_key_{i}", f"present_value_{i}"])
713
+
714
+ logging.info(f"prefill inputs_embeds {inputs_embeds} {inputs_embeds.shape} {inputs_embeds.dtype}")
715
+ logging.info(f"prefill self._prefill_key_values_llm before {self._prefill_key_values_llm}")
716
+ outputs = self._talker.run(output_names, feed)
717
+ # logits, self._last_first_token, self._last_first_token_embed, self._last_hidden_states, self._prefill_key_values_llm = outputs[0], outputs[1], outputs[2], outputs[3:]
718
+ _, _, _, _, self._prefill_key_values_llm = outputs[0], outputs[1], outputs[2], outputs[3], outputs[4:]
719
+ logging.info(
720
+ f"prefill self._prefill_key_values_llm after {self._prefill_key_values_llm[0].shape} {len(self._prefill_key_values_llm)}"
721
+ )
722
+ self._prefilled = True
723
+ return
724
+
725
+ def step(
726
+ self,
727
+ text_token: Optional[int] = None, # [B, 1]
728
+ ) -> Union[NDArrayInt, None]:
729
+ if not self._prefilled:
730
+ raise ValueError("You must call prefill() before step().")
731
+ if self.is_finished:
732
+ return self._last_audio_tokens
733
+
734
+ # last codec embeds
735
+ if self._step_idx > 0:
736
+ logging.info(
737
+ f"step-{self._step_idx} self._last_first_token_embed {self._last_first_token_embed.shape} {self._last_first_token_embed.dtype}"
738
+ )
739
+ logging.info(
740
+ f"step-{self._step_idx} self._last_local_tokens_embed {self._last_local_tokens_embed.shape} {self._last_local_tokens_embed.dtype}"
741
+ )
742
+ codec_embeds = self._last_first_token_embed + self._last_local_tokens_embed
743
+ logging.info(f"step-{self._step_idx} codec_embeds {codec_embeds.shape} {codec_embeds.dtype}")
744
+ else:
745
+ self._past_key_values_llm = self._prefill_key_values_llm
746
+ outputs = self._talker_codec_embed.run(
747
+ ["codec_emb"], {"codec_ids": np.array([[self._codec_bos_id]], dtype=np.int64)}
748
+ )
749
+ codec_embeds = outputs[0]
750
+ logging.info(f"step-{self._step_idx} codec_embeds {codec_embeds.shape} {codec_embeds.dtype}")
751
+ # tts token step
752
+ if text_token is not None:
753
+ text_token = np.array([[text_token]], dtype=np.int64)
754
+ logging.info(f"step-{self._step_idx} text_token not None {text_token} {text_token.shape}")
755
+ else:
756
+ text_token = np.array([[self._tts_pad_token_id]], dtype=np.int64)
757
+ logging.info(f"step-{self._step_idx} text_token None {text_token} {text_token.shape}")
758
+ outputs = self._text_embed_proj.run(["text_emb_out"], {"text_ids": text_token})
759
+ text_embeds = outputs[0]
760
+ logging.info(f"step-{self._step_idx} text_embeds {text_embeds} {text_embeds.shape} {text_embeds.dtype}")
761
+ inputs_embeds = text_embeds + codec_embeds
762
+ logging.info(
763
+ f"step-{self._step_idx} inputs_embeds {inputs_embeds} {inputs_embeds.shape} {inputs_embeds.dtype}"
764
+ )
765
+
766
+ kv_cache = self._set_talker_kv_cache()
767
+ feed = {
768
+ "inputs_embeds": inputs_embeds,
769
+ "generated_tokens": self._generated_tokens[..., 0],
770
+ "temperature": self._temperature,
771
+ "top_p": self._top_p,
772
+ "top_k": self._top_k,
773
+ "repetition_penalty": self._repetition_penalty,
774
+ "repetition_window": self._repetition_window,
775
+ }
776
+ feed.update(kv_cache)
777
+
778
+ output_names = ["logits", "token", "token_embed", "hidden_states"]
779
+ for i in range(self._num_hidden_layers):
780
+ output_names.extend([f"present_key_{i}", f"present_value_{i}"])
781
+
782
+ outputs = self._talker.run(output_names, feed)
783
+ (
784
+ _,
785
+ self._last_first_token,
786
+ self._last_first_token_embed,
787
+ self._last_hidden_states,
788
+ self._past_key_values_llm,
789
+ ) = (outputs[0], outputs[1], outputs[2], outputs[3], outputs[4:])
790
+ logging.info(
791
+ f"step-{self._step_idx} self._last_first_token {self._last_first_token} {self._last_first_token.shape} {self._last_first_token.dtype}"
792
+ )
793
+ logging.info(
794
+ f"step-{self._step_idx} self._last_first_token_embed {self._last_first_token_embed.shape} {self._last_first_token_embed.dtype}"
795
+ )
796
+ logging.info(
797
+ f"step-{self._step_idx} self._last_hidden_states {self._last_hidden_states.shape} {self._last_hidden_states.dtype}"
798
+ )
799
+ logging.info(
800
+ f"step-{self._step_idx} self._past_key_values_llm {self._past_key_values_llm[0].shape} {self._past_key_values_llm[0].dtype}"
801
+ )
802
+
803
+ self._is_stopping = self._last_first_token == self._codec_eos_token_id
804
+ if self.is_finished:
805
+ return None
806
+
807
+ self.generate_local_transformer()
808
+ logging.info(
809
+ f"step-{self._step_idx} self._last_audio_tokens {self._last_audio_tokens} {self._last_audio_tokens.shape} {self._last_audio_tokens.dtype}"
810
+ )
811
+ logging.info(
812
+ f"step-{self._step_idx} self._generated_tokens {self._generated_tokens.shape} {self._generated_tokens.dtype}"
813
+ )
814
+ self._step_idx += 1
815
+ return self._last_audio_tokens
816
+
817
+ # ------------------------ STREAMING HELPERS ------------------------
818
+
819
+ def _drain_pending_tokens(self) -> list[NDArrayInt]:
820
+ outputs: list[NDArrayInt] = []
821
+ if not self._prefilled:
822
+ self.prefill()
823
+ return outputs
824
+
825
+ while self._pending_tokens and not self.is_finished:
826
+ logging.info(f"pending_tokens before pop {self._pending_tokens}")
827
+ token = self._pending_tokens.pop(0)
828
+ logging.info(f"token {token}")
829
+ logging.info(f"pending_tokens after pop {self._pending_tokens}")
830
+ output = self.step(token)
831
+ if output is not None:
832
+ outputs.append(output)
833
+ logging.info(f"outputs {outputs} {len(outputs)}")
834
+
835
+ return outputs
836
+
837
+ def end_text(self) -> list[NDArrayInt]:
838
+ self._text_ended = True
839
+ if self._text_cache:
840
+ self._pending_tokens.extend(self._tokenize_texts([self._text_cache]))
841
+ self._text_cache = ""
842
+ return self._drain_pending_tokens()
843
+
844
+ def drain(self, max_steps: Optional[int] = None) -> list[NDArrayInt]:
845
+ if not self._prefilled:
846
+ return []
847
+ return self.finish(max_steps=max_steps)
848
+
849
+ def _extract_text_segments(self, force: bool) -> list[str]:
850
+ segments = []
851
+ if force:
852
+ if self._text_cache:
853
+ segments.append(self._text_cache)
854
+ self._text_cache = ""
855
+ return segments
856
+
857
+ while self._text_cache:
858
+ cut_idx = None
859
+ if len(self._text_cache) >= self.min_text_chunk_chars:
860
+ matches = list(self._split_pattern.finditer(self._text_cache))
861
+ for match in matches:
862
+ if match.end() >= self.min_text_chunk_chars:
863
+ cut_idx = match.end()
864
+ break
865
+ if cut_idx is None and len(self._text_cache) >= self.text_buffer_size:
866
+ whitespace_idx = self._text_cache.rfind(" ")
867
+ if whitespace_idx != -1:
868
+ cut_idx = whitespace_idx + 1
869
+ if cut_idx is None:
870
+ break
871
+ segments.append(self._text_cache[:cut_idx])
872
+ self._text_cache = self._text_cache[cut_idx:]
873
+ return segments
874
+
875
+ def push_text(self, text_fragment: str) -> list[NDArrayInt]:
876
+ logging.info(f"text_cache before {self._text_cache}")
877
+ logging.info(f"text_fragment {text_fragment}")
878
+ self._text_cache += text_fragment
879
+ logging.info(f"text_cache after {self._text_cache}")
880
+ segments = self._extract_text_segments(force=False)
881
+ logging.info(f"segments {segments}")
882
+ for segment in segments:
883
+ logging.info(f"segment {segment}")
884
+ logging.info(f"pending_tokens before {self._pending_tokens}")
885
+ self._pending_tokens.extend(self._tokenize_texts([segment]))
886
+ logging.info(f"pending_tokens after {self._pending_tokens}")
887
+ return self._drain_pending_tokens()
888
+
889
+ def push_tokens(self, audio_tokens: NDArrayInt):
890
+ if audio_tokens.ndim != 2:
891
+ raise ValueError(f"Expected [T, C] audio tokens, got {tuple(audio_tokens.shape)}")
892
+ self._buffer.append(audio_tokens)
893
+ self._buffer_len += audio_tokens.shape[0]
894
+ logging.info(
895
+ f"push_tokens audio_tokens {audio_tokens} {audio_tokens.shape} self.buffer {self._buffer} {self._buffer_len}"
896
+ )
897
+
898
+ def _overlap_samples(self, wav: NDArrayFloat) -> int:
899
+ if self.chunk_frames <= 0:
900
+ return 0
901
+ return int(wav.size * (self.overlap_frames / self.chunk_frames))
902
+
903
+ def _apply_crossfade(self, wav: NDArrayFloat, final_chunk: bool = False) -> NDArrayFloat:
904
+ if self.overlap_frames <= 0:
905
+ return wav
906
+
907
+ overlap = self._overlap_samples(wav)
908
+ if overlap == 0:
909
+ return wav
910
+
911
+ if self._prev_tail is None:
912
+ self._prev_tail = wav[-overlap:].copy() if not final_chunk else None
913
+ return wav
914
+
915
+ prev_tail = self._prev_tail
916
+ if prev_tail.size < overlap:
917
+ overlap = prev_tail.size
918
+ if overlap == 0:
919
+ return wav
920
+
921
+ fade_out = np.linspace(1.0, 0.0, overlap, dtype=wav.dtype)
922
+ fade_in = 1.0 - fade_out
923
+
924
+ cross = prev_tail[-overlap:] * fade_out + wav[:overlap] * fade_in
925
+ merged = np.concatenate([prev_tail[:-overlap], cross, wav[overlap:]], axis=-1)
926
+
927
+ self._prev_tail = None if final_chunk else wav[-overlap:].copy()
928
+ return merged
929
+
930
+ def _process_frames_to_audio(self, chunk_frames_length: int) -> NDArrayFloat:
931
+ chunk_tokens = self._consume_frames(chunk_frames_length)
932
+ # pad left for pre_conv inside with cache
933
+ logging.info(f"_process_frames_to_audio chunk_tokens {chunk_tokens} {chunk_tokens.shape}")
934
+ # past key values with sliding windows
935
+ len_for_past_key_values = self._speech_tokenizer_sliding_window - chunk_tokens.shape[-1]
936
+ logging.info(
937
+ f"_process_frames_to_audio len_for_past_key_values {len_for_past_key_values} {self._past_key_values_speech_tokenizer[0].shape}"
938
+ )
939
+ past_key_values = [
940
+ past_kv[:, :, -len_for_past_key_values:] for past_kv in self._past_key_values_speech_tokenizer
941
+ ]
942
+ logging.info(f"_process_frames_to_audio past_key_values {past_key_values[0].shape} {len_for_past_key_values}")
943
+ # pad hidden_state_cache for input to upsampling conv with left context size
944
+ hidden_state_cache = self._hidden_state_cache_speech_tokenizer
945
+ len_hidden_state_cache = self._hidden_state_cache_speech_tokenizer.shape[-1]
946
+ logging.info(
947
+ f"_process_frames_to_audio hidden_state_cache {hidden_state_cache.shape} {len_hidden_state_cache}"
948
+ )
949
+
950
+ kv_cache = self._set_codec_decoder_kv_cache(past_key_values)
951
+ feed = {
952
+ "codes": chunk_tokens,
953
+ "hidden_state_cache": hidden_state_cache,
954
+ "pre_conv_hidden_state_cache": self._pre_conv_hidden_state_cache_speech_tokenizer,
955
+ }
956
+ feed.update(kv_cache)
957
+
958
+ output_names = ["wav", "current_hidden_state_cache", "current_pre_conv_hidden_state_cache"]
959
+ for i in range(self._speech_tokenizer_num_hidden_layers):
960
+ output_names.extend([f"present_key_{i}", f"present_value_{i}"])
961
+
962
+ outputs = self._codec_decoder.run(output_names, feed)
963
+ wav, hidden_state_cache, self._pre_conv_hidden_state_cache_speech_tokenizer, past_key_values = (
964
+ outputs[0],
965
+ outputs[1],
966
+ outputs[2],
967
+ outputs[3:],
968
+ )
969
+
970
+ self._past_key_values_speech_tokenizer = [
971
+ past_kv[:, :, -self._speech_tokenizer_sliding_window + 1 :] for past_kv in past_key_values
972
+ ]
973
+ self._hidden_state_cache_speech_tokenizer = hidden_state_cache[
974
+ :, :, -self._speech_tokenizer_decoder_left_context_size :
975
+ ]
976
+ logging.info(
977
+ f"_process_frames_to_audio self._past_key_values_speech_tokenizer {self._past_key_values_speech_tokenizer[0].shape}"
978
+ )
979
+ logging.info(
980
+ f"_process_frames_to_audio self._hidden_state_cache_speech_tokenizer {self._hidden_state_cache_speech_tokenizer.shape}"
981
+ )
982
+ logging.info(
983
+ f"_process_frames_to_audio self._pre_conv_hidden_state_cache_speech_tokenizer {self._pre_conv_hidden_state_cache_speech_tokenizer.shape}"
984
+ )
985
+ logging.info(f"_process_frames_to_audio wav before {wav} {wav.shape}")
986
+ wav = wav[..., len_hidden_state_cache * self._speech_tokenizer_decoder_total_upsample :]
987
+ logging.info(f"_process_frames_to_audio wav after {wav} {wav.shape}")
988
+ return wav
989
+
990
+ def flush(self) -> Optional[NDArrayFloat]:
991
+ if self._buffer_len == 0:
992
+ return None
993
+ logging.info(f"flush buffer_len {self._buffer_len}")
994
+ wav = self._process_frames_to_audio(self._buffer_len)
995
+ return self._apply_crossfade(wav, final_chunk=True)
996
+
997
+ def _consume_frames(self, num_frames: int) -> NDArrayInt:
998
+ frames = []
999
+ remaining = num_frames
1000
+ while remaining > 0 and self._buffer:
1001
+ head = self._buffer[0]
1002
+ if head.shape[0] <= remaining:
1003
+ frames.append(head)
1004
+ remaining -= head.shape[0]
1005
+ self._buffer.pop(0)
1006
+ else:
1007
+ frames.append(head[:remaining])
1008
+ self._buffer[0] = head[remaining:]
1009
+ remaining = 0
1010
+ self._buffer_len -= num_frames - remaining
1011
+ return np.expand_dims(np.transpose(np.concatenate(frames, axis=0), (1, 0)), axis=0)
1012
+
1013
+ def audio_chunks(self) -> Iterable[NDArrayFloat]:
1014
+ while self._buffer_len >= self.chunk_frames:
1015
+ logging.info(f"audio_chunks buffer_len chunk_frames {self._buffer_len} {self.chunk_frames}")
1016
+ wav = self._process_frames_to_audio(self.chunk_frames)
1017
+ yield self._apply_crossfade(wav)
1018
+
1019
+ def finish(self, max_steps: Optional[int] = None) -> list[NDArrayInt]:
1020
+ outputs = []
1021
+ steps_left = max_steps if max_steps is not None else self._max_steps
1022
+ while steps_left > 0 and not self.is_finished:
1023
+ output = self.step(text_token=None)
1024
+ if output is not None:
1025
+ outputs.append(output)
1026
+ steps_left -= 1
1027
+ return outputs
1028
+
1029
+ # ------------------------ STATE RESET HELPERS ------------------------
1030
+
1031
+ def reset_generation_state(self, keep_prefill_cache: bool = True) -> None:
1032
+ self._past_key_values_llm = []
1033
+ for _ in range(self._num_hidden_layers):
1034
+ self._past_key_values_llm.append(
1035
+ np.zeros((1, self._num_key_value_heads, 0, self._head_dim), dtype=np.float32)
1036
+ ) # key
1037
+ self._past_key_values_llm.append(
1038
+ np.zeros((1, self._num_key_value_heads, 0, self._head_dim), dtype=np.float32)
1039
+ ) # value
1040
+ logging.info(
1041
+ f"reset self._past_key_values_llm {self._past_key_values_llm} {self._past_key_values_llm[0].shape}"
1042
+ )
1043
+
1044
+ self._past_key_values_speech_tokenizer = []
1045
+ for _ in range(self._speech_tokenizer_num_hidden_layers):
1046
+ self._past_key_values_speech_tokenizer.append(
1047
+ np.zeros(
1048
+ (
1049
+ 1,
1050
+ self._speech_tokenizer_num_key_value_heads,
1051
+ 0,
1052
+ self._speech_tokenizer_head_dim,
1053
+ ),
1054
+ dtype=np.float32,
1055
+ )
1056
+ ) # key
1057
+ self._past_key_values_speech_tokenizer.append(
1058
+ np.zeros(
1059
+ (
1060
+ 1,
1061
+ self._speech_tokenizer_num_key_value_heads,
1062
+ 0,
1063
+ self._speech_tokenizer_head_dim,
1064
+ ),
1065
+ dtype=np.float32,
1066
+ )
1067
+ ) # value
1068
+ logging.info(
1069
+ f"reset self._past_key_values_speech_tokenizer {self._past_key_values_speech_tokenizer} {self._past_key_values_speech_tokenizer[0].shape}"
1070
+ )
1071
+
1072
+ self._pre_conv_hidden_state_cache_speech_tokenizer = np.zeros(
1073
+ (1, self._speech_tokenizer_codebook_dim, 2), dtype=np.float32
1074
+ )
1075
+ logging.info(
1076
+ f"reset self._pre_conv_hidden_state_cache_speech_tokenizer {self._pre_conv_hidden_state_cache_speech_tokenizer} {self._pre_conv_hidden_state_cache_speech_tokenizer.shape}"
1077
+ )
1078
+ self._hidden_state_cache_speech_tokenizer = np.zeros(
1079
+ (1, self._speech_tokenizer_latent_dim, 0), dtype=np.float32
1080
+ )
1081
+ logging.info(
1082
+ f"reset self._hidden_state_cache_speech_tokenizer {self._hidden_state_cache_speech_tokenizer} {self._hidden_state_cache_speech_tokenizer.shape}"
1083
+ )
1084
+
1085
+ # [1, 0, 16]
1086
+ self._generated_tokens = np.zeros((1, 0, self._num_code_groups), dtype=np.int64)
1087
+ logging.info(f"reset self._generated_tokens {self._generated_tokens} {self._generated_tokens.shape}")
1088
+
1089
+ if not keep_prefill_cache:
1090
+ self._prefill_key_values_llm = None
1091
+ self._prefilled = False
1092
+
1093
+ self._is_stopping = None
1094
+ self._last_audio_tokens = None
1095
+ self._last_first_token = None
1096
+ self._last_first_token_embed = None
1097
+ self._last_hidden_states = None
1098
+ self._step_idx = 0
1099
+
1100
+ return
1101
+
1102
+ def reset_turn(self, reset_cache: bool = False) -> None:
1103
+ self._turn_idx += 1
1104
+
1105
+ self._text_cache = ""
1106
+ self._pending_tokens = []
1107
+ self._prefilled = False
1108
+ self._text_ended = False
1109
+
1110
+ self.reset_generation_state(keep_prefill_cache=True)
1111
+
1112
+ return
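For reference, a hedged end-to-end sketch that wires the inferencer together in the same order as the module docstring's "Typical usage"; every path below is a placeholder for the corresponding ONNX, config, or reference-audio file in this repository:

```python
# Sketch only: all paths are placeholders; point them at the exported ONNX models,
# the original Qwen3-TTS config files, and a reference audio clip.
import numpy as np
import soundfile as sf

from src.inference import Qwen3TTSInferencerONNX

tts = Qwen3TTSInferencerONNX(
    talker_model_path="qwen3-tts_onnx/talker_model.onnx",
    talker_local_model_path="qwen3-tts_onnx/talker_local_model.onnx",
    codec_decoder_model_path="qwen3-tts_onnx/codec_decoder_model.onnx",
    speaker_encoder_model_path="qwen3-tts_onnx/speaker_encoder_model.onnx",
    talker_codec_embed_model_path="qwen3-tts_onnx/talker_codec_embed_model.onnx",
    text_embed_proj_model_path="qwen3-tts_onnx/text_embed_proj_model.onnx",
    preprocessor_config_dir="path/to/preprocessor_config_dir",   # placeholder
    model_config_path="path/to/config.json",                     # placeholder
    codec_config_path="path/to/codec_config.json",               # placeholder
    audio_ref_path="path/to/reference.wav",                      # placeholder
    language="english",
)

tts.reset_turn(reset_cache=True)
chunks = []

def collect(frame_batches):
    # Each batch is a [1, T, num_code_groups] array of RVQ codec frames.
    for frames in frame_batches:
        tts.push_tokens(frames[0])          # push as [T, C]
        chunks.extend(tts.audio_chunks())   # decode full chunk_frames blocks

for delta in ["Hello there, ", "this is a streaming ", "TTS demo."]:
    collect(tts.push_text(delta))
collect(tts.end_text())
collect(tts.finish())

tail = tts.flush()                          # decode whatever is left in the buffer
if tail is not None:
    chunks.append(tail)

wav = np.concatenate([np.asarray(c).reshape(-1) for c in chunks])
sf.write("output.wav", wav, tts.output_sample_rate)
```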
src/utils/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .audio_utils import mel_spectrogram_numpy
src/utils/audio_utils.py ADDED
@@ -0,0 +1,263 @@
1
+ # Copyright 2026 Patrick Lumbantobing, Vertox-AI
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ """
16
+ Utility functions and classes for audio processing.
17
+ """
18
+
19
+ from __future__ import annotations
20
+
21
+ import numpy as np
22
+ import numpy.typing as npt
23
+
24
+
25
+ def hz_to_mel(freq: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
26
+ """
27
+ Convert Hz to mel using the HTK formula.
28
+
29
+ Args:
30
+ freq: Frequencies in Hz.
31
+
32
+ Returns:
33
+ Frequencies in mel.
34
+ """
35
+ return 2595.0 * np.log10(1.0 + freq / 700.0)
36
+
37
+
38
+ def mel_to_hz(mels: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
39
+ """
40
+ Convert mel to Hz using the HTK formula.
41
+
42
+ Args:
43
+ mels: Values in mel.
44
+
45
+ Returns:
46
+ Frequencies in Hz.
47
+ """
48
+ return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
49
+
50
+
51
+ def librosa_style_mel_filterbank(
52
+ *,
53
+ sr: int,
54
+ n_fft: int,
55
+ n_mels: int,
56
+ fmin: float,
57
+ fmax: float | None = None,
58
+ norm: str | None = "slaney",
59
+ ) -> npt.NDArray[np.float32]:
60
+ """
61
+ Build a mel filterbank matching librosa.filters.mel(htk=True, norm="slaney"): HTK mel scale with Slaney-style area normalization.
62
+
63
+ Args:
64
+ sr: Sample rate.
65
+ n_fft: FFT size.
66
+ n_mels: Number of mel bins.
67
+ fmin: Minimum frequency in Hz.
68
+ fmax: Maximum frequency in Hz. If None, defaults to sr / 2.
69
+ norm: If "slaney", apply area normalization.
70
+
71
+ Returns:
72
+ Mel filterbank with shape [n_mels, n_fft // 2 + 1].
73
+ """
74
+ if fmax is None:
75
+ fmax = sr / 2.0
76
+
77
+ n_freqs = n_fft // 2 + 1
78
+ freqs = np.linspace(0.0, sr / 2.0, n_freqs, dtype=np.float64)
79
+
80
+ m_min = hz_to_mel(np.array([fmin], dtype=np.float64))[0]
81
+ m_max = hz_to_mel(np.array([fmax], dtype=np.float64))[0]
82
+ m_pts = np.linspace(m_min, m_max, n_mels + 2, dtype=np.float64)
83
+ hz_pts = mel_to_hz(m_pts)
84
+
85
+ fb = np.zeros((n_mels, n_freqs), dtype=np.float64)
86
+
87
+ for i in range(n_mels):
88
+ left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
89
+
90
+ left_slope = (freqs - left) / (center - left + 1e-10)
91
+ right_slope = (right - freqs) / (right - center + 1e-10)
92
+
93
+ fb[i] = np.maximum(0.0, np.minimum(left_slope, right_slope))
94
+
95
+ if norm == "slaney":
96
+ # Match Slaney-style area normalization used by librosa/torchaudio.
97
+ enorm = 2.0 / (hz_pts[2:] - hz_pts[:-2])
98
+ fb *= enorm[:, None]
99
+
100
+ return fb.astype(np.float32)
101
+
102
+
103
+ def dynamic_range_compression_np(
104
+ x: npt.NDArray[np.float32],
105
+ C: float = 1.0,
106
+ clip_val: float = 1e-5,
107
+ ) -> npt.NDArray[np.float32]:
108
+ """
109
+ NumPy equivalent of torch.log(torch.clamp(x, min=clip_val) * C).
110
+
111
+ Args:
112
+ x: Input array.
113
+ C: Multiplicative constant.
114
+ clip_val: Minimum allowed value before log.
115
+
116
+ Returns:
117
+ Log-compressed array.
118
+ """
119
+ return np.log(np.clip(x, a_min=clip_val, a_max=None) * C).astype(np.float32)
120
+
121
+
122
+ def _reflect_pad_1d(x: npt.NDArray[np.float32], pad: int) -> npt.NDArray[np.float32]:
123
+ """
124
+ Reflect-pad a [1, T] waveform along the time axis.
125
+
126
+ Args:
127
+ x: Waveform with shape [1, T].
128
+ pad: Number of samples to pad on each side.
129
+
130
+ Returns:
131
+ Padded waveform with shape [1, T + 2 * pad].
132
+ """
133
+ if pad == 0:
134
+ return x
135
+ left = x[:, 1 : pad + 1][:, ::-1]
136
+ right = x[:, -pad - 1 : -1][:, ::-1]
137
+ return np.concatenate([left, x, right], axis=1)
138
+
139
+
140
+ def _stft_magnitude(
141
+ y: npt.NDArray[np.float32],
142
+ *,
143
+ n_fft: int,
144
+ hop_size: int,
145
+ win_size: int,
146
+ center: bool,
147
+ ) -> npt.NDArray[np.float32]:
148
+ """
149
+ Compute magnitude STFT for a single-channel waveform.
150
+
151
+ Args:
152
+ y: Input waveform of shape [1, T].
153
+ n_fft: FFT size.
154
+ hop_size: Hop size between frames.
155
+ win_size: Window size.
156
+ center: Whether to pad the input before framing.
157
+
158
+ Returns:
159
+ Magnitude spectrogram with shape [1, frames, n_fft // 2 + 1].
160
+ """
161
+ if y.ndim != 2 or y.shape[0] != 1:
162
+ raise ValueError("Expected waveform shape [1, T].")
163
+
164
+ x = y.astype(np.float32, copy=False)
165
+
166
+ if center:
167
+ pad = n_fft // 2
168
+ x = _reflect_pad_1d(x, pad)
169
+
170
+ if x.shape[1] < n_fft:
171
+ raise ValueError("Input is too short for the requested n_fft.")
172
+
173
+ num_frames = 1 + (x.shape[1] - n_fft) // hop_size
174
+ frame_starts = hop_size * np.arange(num_frames, dtype=np.int64)
175
+ frame_offsets = np.arange(n_fft, dtype=np.int64)
176
+
177
+ frames = x[:, frame_starts[:, None] + frame_offsets[None, :]] # [1, frames, n_fft]
178
+
179
+ window = np.hanning(win_size).astype(np.float32)
180
+ if n_fft > win_size:
181
+ pad_left = (n_fft - win_size) // 2
182
+ pad_right = n_fft - win_size - pad_left
183
+ window = np.pad(window, (pad_left, pad_right))
184
+ elif n_fft < win_size:
185
+ window = window[:n_fft]
186
+
187
+ frames = frames * window[None, None, :]
188
+
189
+ spec = np.fft.rfft(frames, n=n_fft, axis=-1)
190
+ mag = np.sqrt(np.real(spec) ** 2 + np.imag(spec) ** 2 + 1e-9).astype(np.float32)
191
+ return mag
192
+
193
+
194
+ def mel_spectrogram_numpy(
195
+ y: npt.NDArray[np.float32],
196
+ n_fft: int,
197
+ num_mels: int,
198
+ sampling_rate: int,
199
+ hop_size: int,
200
+ win_size: int,
201
+ fmin: int,
202
+ fmax: int | None = None,
203
+ center: bool = False,
204
+ clip_val: float = 1e-5,
205
+ ) -> npt.NDArray[np.float32]:
206
+ """
207
+ Compute a mel spectrogram in pure NumPy, matching the torch/torchaudio pipeline.
208
+
209
+ This mirrors:
210
+ - librosa.filters.mel(..., norm="slaney")
211
+ - Hann window STFT
212
+ - magnitude spectrogram
213
+ - log compression with clipping
214
+
215
+ Args:
216
+ y: Waveform with shape [1, T].
217
+ n_fft: FFT size.
218
+ num_mels: Number of mel bins.
219
+ sampling_rate: Sampling rate in Hz.
220
+ hop_size: Hop size between frames.
221
+ win_size: Window size.
222
+ fmin: Minimum mel frequency in Hz.
223
+ fmax: Maximum mel frequency in Hz. If None, defaults to sr / 2.
224
+ center: Whether to pad the signal before framing.
225
+ clip_val: Minimum value before log compression.
226
+
227
+ Returns:
228
+ Mel spectrogram with shape [1, frames, num_mels].
229
+ """
230
+ if y.ndim == 1:
231
+ y = np.expand_dims(y, axis=0)
232
+ elif y.ndim == 2 and y.shape[0] != 1:
233
+ raise ValueError("Expected waveform shape [1, T].")
234
+ elif y.ndim > 2:
235
+ raise ValueError("Expected waveform ndim <= 2.")
236
+
237
+ # Input is expected in [-1, 1]; out-of-range samples are tolerated rather than
+ # rejected, so these checks are intentionally no-ops.
+ if np.min(y) < -1.0:
+ pass
+ if np.max(y) > 1.0:
+ pass
241
+
242
+ mel_basis = librosa_style_mel_filterbank(
243
+ sr=sampling_rate,
244
+ n_fft=n_fft,
245
+ n_mels=num_mels,
246
+ fmin=float(fmin),
247
+ fmax=float(fmax) if fmax is not None else None,
248
+ norm="slaney",
249
+ ) # [num_mels, n_fft//2 + 1]
250
+
251
+ spec = _stft_magnitude(
252
+ y,
253
+ n_fft=n_fft,
254
+ hop_size=hop_size,
255
+ win_size=win_size,
256
+ center=center,
257
+ ) # [1, frames, freq]
258
+
259
+ mel_spec = np.matmul(mel_basis[None, :, :], np.transpose(spec, (0, 2, 1)))  # [1, num_mels, frames]
261
+
262
+ mel_spec = np.log(np.clip(mel_spec, a_min=clip_val, a_max=None)).astype(np.float32)
263
+ return mel_spec.transpose(0, 2, 1) # B x T x n_mels
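A quick way to sanity-check the NumPy mel pipeline end to end is to run it on a synthetic tone. The STFT/mel parameters below are illustrative placeholders; the values actually used at inference time come from the preprocessor config that the inferencer loads:

```python
import numpy as np

from src.utils import mel_spectrogram_numpy  # run from the repository root

# One second of a 220 Hz sine at 24 kHz, shaped [1, T] as the function expects.
sr = 24000
t = np.arange(sr, dtype=np.float32) / sr
wav = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)[None, :]

mel = mel_spectrogram_numpy(
    wav,
    n_fft=1024,
    num_mels=80,
    sampling_rate=sr,
    hop_size=256,
    win_size=1024,
    fmin=0,
    fmax=None,
    center=False,
)
print(mel.shape)  # (1, 90, 80): batch x frames x mel bins for this input
```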
test_qwen3-tts-streaming_onnx.py ADDED
@@ -0,0 +1,322 @@
1
+ # Copyright 2026 Patrick Lumbantobing, Vertox-AI
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ """
16
+ End-to-end streaming TTS test script using ONNX Runtime.
17
+ This script demonstrates the full Qwen3-TTS-Realtime ONNX pipeline by:
18
+ 1. Loading six ONNX models (talker LLM, local talker transformer, codec decoder,
19
+ speaker encoder, talker codec embedding, text embedding projection)
20
+ into ONNX Runtime ``InferenceSession`` instances.
21
+ 2. Encoding a reference audio prompt for voice cloning.
22
+ 3. Simulating a streaming LLM text source (character-by-character deltas).
23
+ 4. Running the streaming TTS pipeline to produce audio chunks.
24
+ 5. Writing the concatenated audio to a WAV file.
25
+ Usage:
26
+ python test_qwen3-tts-streaming_onnx.py \
27
+ --talker_model_path qwen3-tts_onnx/talker_model.onnx \
28
+ --talker_local_model_path qwen3-tts_onnx/talker_local_model.onnx \
29
+ --codec_decoder_model_path qwen3-tts_onnx/codec_decoder_model.onnx \
30
+ --speaker_encoder_model_path qwen3-tts_onnx/speaker_encoder_model.onnx \
31
+ --talker_codec_embed_model_path qwen3-tts_onnx/talker_codec_embed_model.onnx \
32
+ --text_embed_proj_model_path qwen3-tts_onnx/text_embed_proj_model.onnx \
33
+ --model_config_path configs/config.json \
34
+ --codec_config_path configs/speech_tokenizer_config.json \
36
+ --preprocessor_config_dir configs/ \
37
+ --temperature 0.85 \
38
+ --top_p 0.8 \
39
+ --top_k 50 \
40
+ --repetition_penalty 1.9 \
41
+ --repetition_window 50 \
42
+ --num_threads 4 \
43
+ --chunk_frames 4 \
44
+ --prompt_wav audio_ref/speaker.[wav|flac|mp3] \
45
+ --out_wav output.wav \
46
+ --text "Text to be synthesized" \
47
+ --language "english"
48
+ """
49
+
50
+ import argparse
51
+ import logging
52
+ import time
53
+ import wave
54
+ from pathlib import Path
55
+ from typing import Iterator
56
+
57
+ import numpy as np
58
+
59
+ from src.inference import Qwen3TTSInferencerONNX
60
+
61
+ logging.basicConfig(
62
+ level=logging.INFO,
63
+ format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
64
+ datefmt="%Y-%m-%d %H:%M:%S",
65
+ )
66
+
67
+ DEFAULT_TALKER_MODEL_PATH = "qwen3-tts_onnx/talker_model.onnx"
68
+ DEFAULT_TALKER_LOCAL_MODEL_PATH = "qwen3-tts_onnx/talker_local_model.onnx"
69
+ DEFAULT_CODEC_DECODER_MODEL_PATH = "qwen3-tts_onnx/codec_decoder_model.onnx"
70
+ DEFAULT_SPEAKER_ENCODER_MODEL_PATH = "qwen3-tts_onnx/speaker_encoder_model.onnx"
71
+ DEFAULT_TALKER_CODEC_EMBED_MODEL_PATH = "qwen3-tts_onnx/talker_codec_embed_model.onnx"
72
+ DEFAULT_TEXT_EMBED_PROJ_MODEL_PATH = "qwen3-tts_onnx/text_embed_proj_model.onnx"
73
+
74
+ DEFAULT_PREPROCESSOR_CONFIG_DIR = "configs/"
75
+ DEFAULT_MODEL_CONFIG_PATH = "configs/config.json"
76
+ DEFAULT_CODEC_CONFIG_PATH = "configs/speech_tokenizer_config.json"
77
+
78
+ # DEFAULT_AUDIO_REF_PATH = "audio_ref/male_stewie.mp3"
79
+ # DEFAULT_AUDIO_REF_PATH = "audio_ref/male_petergriffin.wav"
80
+ # DEFAULT_AUDIO_REF_PATH = "audio_ref/male_old_movie.flac"
81
+ DEFAULT_AUDIO_REF_PATH = "audio_ref/female_shadowheart.flac"
82
+ # DEFAULT_AUDIO_REF_PATH = "audio_ref/david-attenborough.mp3"
83
+ # DEFAULT_AUDIO_REF_PATH = "audio_ref/rick-sanchez.mp3"
84
+
85
+ DEFAULT_OUT_WAV_DIR = "audio_synth/"
86
+ # DEFAULT_OUT_WAV_DIR = "/mnt/d/vertox/Qwen3-TTS/audio_synth/"
87
+
88
+ # DEFAULT_LANGUAGE = "english"
89
+ DEFAULT_LANGUAGE = "russian"
90
+
91
+ DEFAULT_TEMPERATURE = 0.85
92
+ DEFAULT_TOP_P = 0.8
93
+ DEFAULT_TOP_K = 50
94
+ DEFAULT_REPETITION_PENALTY = 1.9
95
+ DEFAULT_REPETITION_WINDOW = 50
96
+
97
+ # DEFAULT_TEXT = "A B"
98
+ # DEFAULT_TEXT = "Один, два"
99
+ # DEFAULT_TEXT = "Test 1 2."
100
+ # DEFAULT_TEXT = "Depending on the time, not only accuracy but also low-latency is important."
101
+ # DEFAULT_TEXT = "Depending on the time, not only accuracy but also low-latency is important. If it is not instant, then the human interaction is lost. We are finally reaching a moment where the technology is fast enough for people to simply communicate, and this is a huge shift for global business."
102
+ # DEFAULT_TEXT="в зависимости от времени не только точность, но и низкая задержка. Если это не мгновенно, то человеческое взаимодействие теряется. Мы наконец-то достигаем момента, когда технология достаточно быстра для того, чтобы люди просто общались, и это является огромным сдвигом для глобального бизнеса."
103
+ DEFAULT_TEXT = "в зависимости от времени не только точность, но и низкая задержка."
104
+ # DEFAULT_TEXT = "в зависимости от времени не только точность, но и низкая задержка. Если это не мгновенно, то человеческое взаимодействие теряется."
105
+ # DEFAULT_TEXT = "В зависимости от времени, важна не только точность, но и низкая задержка. Если это не происходит мгновенно, человеческое взаимодействие утрачивается. Мы наконец подходим к тому моменту, когда технологии становятся достаточно быстрыми. для того чтобы люди могли просто общаться — и это огромный сдвиг для мирового бизнеса."
106
+
107
+
108
+ def fake_llm_text_stream(
109
+ text: str,
110
+ chunk_chars: int = 1,
111
+ delay_s: float = 0.0,
112
+ ) -> Iterator[str]:
113
+ """
114
+ Simulate streaming text deltas from an LLM.
115
+ Each iteration yields `chunk_chars` characters with a delay of `delay_s` seconds.
116
+ In real-world usage, this can be replaced with streaming responses from models such as OpenAI or vLLM (see the sketch after this function).
117
+ """
118
+ if not text:
119
+ return
120
+ step = max(1, chunk_chars)
121
+ for idx in range(0, len(text), step):
122
+ if delay_s > 0 and idx > 0:
123
+ time.sleep(delay_s)
124
+ yield text[idx : idx + step]
125
+
126
+
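As the docstring above notes, `fake_llm_text_stream` can be swapped for a real LLM stream. A minimal sketch using the OpenAI Python SDK (the model name and prompt are placeholder assumptions; a vLLM OpenAI-compatible server works the same way via `base_url`):

```python
from typing import Iterator

from openai import OpenAI  # assumption: `pip install openai` (not a dependency of this repo)


def llm_text_stream(prompt: str, model: str = "gpt-4o-mini") -> Iterator[str]:
    """Yield text deltas from a streaming chat completion, mirroring fake_llm_text_stream."""
    client = OpenAI()  # reads OPENAI_API_KEY; pass base_url=... to point at a vLLM server
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for event in stream:
        if not event.choices:
            continue
        delta = event.choices[0].delta.content
        if delta:
            yield delta
```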
127
+ def write_wav(out_path: Path, sample_rate: int, chunks: Iterator[np.ndarray]) -> None:
128
+ all_chunks: list[np.ndarray] = []
129
+ for chunk in chunks:
130
+ all_chunks.append(chunk.astype(np.float32).reshape(-1))
131
+
132
+ if not all_chunks:
133
+ raise RuntimeError("No audio chunks produced.")
134
+
135
+ audio = np.concatenate(all_chunks)
136
+ # float32 → int16 PCM
137
+ audio = np.clip(audio, -1.0, 1.0)
138
+ pcm16 = (audio * 32767.0).astype(np.int16)
139
+
140
+ out_path.parent.mkdir(parents=True, exist_ok=True)
141
+ with wave.open(str(out_path), "wb") as wf:
142
+ wf.setnchannels(1)
143
+ wf.setsampwidth(2)
144
+ wf.setframerate(int(sample_rate))
145
+ wf.writeframes(pcm16.tobytes())
146
+
147
+
148
+ def decode_audio_frames(
149
+ audio_frames: list[np.ndarray],
150
+ inferencer: Qwen3TTSInferencerONNX,
151
+ ) -> Iterator[np.ndarray]:
152
+ for frame in audio_frames:
153
+ tokens = frame
154
+ if tokens.ndim == 3:
155
+ tokens = tokens[0]
156
+ if tokens.ndim != 2:
157
+ raise ValueError(f"Expected [T, C] audio tokens, got {tuple(tokens.shape)}")
158
+ logging.info(f"tokens {tokens} {tokens.shape}")
159
+ if tokens.size == 0:
160
+ continue
161
+ inferencer.push_tokens(tokens)
162
+ for wav in inferencer.audio_chunks():
163
+ if wav.size == 0:
164
+ continue
165
+ logging.info(f"decode_audio_frames wav {wav} {wav.shape}")
166
+ yield wav.reshape(-1)
167
+
168
+
169
+ def flush_decoder(inferencer: Qwen3TTSInferencerONNX) -> Iterator[np.ndarray]:
170
+ final_chunk = inferencer.flush()
171
+ if final_chunk is not None and final_chunk.size > 0:
172
+ logging.info(f"final_chunk flush {final_chunk} {final_chunk.shape}")
173
+ yield final_chunk.reshape(-1)
174
+
175
+
176
+ # Core: Streaming generation: text delta → push_text → audio
177
+ def run_streaming_tts(
178
+ inferencer: Qwen3TTSInferencerONNX,
179
+ text_deltas: Iterator[str],
180
+ ) -> Iterator[np.ndarray]:
181
+ """
182
+ Receives streaming text deltas, feeds them into the TTS via `inferencer.push_text()`,
183
+ and produces playable WAV chunks in real time.
184
+
185
+ The pipeline mirrors the original Gradio demo:
+ push_text → decode frames → end_text → drain → flush
187
+
188
+ Args:
189
+ inferencer: A Qwen3TTSInferencerONNX instance that has been initialized and on which `reset_turn` has been called.
192
+ text_deltas: An iterator of text deltas (simulating LLM streaming output).
193
+ """
194
+ for delta in text_deltas:
195
+ logging.info(f"delta {delta}")
196
+ audio_frames = inferencer.push_text(delta)
197
+ if len(audio_frames) > 0:
198
+ logging.info(f"audio_frames {audio_frames} {len(audio_frames)}")
199
+ for audio_frame in audio_frames:
200
+ logging.info(f"audio_frame {audio_frame} {audio_frame.shape}")
201
+ yield from decode_audio_frames(audio_frames, inferencer)
202
+
203
+ audio_frames = inferencer.end_text()
204
+ if len(audio_frames) > 0:
205
+ logging.info(f"audio_frames end_text {audio_frames} {len(audio_frames)}")
206
+ for audio_frame in audio_frames:
207
+ logging.info(f"audio_frame end_text {audio_frame} {audio_frame.shape}")
208
+ yield from decode_audio_frames(audio_frames, inferencer)
209
+
210
+ while True:
211
+ audio_frames = inferencer.drain(max_steps=1)
212
+ if len(audio_frames) > 0:
213
+ logging.info(f"audio_frames drain {audio_frames} {len(audio_frames)}")
214
+ for audio_frame in audio_frames:
215
+ logging.info(f"audio_frame drain {audio_frame} {audio_frame.shape}")
216
+ if not audio_frames:
217
+ break
218
+ yield from decode_audio_frames(audio_frames, inferencer)
219
+ if inferencer.is_finished:
220
+ break
221
+
222
+ yield from flush_decoder(inferencer)
223
+
224
+
225
+ def main():
226
+ p = argparse.ArgumentParser(description="Simulated LLM streaming text → TTS streaming audio.")
227
+ p.add_argument("--talker_model_path", type=str, default=DEFAULT_TALKER_MODEL_PATH)
228
+ p.add_argument("--talker_local_model_path", type=str, default=DEFAULT_TALKER_LOCAL_MODEL_PATH)
229
+ p.add_argument("--codec_decoder_model_path", type=str, default=DEFAULT_CODEC_DECODER_MODEL_PATH)
230
+ p.add_argument("--speaker_encoder_model_path", type=str, default=DEFAULT_SPEAKER_ENCODER_MODEL_PATH)
231
+ p.add_argument("--talker_codec_embed_model_path", type=str, default=DEFAULT_TALKER_CODEC_EMBED_MODEL_PATH)
232
+ p.add_argument("--text_embed_proj_model_path", type=str, default=DEFAULT_TEXT_EMBED_PROJ_MODEL_PATH)
233
+ p.add_argument("--model_config_path", type=str, default=DEFAULT_MODEL_CONFIG_PATH)
234
+ p.add_argument("--codec_config_path", type=str, default=DEFAULT_CODEC_CONFIG_PATH)
235
+ p.add_argument("--preprocessor_config_dir", type=str, default=DEFAULT_PREPROCESSOR_CONFIG_DIR)
236
+ p.add_argument("--temperature", type=float, default=DEFAULT_TEMPERATURE)
237
+ p.add_argument("--top_p", type=float, default=DEFAULT_TOP_P)
238
+ p.add_argument("--top_k", type=int, default=DEFAULT_TOP_K)
239
+ p.add_argument("--repetition_penalty", type=float, default=DEFAULT_REPETITION_PENALTY)
240
+ p.add_argument("--repetition_window", type=int, default=DEFAULT_REPETITION_WINDOW)
241
+ # Simulated LLM streaming parameters
242
+ p.add_argument(
+ "--delta_chunk_chars", type=int, default=1, help="Number of characters emitted per delta (1 = character-by-character)"
+ )
245
+ p.add_argument(
+ "--delta_delay_s", type=float, default=0.0, help="Simulated delay in seconds between deltas (0 = no delay)"
+ )
248
+ p.add_argument("--num_threads", type=int, default=4, help="Number of threads for the ONNX Runtime session (SessionOptions.intra_op_num_threads)")
249
+ p.add_argument(
250
+ "--chunk_frames",
251
+ type=int,
252
+ default=4,
253
+ help="Number of chunk frames for codec decoder forward [default: 4 frames (0.32 s)]",
254
+ )
255
+ p.add_argument("--prompt_wav", type=str, default=DEFAULT_AUDIO_REF_PATH)
256
+ p.add_argument("--out_wav", type=str, default=None)
257
+ p.add_argument(
258
+ "--text",
259
+ type=str,
260
+ default=DEFAULT_TEXT,
261
+ )
262
+ p.add_argument(
263
+ "--language",
264
+ type=str,
265
+ default=DEFAULT_LANGUAGE,
266
+ )
267
+
268
+ args = p.parse_args()
269
+ inferencer = Qwen3TTSInferencerONNX(
270
+ talker_model_path=args.talker_model_path,
271
+ talker_local_model_path=args.talker_local_model_path,
272
+ codec_decoder_model_path=args.codec_decoder_model_path,
273
+ speaker_encoder_model_path=args.speaker_encoder_model_path,
274
+ talker_codec_embed_model_path=args.talker_codec_embed_model_path,
275
+ text_embed_proj_model_path=args.text_embed_proj_model_path,
276
+ preprocessor_config_dir=args.preprocessor_config_dir,
277
+ model_config_path=args.model_config_path,
278
+ codec_config_path=args.codec_config_path,
279
+ audio_ref_path=args.prompt_wav,
280
+ language=args.language,
281
+ num_threads=args.num_threads,
282
+ chunk_frames=args.chunk_frames,
283
+ temperature=args.temperature,
284
+ top_p=args.top_p,
285
+ top_k=args.top_k,
286
+ repetition_penalty=args.repetition_penalty,
287
+ repetition_window=args.repetition_window,
288
+ )
289
+ logging.info("Inferencer loaded.")
290
+ logging.info(inferencer)
291
+
292
+ inferencer.reset_turn(reset_cache=True)
293
+ logging.info("State initialized.")
294
+
295
+ text_deltas = fake_llm_text_stream(
296
+ args.text,
297
+ chunk_chars=args.delta_chunk_chars,
298
+ delay_s=args.delta_delay_s,
299
+ )
300
+
301
+ logging.info("Running streaming tts simulation...")
302
+ wav_chunks = run_streaming_tts(
303
+ inferencer=inferencer,
304
+ text_deltas=text_deltas,
305
+ )
306
+ logging.info("Streaming generator created; synthesis runs while the WAV file is written.")
307
+
308
+ if args.out_wav is None:
309
+ out_wav_dir = Path(DEFAULT_OUT_WAV_DIR).expanduser()
310
+ out_wav_dir.mkdir(parents=True, exist_ok=True)
311
+ out_wav_path = out_wav_dir / f"output_{time.time()}.wav"
312
+ else:
313
+ out_wav_path = Path(args.out_wav).expanduser()
314
+ out_wav_path.parent.mkdir(parents=True, exist_ok=True)
315
+
316
+ write_wav(out_wav_path, inferencer.output_sample_rate, wav_chunks)
317
+ # For real-time use, replace write_wav with a streaming playback sink (see the sketch below).
318
+ logging.info(f"\n[OK] Write complete: {out_wav_path}")
319
+
320
+
321
+ if __name__ == "__main__":
322
+ main()
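For real-time output, the final `write_wav` call can be replaced with a playback sink that consumes the `wav_chunks` iterator as it is produced. A minimal sketch assuming the third-party `sounddevice` package (not a dependency of this repository):

```python
import numpy as np
import sounddevice as sd  # assumption: `pip install sounddevice`


def play_stream(wav_chunks, sample_rate: int) -> None:
    """Play float32 mono chunks as they arrive instead of buffering a WAV file."""
    with sd.OutputStream(samplerate=sample_rate, channels=1, dtype="float32") as stream:
        for chunk in wav_chunks:
            stream.write(np.ascontiguousarray(chunk, dtype=np.float32).reshape(-1, 1))


# e.g. play_stream(wav_chunks, inferencer.output_sample_rate) in place of write_wav(...)
```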