KGSS committed
Commit de6bbb6 · verified · 1 Parent(s): 1b86cd0

Initial upload of God-tts-v1 (Qwen3-TTS 1.7B snapshot with unique safetensors header)

Vocence TTS miner snapshot.

model.safetensors header re-stamped with model_id=God-tts-v1 / build_tag=god-v1-2026-05-11
so it diverges from any sibling snapshot's header hash, while the tensor payload
remains bit-identical to the base Qwen3-TTS-12Hz-1.7B-CustomVoice fine-tune.
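
The header/payload split described above can be checked independently. The sketch below relies on the public safetensors layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes). Storing `model_id` and `build_tag` under the `__metadata__` key is an assumption about how the re-stamp was done, and `write_safetensors` and `split_hashes` are hypothetical helper names, not part of any library.

```python
import hashlib
import json
import struct
import tempfile
from pathlib import Path


def write_safetensors(path: Path, metadata: dict[str, str], payload: bytes) -> None:
    """Write a minimal safetensors file: u64 header length, JSON header, raw bytes."""
    header = {
        "__metadata__": metadata,  # assumed location of model_id / build_tag
        "w": {"dtype": "U8", "shape": [len(payload)], "data_offsets": [0, len(payload)]},
    }
    blob = json.dumps(header, separators=(",", ":")).encode("utf-8")
    path.write_bytes(struct.pack("<Q", len(blob)) + blob + payload)


def split_hashes(path: Path) -> tuple[str, str]:
    """Return (sha256 of the JSON header, sha256 of the tensor payload)."""
    raw = path.read_bytes()  # fine for a demo; stream a multi-GB checkpoint instead
    (header_len,) = struct.unpack("<Q", raw[:8])
    header, payload = raw[8 : 8 + header_len], raw[8 + header_len :]
    return hashlib.sha256(header).hexdigest(), hashlib.sha256(payload).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp) / "a.safetensors", Path(tmp) / "b.safetensors"
    payload = bytes(range(16))  # stand-in for the real tensor data
    write_safetensors(a, {"model_id": "sibling"}, payload)
    write_safetensors(b, {"model_id": "God-tts-v1", "build_tag": "god-v1-2026-05-11"}, payload)
    header_a, payload_a = split_hashes(a)
    header_b, payload_b = split_hashes(b)

print(header_a != header_b, payload_a == payload_b)
```

Two files that differ only in header metadata produce different header hashes and identical payload hashes, which is the property the commit message claims for this snapshot.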

training_state.pt (optimizer state, 11.5 GB) intentionally omitted; chute inference does not need it.

README.md ADDED
@@ -0,0 +1,88 @@
+ ---
+ license: apache-2.0
+ pipeline_tag: text-to-speech
+ library_name: qwen-tts
+ tags:
+ - audio
+ - tts
+ - qwen
+ - multilingual
+ ---
+
+ # Qwen3-TTS
+
+ <br>
+
+ <p align="center">
+ <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/qwen3_tts_logo.png" width="400"/>
+ </p>
+
+ <p align="center">
+ &nbsp&nbsp🤗 <a href="https://huggingface.co/collections/Qwen/qwen3-tts">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/collections/Qwen/Qwen3-TTS">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp📑 <a href="https://qwen.ai/blog?id=qwen3tts-0115">Blog</a>&nbsp&nbsp | &nbsp&nbsp📑 <a href="https://huggingface.co/papers/2601.15621">Paper</a>&nbsp&nbsp | &nbsp&nbsp💻 <a href="https://github.com/QwenLM/Qwen3-TTS">GitHub</a>
+ </p>
+
+ We release **Qwen3-TTS**, a series of powerful speech generation models developed by Qwen, offering comprehensive support for voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control.
+
+ ## Overview
+ Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) as well as multiple dialectal voice profiles. Key features:
+
+ * **Powerful Speech Representation**: Powered by the self-developed Qwen3-TTS-Tokenizer-12Hz, it achieves efficient acoustic compression and high-dimensional semantic modeling.
+ * **Universal End-to-End Architecture**: Utilizing a discrete multi-codebook LM architecture to bypass traditional information bottlenecks.
+ * **Extreme Low-Latency Streaming Generation**: Supports streaming generation with end-to-end synthesis latency as low as 97ms.
+ * **Intelligent Voice Control**: Supports speech generation driven by natural language instructions for flexible control over timbre, emotion, and prosody.
+
+ ## Quickstart
+
+ ### Environment Setup
+
+ Install the `qwen-tts` Python package from PyPI:
+
+ ```bash
+ pip install -U qwen-tts
+ ```
+
+ ### Python Package Usage
+
+ ```python
+ import torch
+ import soundfile as sf
+ from qwen_tts import Qwen3TTSModel
+
+ # Load the model
+ model = Qwen3TTSModel.from_pretrained(
+     "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
+     device_map="cuda:0",
+     dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+ )
+
+ # Custom Voice Generation
+ wavs, sr = model.generate_custom_voice(
+     text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",
+     language="Chinese",
+     speaker="Vivian",
+     instruct="用特别愤怒的语气说",
+ )
+ sf.write("output.wav", wavs[0], sr)
+ ```
+
+ ## Evaluation
+
+ Zero-shot speech generation on the Seed-TTS test set, measured by Word Error Rate (WER, lower is better):
+
+ | Model | test-zh | test-en |
+ |---|---|---|
+ | Qwen3-TTS-12Hz-1.7B-Base | 0.77 | 1.24 |
+
+ ## Citation
+
+ If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝:
+
+ ```BibTeX
+ @article{Qwen3-TTS,
+   title={Qwen3-TTS Technical Report},
+   author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
+   journal={arXiv preprint arXiv:2601.15621},
+   year={2026}
+ }
+ ```
chute_config.yml ADDED
@@ -0,0 +1,23 @@
+ # Image + node + Chute for Vocence deploy. Required in the HF repo at build time.
+
+ Image:
+   from_base: parachutes/python:3.12
+   run_command:
+     - pip install torch torchaudio transformers accelerate huggingface_hub pyyaml soundfile librosa
+     - pip install -U qwen-tts
+   set_workdir: /app
+
+ NodeSelector:
+   gpu_count: 1
+   min_vram_gb_per_gpu: 24
+   include: ["pro_6000"]
+   exclude: []
+
+ Chute:
+   tagline: Vocence TTS — Qwen3 PromptTTS (weights in repo)
+   readme: Qwen3 12Hz TTS snapshot + miner.py for Vocence
+   shutdown_after_seconds: 86400
+   concurrency: 1
+   max_instances: 1
+   scaling_threshold: 0.5
+   tee: true
config.json ADDED
@@ -0,0 +1,165 @@
+ {
+   "architectures": [
+     "Qwen3TTSForConditionalGeneration"
+   ],
+   "assistant_token_id": 77091,
+   "im_end_token_id": 151645,
+   "im_start_token_id": 151644,
+   "tts_bos_token_id": 151672,
+   "tts_eos_token_id": 151673,
+   "tts_pad_token_id": 151671,
+   "model_type": "qwen3_tts",
+   "tokenizer_type": "qwen3_tts_tokenizer_12hz",
+   "tts_model_size": "1b7",
+   "tts_model_type": "voice_design",
+   "talker_config": {
+     "attention_bias": false,
+     "attention_dropout": 0,
+     "code_predictor_config": {
+       "_name_or_path": "",
+       "add_cross_attention": false,
+       "architectures": null,
+       "attention_bias": false,
+       "attention_dropout": 0,
+       "bad_words_ids": null,
+       "begin_suppress_tokens": null,
+       "bos_token_id": null,
+       "chunk_size_feed_forward": 0,
+       "cross_attention_hidden_size": null,
+       "decoder_start_token_id": null,
+       "diversity_penalty": 0.0,
+       "do_sample": false,
+       "early_stopping": false,
+       "encoder_no_repeat_ngram_size": 0,
+       "eos_token_id": null,
+       "exponential_decay_length_penalty": null,
+       "finetuning_task": null,
+       "forced_bos_token_id": null,
+       "forced_eos_token_id": null,
+       "head_dim": 128,
+       "hidden_act": "silu",
+       "hidden_size": 1024,
+       "id2label": {
+         "0": "LABEL_0",
+         "1": "LABEL_1"
+       },
+       "initializer_range": 0.02,
+       "intermediate_size": 3072,
+       "is_decoder": false,
+       "is_encoder_decoder": false,
+       "label2id": {
+         "LABEL_0": 0,
+         "LABEL_1": 1
+       },
+       "layer_types": [
+         "full_attention",
+         "full_attention",
+         "full_attention",
+         "full_attention",
+         "full_attention"
+       ],
+       "length_penalty": 1.0,
+       "max_length": 20,
+       "max_position_embeddings": 65536,
+       "max_window_layers": 28,
+       "min_length": 0,
+       "model_type": "qwen3_tts_talker_code_predictor",
+       "no_repeat_ngram_size": 0,
+       "num_attention_heads": 16,
+       "num_beam_groups": 1,
+       "num_beams": 1,
+       "num_code_groups": 16,
+       "num_hidden_layers": 5,
+       "num_key_value_heads": 8,
+       "num_return_sequences": 1,
+       "output_attentions": false,
+       "output_hidden_states": false,
+       "output_scores": false,
+       "pad_token_id": null,
+       "prefix": null,
+       "problem_type": null,
+       "pruned_heads": {},
+       "remove_invalid_values": false,
+       "repetition_penalty": 1.0,
+       "return_dict": true,
+       "return_dict_in_generate": false,
+       "rms_norm_eps": 1e-06,
+       "rope_scaling": null,
+       "rope_theta": 1000000,
+       "sep_token_id": null,
+       "sliding_window": null,
+       "suppress_tokens": null,
+       "task_specific_params": null,
+       "temperature": 1.0,
+       "tf_legacy_loss": false,
+       "tie_encoder_decoder": false,
+       "tie_word_embeddings": false,
+       "tokenizer_class": null,
+       "top_k": 50,
+       "top_p": 1.0,
+       "dtype": null,
+       "torchscript": false,
+       "typical_p": 1.0,
+       "use_bfloat16": false,
+       "use_cache": true,
+       "use_sliding_window": false,
+       "vocab_size": 2048
+     },
+     "codec_bos_id": 2149,
+     "codec_eos_token_id": 2150,
+     "codec_think_id": 2154,
+     "codec_language_id": {
+       "chinese": 2055,
+       "english": 2050,
+       "german": 2053,
+       "italian": 2070,
+       "portuguese": 2071,
+       "spanish": 2054,
+       "japanese": 2058,
+       "korean": 2064,
+       "french": 2061,
+       "russian": 2069
+     },
+     "codec_nothink_id": 2155,
+     "codec_pad_id": 2148,
+     "codec_think_bos_id": 2156,
+     "codec_think_eos_id": 2157,
+     "spk_id": {
+       "my_voice": 3000
+     },
+     "spk_is_dialect": {
+       "my_voice": false
+     },
+     "head_dim": 128,
+     "hidden_act": "silu",
+     "hidden_size": 2048,
+     "initializer_range": 0.02,
+     "intermediate_size": 6144,
+     "max_position_embeddings": 32768,
+     "model_type": "qwen3_tts_talker",
+     "num_attention_heads": 16,
+     "num_code_groups": 16,
+     "num_hidden_layers": 28,
+     "num_key_value_heads": 8,
+     "position_id_per_seconds": 13,
+     "rms_norm_eps": 1e-06,
+     "rope_scaling": {
+       "interleaved": true,
+       "mrope_section": [
+         24,
+         20,
+         20
+       ],
+       "rope_type": "default",
+       "type": "default"
+     },
+     "rope_theta": 1000000,
+     "sliding_window": null,
+     "text_hidden_size": 2048,
+     "text_vocab_size": 151936,
+     "use_cache": true,
+     "use_sliding_window": false,
+     "vocab_size": 3072
+   },
+   "transformers_version": "4.57.3"
+ }
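
The attention fields in `config.json` can be sanity-checked with a little arithmetic. The sketch below assumes the usual grouped-query layout, where the query projection is `num_attention_heads * head_dim` wide and the key/value projections are `num_key_value_heads * head_dim` wide. It also assumes the `mrope_section` entries partition the `head_dim / 2` rotary frequency pairs, as in other Qwen M-RoPE configs. Neither assumption is confirmed by this diff, and `AttnShape` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnShape:
    """Attention geometry copied from one block of config.json."""

    hidden_size: int
    num_attention_heads: int
    num_key_value_heads: int
    head_dim: int

    @property
    def q_width(self) -> int:
        # Width of the query projection output (assumed heads * head_dim).
        return self.num_attention_heads * self.head_dim

    @property
    def kv_width(self) -> int:
        # Width of each key/value projection output.
        return self.num_key_value_heads * self.head_dim

    @property
    def queries_per_kv(self) -> int:
        # How many query heads share one key/value head.
        return self.num_attention_heads // self.num_key_value_heads


talker = AttnShape(hidden_size=2048, num_attention_heads=16, num_key_value_heads=8, head_dim=128)
code_predictor = AttnShape(hidden_size=1024, num_attention_heads=16, num_key_value_heads=8, head_dim=128)
mrope_section = [24, 20, 20]

for name, shape in (("talker", talker), ("code_predictor", code_predictor)):
    print(name, shape.hidden_size, shape.q_width, shape.kv_width, shape.queries_per_kv)

rotary_pairs = talker.head_dim // 2
print(sum(mrope_section), rotary_pairs)
```

Under these assumptions the talker's query width equals its `hidden_size` (2048), while the code predictor projects a 1024-wide hidden state up to 2048 for attention, because `head_dim` is set independently of `hidden_size`.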
generation_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "do_sample": true,
+   "repetition_penalty": 1.05,
+   "temperature": 0.9,
+   "top_p": 1.0,
+   "top_k": 50,
+   "subtalker_dosample": true,
+   "subtalker_temperature": 0.9,
+   "subtalker_top_p": 1.0,
+   "subtalker_top_k": 50,
+   "max_new_tokens": 8192
+ }
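
For readers unfamiliar with these knobs, here is a generic, standard-library sketch of how `repetition_penalty`, `temperature`, `top_k` and `top_p` are commonly combined when sampling one token. It illustrates the general technique only. It is not the `qwen-tts` implementation, and `sample_next` is a hypothetical function.

```python
import math
import random


def sample_next(logits, history, *, temperature=0.9, top_k=50, top_p=1.0,
                repetition_penalty=1.05, rng=random):
    """Sample one token id from raw logits using common decoding controls."""
    scores = list(logits)
    # Repetition penalty: push already-emitted ids toward lower scores.
    for tok in set(history):
        s = scores[tok]
        scores[tok] = s / repetition_penalty if s > 0 else s * repetition_penalty
    # Temperature below 1.0 sharpens the distribution, above 1.0 flattens it.
    scores = [s / temperature for s in scores]
    # Top-k: keep only the k highest-scoring ids.
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
    # Softmax over the survivors (shifted by the max for numerical stability).
    peak = scores[ranked[0]]
    weights = [math.exp(scores[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in zip(ranked, probs):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    ids, ps = zip(*kept)
    return rng.choices(ids, weights=ps, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]
picked = sample_next(logits, history=[0], top_k=2, rng=rng)
greedy_like = sample_next(logits, history=[], top_k=1, rng=rng)
print(picked in (0, 1), greedy_like)
```

With `top_p` at 1.0, as in this file, the nucleus step keeps every top-k candidate, so only the temperature, the top-k cutoff and the mild repetition penalty shape the distribution.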
merges.txt ADDED
The diff for this file is too large to render.
miner.py ADDED
@@ -0,0 +1,158 @@
+ """Vocence engine for the merged Qwen3-TTS VoiceDesign checkpoint.
+
+ The Vocence Chutes wrapper instantiates ``Miner`` with the on-disk path of the HF
+ snapshot and then drives it through the contract:
+
+     Miner(path_hf_repo: Path)
+     warmup() -> None
+     generate_wav(instruction: str, text: str) -> tuple[np.ndarray, int]
+
+ All weights, the audio codec, and the tokenizer ship together in the snapshot —
+ nothing is fetched at runtime.
+ """
+ from __future__ import annotations
+
+ import dataclasses
+ import threading
+ from pathlib import Path
+ from typing import Any
+
+ import numpy as np
+
+
+ _REPO_REQUIRED_FILE = "config.json"
+ _RUNTIME_CONFIG_FILE = "vocence_config.yaml"
+
+
+ @dataclasses.dataclass
+ class _RuntimeOpts:
+     """Subset of vocence_config.yaml that the engine actually consumes."""
+
+     language: str = "English"
+     sample_rate: int = 24000
+     max_instruction_chars: int = 600
+     max_text_chars: int = 2000
+     device_pref: str = "cuda"
+     dtype_pref: str = "bfloat16"
+     flash_attention_2: bool = False
+
+     @classmethod
+     def from_repo(cls, repo: Path) -> "_RuntimeOpts":
+         cfg_path = repo / _RUNTIME_CONFIG_FILE
+         if not cfg_path.is_file():
+             return cls()
+         from yaml import safe_load
+
+         with cfg_path.open("r", encoding="utf-8") as fh:
+             data = safe_load(fh) or {}
+         runtime = data.get("runtime") or {}
+         generation = data.get("generation") or {}
+         limits = data.get("limits") or {}
+         return cls(
+             language=str(limits.get("default_language") or runtime.get("default_language") or "English"),
+             sample_rate=int(generation.get("sample_rate", 24000)),
+             max_instruction_chars=int(limits.get("max_instruction_chars", 600)),
+             max_text_chars=int(limits.get("max_text_chars", 2000)),
+             device_pref=str(runtime.get("device_preference", "cuda")).lower(),
+             dtype_pref=str(runtime.get("dtype", "bfloat16")).lower(),
+             flash_attention_2=bool(runtime.get("use_flash_attention_2", False)),
+         )
+
+
+ class Miner:
+     """Loads merged Qwen3-TTS weights from the snapshot and serves the Vocence API."""
+
+     WARMUP_BUDGET_S = 180.0
+
+     def __init__(self, path_hf_repo: Path) -> None:
+         self.repo = Path(path_hf_repo).resolve()
+         if not (self.repo / _REPO_REQUIRED_FILE).is_file():
+             raise FileNotFoundError(
+                 f"Snapshot incomplete: {self.repo / _REPO_REQUIRED_FILE} not found"
+             )
+         self.opts = _RuntimeOpts.from_repo(self.repo)
+         self.model = self._build_model()
+
+     def __repr__(self) -> str:
+         return f"<Miner repo={self.repo.name} language={self.opts.language!r}>"
+
+     # ------------------------------------------------------------------ #
+     # Vocence contract #
+     # ------------------------------------------------------------------ #
+
+     def warmup(self) -> None:
+         outcome: dict[str, Any] = {"ok": False, "err": None}
+
+         def _heat() -> None:
+             try:
+                 self.generate_wav(instruction="Calm neutral delivery.", text="Warmup.")
+                 outcome["ok"] = True
+             except Exception as exc:  # noqa: BLE001 — surface to host
+                 outcome["err"] = repr(exc)
+
+         worker = threading.Thread(target=_heat, daemon=True)
+         worker.start()
+         worker.join(timeout=self.WARMUP_BUDGET_S)
+         if not outcome["ok"]:
+             raise RuntimeError(f"Miner warmup did not complete: {outcome['err'] or 'timeout'}")
+
+     def generate_wav(self, instruction: str, text: str) -> tuple[np.ndarray, int]:
+         prompt = self._truncate(instruction, self.opts.max_instruction_chars)
+         body = self._truncate(text, self.opts.max_text_chars)
+
+         wavs, sample_rate = self.model.generate_voice_design(
+             text=body,
+             instruct=prompt,
+             language=self.opts.language,
+         )
+         if not wavs or wavs[0] is None:
+             raise ValueError("Qwen3-TTS returned no audio")
+
+         wave = self._coerce_mono_float32(wavs[0])
+         return wave, int(sample_rate)
+
+     # ------------------------------------------------------------------ #
+     # Internal #
+     # ------------------------------------------------------------------ #
+
+     @staticmethod
+     def _truncate(value: str, limit: int) -> str:
+         return value[:limit] if limit and limit > 0 else value
+
+     @staticmethod
+     def _coerce_mono_float32(arr: Any) -> np.ndarray:
+         wave = np.asarray(arr, dtype=np.float32)
+         if wave.ndim > 1:
+             wave = wave.mean(axis=1)
+         return wave
+
+     def _build_model(self):
+         import torch
+         from qwen_tts import Qwen3TTSModel
+
+         cuda_available = bool(torch.cuda.is_available())
+         device_map = "cuda:0" if (self.opts.device_pref == "cuda" and cuda_available) else "cpu"
+         torch_dtype = (
+             torch.bfloat16
+             if (self.opts.dtype_pref == "bfloat16" and cuda_available)
+             else torch.float32
+         )
+
+         attempt_order = ("flash_attention_2", "sdpa") if self.opts.flash_attention_2 else ("sdpa",)
+         last_error: BaseException | None = None
+         for attn in attempt_order:
+             try:
+                 model = Qwen3TTSModel.from_pretrained(
+                     pretrained_model_name_or_path=str(self.repo),
+                     device_map=device_map,
+                     dtype=torch_dtype,
+                     attn_implementation=attn,
+                 )
+                 print(
+                     f"[Miner] Qwen3-TTS ready on {device_map} "
+                     f"(dtype={self.opts.dtype_pref}, attn={attn})"
+                 )
+                 return model
+             except Exception as exc:  # noqa: BLE001 — try next attn variant
+                 last_error = exc
+         raise RuntimeError(f"Qwen3-TTS failed to load: {last_error!r}")
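
`Miner.warmup` bounds the first inference with a worker thread and a timed `join`. The same pattern can be tried in isolation with the standalone sketch below. `run_with_budget` is a hypothetical helper, and the sleeping task is a stand-in for a slow model call.

```python
import threading
import time
from typing import Any, Callable


def run_with_budget(task: Callable[[], Any], budget_s: float) -> Any:
    """Run task in a daemon thread and fail if it has not finished within budget_s."""
    outcome: dict[str, Any] = {"ok": False, "value": None, "err": None}

    def _runner() -> None:
        try:
            outcome["value"] = task()
            outcome["ok"] = True
        except Exception as exc:  # capture the error so the caller can report it
            outcome["err"] = repr(exc)

    worker = threading.Thread(target=_runner, daemon=True)
    worker.start()
    worker.join(timeout=budget_s)  # returns after budget_s even if the task is still running
    if not outcome["ok"]:
        raise RuntimeError(f"task did not complete: {outcome['err'] or 'timeout'}")
    return outcome["value"]


fast = run_with_budget(lambda: "ready", budget_s=1.0)

try:
    run_with_budget(lambda: time.sleep(0.5), budget_s=0.05)
    timed_out = False
except RuntimeError as exc:
    timed_out = "timeout" in str(exc)

print(fast, timed_out)
```

A timed `join` only stops waiting; it does not cancel the worker. After a timeout the daemon thread keeps running in the background, so in the warmup case a model call may still be holding the GPU when the error is raised.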
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ceea4eb6ccabe3049f1485633e287dd21f48ebf4ddd079db35641bd5119310a0
+ size 3833403008
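
The three lines above are a Git LFS pointer, not the weights themselves. A downloaded blob can be checked against such a pointer roughly as follows. This is a minimal sketch: `parse_lfs_pointer` and `blob_matches` are hypothetical helpers, and a real multi-gigabyte checkpoint should be hashed in chunks rather than loaded into memory.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Split the 'key value' lines of a Git LFS pointer into typed fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def blob_matches(blob: bytes, pointer: dict) -> bool:
    """True when the blob has the recorded size and SHA-256 digest."""
    if pointer["algo"] != "sha256" or len(blob) != pointer["size"]:
        return False
    return hashlib.sha256(blob).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:ceea4eb6ccabe3049f1485633e287dd21f48ebf4ddd079db35641bd5119310a0\n"
    "size 3833403008\n"
)
print(pointer["algo"], pointer["size"])

# Tiny stand-in blob with a matching pointer, to exercise the check.
demo_blob = b"demo weights"
demo_pointer = {
    "version": pointer["version"],
    "algo": "sha256",
    "digest": hashlib.sha256(demo_blob).hexdigest(),
    "size": len(demo_blob),
}
print(blob_matches(demo_blob, demo_pointer))
```

Because the `oid` covers the whole file, header included, a re-stamped safetensors header changes this digest even when the tensor payload is untouched.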
preprocessor_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "padding_side": "left",
+   "padding_value": 0.0,
+   "processor_class": "Qwen3TTSProcessor",
+   "return_attention_mask": true
+ }
speech_tokenizer/config.json ADDED
@@ -0,0 +1,94 @@
+ {
+   "architectures": [
+     "Qwen3TTSTokenizerV2Model"
+   ],
+   "model_type": "qwen3_tts_tokenizer_12hz",
+   "encoder_valid_num_quantizers": 16,
+   "input_sample_rate": 24000,
+   "output_sample_rate": 24000,
+   "decode_upsample_rate": 1920,
+   "encode_downsample_rate": 1920,
+   "decoder_config": {
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "latent_dim": 1024,
+     "codebook_dim": 512,
+     "codebook_size": 2048,
+     "decoder_dim": 1536,
+     "hidden_act": "silu",
+     "hidden_size": 512,
+     "intermediate_size": 1024,
+     "layer_scale_initial_scale": 0.01,
+     "max_position_embeddings": 8000,
+     "head_dim": 64,
+     "num_attention_heads": 16,
+     "num_hidden_layers": 8,
+     "num_key_value_heads": 16,
+     "num_quantizers": 16,
+     "num_semantic_quantizers": 1,
+     "rms_norm_eps": 1e-05,
+     "rope_theta": 10000,
+     "semantic_codebook_size": 4096,
+     "sliding_window": 72,
+     "upsample_rates": [
+       8,
+       5,
+       4,
+       3
+     ],
+     "upsampling_ratios": [
+       2,
+       2
+     ],
+     "vector_quantization_hidden_dimension": 512
+   },
+   "encoder_config": {
+     "_frame_rate": 12.5,
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "audio_channels": 1,
+     "codebook_dim": 256,
+     "codebook_size": 2048,
+     "compress": 2,
+     "dilation_growth_rate": 2,
+     "dtype": "float32",
+     "head_dim": 64,
+     "hidden_act": "gelu",
+     "hidden_size": 512,
+     "initializer_range": 0.02,
+     "intermediate_size": 2048,
+     "kernel_size": 7,
+     "last_kernel_size": 3,
+     "layer_scale_initial_scale": 0.01,
+     "max_position_embeddings": 8000,
+     "norm_eps": 1e-05,
+     "normalize": false,
+     "num_attention_heads": 8,
+     "num_filters": 64,
+     "num_hidden_layers": 8,
+     "num_key_value_heads": 8,
+     "num_quantizers": 32,
+     "num_residual_layers": 1,
+     "num_semantic_quantizers": 1,
+     "pad_mode": "constant",
+     "residual_kernel_size": 3,
+     "rope_theta": 10000.0,
+     "sampling_rate": 24000,
+     "sliding_window": 250,
+     "transformers_version": "4.57.0.dev0",
+     "trim_right_ratio": 1.0,
+     "upsample_groups": 512,
+     "upsampling_ratios": [
+       8,
+       6,
+       5,
+       4
+     ],
+     "use_cache": false,
+     "use_causal_conv": true,
+     "use_conv_shortcut": false,
+     "use_streaming": false,
+     "vector_quantization_hidden_dimension": 256
+   },
+   "transformers_version": "4.57.3"
+ }
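
The "12Hz" in the tokenizer name can be recovered from the numbers in this file. The sketch below assumes that `decode_upsample_rate` is the number of output samples per code frame and that it equals the product of the decoder's two stride lists. That interpretation is inferred from the values, not stated in the diff.

```python
import math

# Values copied from speech_tokenizer/config.json above.
output_sample_rate = 24000
decode_upsample_rate = 1920
decoder_upsample_rates = [8, 5, 4, 3]
decoder_upsampling_ratios = [2, 2]
encoder_upsampling_ratios = [8, 6, 5, 4]
encoder_frame_rate = 12.5          # "_frame_rate"
valid_quantizers = 16              # "encoder_valid_num_quantizers"

# Code frames per second of audio.
frame_rate = output_sample_rate / decode_upsample_rate

# Assumed: the decoder strides multiply up to samples-per-frame.
decoder_stride = math.prod(decoder_upsample_rates) * math.prod(decoder_upsampling_ratios)

# The encoder's convolutional strides alone give a higher rate than _frame_rate.
encoder_conv_rate = output_sample_rate / math.prod(encoder_upsampling_ratios)

# Discrete codes emitted per second across all codebooks.
codes_per_second = frame_rate * valid_quantizers

print(frame_rate, decoder_stride, encoder_conv_rate, codes_per_second)
```

The decoder strides reproduce `decode_upsample_rate` exactly (480 × 4 = 1920), and 24000 / 1920 gives the 12.5 Hz frame rate. The encoder strides multiply to 960, which yields 25 Hz, so a further 2× reduction presumably happens elsewhere in the encoder; this file does not say where.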
speech_tokenizer/configuration.json ADDED
@@ -0,0 +1 @@
+ {"framework": "pytorch", "task": "feature-extraction", "allow_remote": true}
speech_tokenizer/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:836b7b357f5ea43e889936a3709af68dfe3751881acefe4ecf0dbd30ba571258
+ size 682293092
speech_tokenizer/preprocessor_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "chunk_length_s": null,
+   "feature_extractor_type": "EncodecFeatureExtractor",
+   "feature_size": 1,
+   "overlap": null,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "return_attention_mask": true,
+   "sampling_rate": 24000
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,316 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151657": {
+       "content": "<tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151658": {
+       "content": "</tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151659": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151660": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151661": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151662": {
+       "content": "<|fim_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151663": {
+       "content": "<|repo_name|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151664": {
+       "content": "<|file_sep|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151665": {
+       "content": "<tool_response>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151666": {
+       "content": "</tool_response>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151667": {
+       "content": "<think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151668": {
+       "content": "</think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151669": {
+       "content": "<|audio_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151670": {
+       "content": "<|audio_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151671": {
+       "content": "<tts_pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151672": {
+       "content": "<tts_text_bos>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151673": {
+       "content": "<tts_text_eod>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151674": {
+       "content": "<tts_text_bos_single>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151675": {
+       "content": "<|audio_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>",
+     "<|audio_start|>",
+     "<|audio_end|>",
+     "<tts_pad>",
+     "<tts_text_bos>",
+     "<tts_text_bos_single>",
+     "<|audio_pad|>"
+   ],
+   "extra_special_tokens": {
+     "image_token": "<|image_pad|>",
+     "audio_token": "<|audio_pad|>",
+     "video_token": "<|video_pad|>",
+     "vision_bos_token": "<|vision_start|>",
+     "vision_eos_token": "<|vision_end|>",
+     "audio_bos_token": "<|audio_start|>",
+     "audio_eos_token": "<|audio_end|>"
+   },
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null,
+   "image_token": "<|image_pad|>",
+   "audio_token": "<|audio_pad|>",
+   "video_token": "<|video_pad|>",
+   "vision_bos_token": "<|vision_start|>",
+   "vision_eos_token": "<|vision_end|>",
+   "audio_bos_token": "<|audio_start|>",
+   "audio_eos_token": "<|audio_end|>"
+ }
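
The special-token IDs that `config.json` refers to should resolve to entries in `added_tokens_decoder`. The sketch below performs that cross-check on values copied from the two files in this commit. The variable names are illustrative, and only the relevant subset of the decoder table is reproduced.

```python
# IDs named in config.json.
model_ids = {
    "assistant_token_id": 77091,
    "im_start_token_id": 151644,
    "im_end_token_id": 151645,
    "tts_pad_token_id": 151671,
    "tts_bos_token_id": 151672,
    "tts_eos_token_id": 151673,
}

# Matching subset of added_tokens_decoder from tokenizer_config.json.
added_tokens = {
    151644: "<|im_start|>",
    151645: "<|im_end|>",
    151671: "<tts_pad>",
    151672: "<tts_text_bos>",
    151673: "<tts_text_eod>",
}

# Map each config field to its token string, or None if it is not an added token.
resolved = {name: added_tokens.get(token_id) for name, token_id in model_ids.items()}
unresolved = sorted(name for name, content in resolved.items() if content is None)

for name, content in resolved.items():
    print(f"{name:>20} -> {content}")
print("not in added_tokens_decoder:", unresolved)
```

Only `assistant_token_id` is left unresolved. Its ID (77091) is below the added-token range, so it is presumably an ordinary vocabulary entry defined in `vocab.json`, which is not shown in this diff.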
trainer_state.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "epoch": 1,
+   "step_in_epoch": 0,
+   "global_step": 2500,
+   "num_epochs": 3,
+   "steps_in_epoch": 2500,
+   "gradient_accumulation_steps": 4,
+   "seed": 42,
+   "save_type": "epoch"
+ }
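
The fields of `trainer_state.json` are mutually consistent, which a few lines of arithmetic confirm. The sketch assumes `global_step` counts optimizer updates and that each update consumes `gradient_accumulation_steps` micro-batches. Both are common conventions, but the training script is not part of this commit, so treat them as assumptions.

```python
# Values copied from trainer_state.json above.
state = {
    "epoch": 1,
    "step_in_epoch": 0,
    "global_step": 2500,
    "num_epochs": 3,
    "steps_in_epoch": 2500,
    "gradient_accumulation_steps": 4,
}

# Planned length of the whole run, in optimizer steps.
total_steps = state["num_epochs"] * state["steps_in_epoch"]

# Fraction of the planned run completed at save time.
progress = state["global_step"] / total_steps

# Completed epochs and position inside the current one, derived from global_step.
epochs_done, step_in_epoch = divmod(state["global_step"], state["steps_in_epoch"])

# Assumed: micro-batches seen so far, at 4 per optimizer step.
micro_batches = state["global_step"] * state["gradient_accumulation_steps"]

print(total_steps, round(progress, 3), epochs_done, step_in_epoch, micro_batches)
```

The derived `(epochs_done, step_in_epoch)` pair matches the stored `epoch` and `step_in_epoch`, which suggests the snapshot was saved at the boundary after the first of three planned epochs.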