michael-chan-000 commited on
Commit
e3abe87
·
verified ·
1 Parent(s): b26c50a

Upload model

Browse files
README.md ADDED
@@ -0,0 +1,88 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ pipeline_tag: text-to-speech
4
+ library_name: qwen-tts
5
+ tags:
6
+ - audio
7
+ - tts
8
+ - qwen
9
+ - multilingual
10
+ ---
11
+
12
+ # Qwen3-TTS
13
+
14
+ <br>
15
+
16
+ <p align="center">
17
+ <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/qwen3_tts_logo.png" width="400"/>
18
+ <p>
19
+
20
+ <p align="center">
21
+ &nbsp&nbsp🤗 <a href="https://huggingface.co/collections/Qwen/qwen3-tts">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/collections/Qwen/Qwen3-TTS">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp📑 <a href="https://qwen.ai/blog?id=qwen3tts-0115">Blog</a>&nbsp&nbsp | &nbsp&nbsp📑 <a href="https://huggingface.co/papers/2601.15621">Paper</a>&nbsp&nbsp | &nbsp&nbsp💻 <a href="https://github.com/QwenLM/Qwen3-TTS">GitHub</a>
22
+ </p>
23
+
24
+ We release **Qwen3-TTS**, a series of powerful speech generation models developed by Qwen, offering comprehensive support for voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control.
25
+
26
+ ## Overview
27
+ Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) as well as multiple dialectal voice profiles. Key features:
28
+
29
+ * **Powerful Speech Representation**: Powered by the self-developed Qwen3-TTS-Tokenizer-12Hz, it achieves efficient acoustic compression and high-dimensional semantic modeling.
30
+ * **Universal End-to-End Architecture**: Utilizing a discrete multi-codebook LM architecture to bypass traditional information bottlenecks.
31
+ * **Extreme Low-Latency Streaming Generation**: Supports streaming generation with end-to-end synthesis latency as low as 97ms.
32
+ * **Intelligent Voice Control**: Supports speech generation driven by natural language instructions for flexible control over timbre, emotion, and prosody.
33
+
34
+ ## Quickstart
35
+
36
+ ### Environment Setup
37
+
38
+ Install the `qwen-tts` Python package from PyPI:
39
+
40
+ ```bash
41
+ pip install -U qwen-tts
42
+ ```
43
+
44
+ ### Python Package Usage
45
+
46
+ ```python
47
+ import torch
48
+ import soundfile as sf
49
+ from qwen_tts import Qwen3TTSModel
50
+
51
+ # Load the model
52
+ model = Qwen3TTSModel.from_pretrained(
53
+ "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
54
+ device_map="cuda:0",
55
+ dtype=torch.bfloat16,
56
+ attn_implementation="flash_attention_2",
57
+ )
58
+
59
+ # Custom Voice Generation
60
+ wavs, sr = model.generate_custom_voice(
61
+ text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",
62
+ language="Chinese",
63
+ speaker="Vivian",
64
+ instruct="用特别愤怒的语气说",
65
+ )
66
+ sf.write("output.wav", wavs[0], sr)
67
+ ```
68
+
69
+ ## Evaluation
70
+
71
+ Zero-shot speech generation on the Seed-TTS test set (Word Error Rate (WER, ↓)):
72
+
73
+ | Model | test-zh | test-en |
74
+ |---|---|---|
75
+ | Qwen3-TTS-12Hz-1.7B-Base | 0.77 | 1.24 |
76
+
77
+ ## Citation
78
+
79
+ If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝:
80
+
81
+ ```BibTeX
82
+ @article{Qwen3-TTS,
83
+ title={Qwen3-TTS Technical Report},
84
+ author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
85
+ journal={arXiv preprint arXiv:2601.15621},
86
+ year={2026}
87
+ }
88
+ ```
__pycache__/miner.cpython-312.pyc ADDED
Binary file (7.2 kB). View file
 
chute_config.yml ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Image + node + Chute for Vocence deploy. Required in the HF repo at build fred time.
2
+
3
+ Image:
4
+ from_base: parachutes/python:3.12
5
+ run_command:
6
+ - pip install torch torchaudio transformers accelerate huggingface_hub pyyaml soundfile librosa
7
+ - pip install -U qwen-tts
8
+ set_workdir: /app
9
+
10
+ NodeSelector:
11
+ gpu_count: 1
12
+ min_vram_gb_per_gpu: 24
13
+ include: ["pro_6000"]
14
+ exclude: []
15
+
16
+ Chute:
17
+ tagline: Vocence TTS — Qwen3 PromptTTS (weights in repo)
18
+ readme: Qwen3 12Hz TTS snapshot + miner.py for Vocence
19
+ shutdown_after_seconds: 86400
20
+ concurrency: 1
21
+ max_instances: 1
22
+ scaling_threshold: 0.5
23
+ tee: true
config.json ADDED
@@ -0,0 +1,165 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architectures": [
3
+ "Qwen3TTSForConditionalGeneration"
4
+ ],
5
+ "assistant_token_id": 77091,
6
+ "im_end_token_id": 151645,
7
+ "im_start_token_id": 151644,
8
+ "tts_bos_token_id": 151672,
9
+ "tts_eos_token_id": 151673,
10
+ "tts_pad_token_id": 151671,
11
+ "model_type": "qwen3_tts",
12
+ "tokenizer_type": "qwen3_tts_tokenizer_12hz",
13
+ "tts_model_size": "1b7",
14
+ "tts_model_type": "voice_design",
15
+ "talker_config": {
16
+ "attention_bias": false,
17
+ "attention_dropout": 0,
18
+ "code_predictor_config": {
19
+ "_name_or_path": "",
20
+ "add_cross_attention": false,
21
+ "architectures": null,
22
+ "attention_bias": false,
23
+ "attention_dropout": 0,
24
+ "bad_words_ids": null,
25
+ "begin_suppress_tokens": null,
26
+ "bos_token_id": null,
27
+ "chunk_size_feed_forward": 0,
28
+ "cross_attention_hidden_size": null,
29
+ "decoder_start_token_id": null,
30
+ "diversity_penalty": 0.0,
31
+ "do_sample": false,
32
+ "early_stopping": false,
33
+ "encoder_no_repeat_ngram_size": 0,
34
+ "eos_token_id": null,
35
+ "exponential_decay_length_penalty": null,
36
+ "finetuning_task": null,
37
+ "forced_bos_token_id": null,
38
+ "forced_eos_token_id": null,
39
+ "head_dim": 128,
40
+ "hidden_act": "silu",
41
+ "hidden_size": 1024,
42
+ "id2label": {
43
+ "0": "LABEL_0",
44
+ "1": "LABEL_1"
45
+ },
46
+ "initializer_range": 0.02,
47
+ "intermediate_size": 3072,
48
+ "is_decoder": false,
49
+ "is_encoder_decoder": false,
50
+ "label2id": {
51
+ "LABEL_0": 0,
52
+ "LABEL_1": 1
53
+ },
54
+ "layer_types": [
55
+ "full_attention",
56
+ "full_attention",
57
+ "full_attention",
58
+ "full_attention",
59
+ "full_attention"
60
+ ],
61
+ "length_penalty": 1.0,
62
+ "max_length": 20,
63
+ "max_position_embeddings": 65536,
64
+ "max_window_layers": 28,
65
+ "min_length": 0,
66
+ "model_type": "qwen3_tts_talker_code_predictor",
67
+ "no_repeat_ngram_size": 0,
68
+ "num_attention_heads": 16,
69
+ "num_beam_groups": 1,
70
+ "num_beams": 1,
71
+ "num_code_groups": 16,
72
+ "num_hidden_layers": 5,
73
+ "num_key_value_heads": 8,
74
+ "num_return_sequences": 1,
75
+ "output_attentions": false,
76
+ "output_hidden_states": false,
77
+ "output_scores": false,
78
+ "pad_token_id": null,
79
+ "prefix": null,
80
+ "problem_type": null,
81
+ "pruned_heads": {},
82
+ "remove_invalid_values": false,
83
+ "repetition_penalty": 1.0,
84
+ "return_dict": true,
85
+ "return_dict_in_generate": false,
86
+ "rms_norm_eps": 1e-06,
87
+ "rope_scaling": null,
88
+ "rope_theta": 1000000,
89
+ "sep_token_id": null,
90
+ "sliding_window": null,
91
+ "suppress_tokens": null,
92
+ "task_specific_params": null,
93
+ "temperature": 1.0,
94
+ "tf_legacy_loss": false,
95
+ "tie_encoder_decoder": false,
96
+ "tie_word_embeddings": false,
97
+ "tokenizer_class": null,
98
+ "top_k": 50,
99
+ "top_p": 1.0,
100
+ "dtype": null,
101
+ "torchscript": false,
102
+ "typical_p": 1.0,
103
+ "use_bfloat16": false,
104
+ "use_cache": true,
105
+ "use_sliding_window": false,
106
+ "vocab_size": 2048
107
+ },
108
+ "codec_bos_id": 2149,
109
+ "codec_eos_token_id": 2150,
110
+ "codec_think_id": 2154,
111
+ "codec_language_id": {
112
+ "chinese": 2055,
113
+ "english": 2050,
114
+ "german": 2053,
115
+ "italian": 2070,
116
+ "portuguese": 2071,
117
+ "spanish": 2054,
118
+ "japanese": 2058,
119
+ "korean": 2064,
120
+ "french": 2061,
121
+ "russian": 2069
122
+ },
123
+ "codec_nothink_id": 2155,
124
+ "codec_pad_id": 2148,
125
+ "codec_think_bos_id": 2156,
126
+ "codec_think_eos_id": 2157,
127
+ "spk_id": {
128
+ "data": 3000
129
+ },
130
+ "spk_is_dialect": {
131
+ "data": false
132
+ },
133
+ "head_dim": 128,
134
+ "hidden_act": "silu",
135
+ "hidden_size": 2048,
136
+ "initializer_range": 0.02,
137
+ "intermediate_size": 6144,
138
+ "max_position_embeddings": 32768,
139
+ "model_type": "qwen3_tts_talker",
140
+ "num_attention_heads": 16,
141
+ "num_code_groups": 16,
142
+ "num_hidden_layers": 28,
143
+ "num_key_value_heads": 8,
144
+ "position_id_per_seconds": 13,
145
+ "rms_norm_eps": 1e-06,
146
+ "rope_scaling": {
147
+ "interleaved": true,
148
+ "mrope_section": [
149
+ 24,
150
+ 20,
151
+ 20
152
+ ],
153
+ "rope_type": "default",
154
+ "type": "default"
155
+ },
156
+ "rope_theta": 1000000,
157
+ "sliding_window": null,
158
+ "text_hidden_size": 2048,
159
+ "text_vocab_size": 151936,
160
+ "use_cache": true,
161
+ "use_sliding_window": false,
162
+ "vocab_size": 3072
163
+ },
164
+ "transformers_version": "4.57.3"
165
+ }
generation_config.json ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "do_sample": true,
3
+ "repetition_penalty": 1.05,
4
+ "temperature": 0.9,
5
+ "top_p": 1.0,
6
+ "top_k": 50,
7
+ "subtalker_dosample": true,
8
+ "subtalker_temperature": 0.9,
9
+ "subtalker_top_p": 1.0,
10
+ "subtalker_top_k": 50,
11
+ "max_new_tokens": 8192
12
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
miner.py ADDED
@@ -0,0 +1,118 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
4
+ from pathlib import Path
5
+ from typing import Any
6
+
7
+ import numpy as np
8
+
9
+
10
+ VOCENCE_CONFIG = "vocence_config.yaml"
11
+ QWEN_ANCHOR = "config.json"
12
+ WARMUP_SECONDS = 180.0
13
+
14
+
15
+ def _load_yaml(path: Path) -> dict[str, Any]:
16
+ if not path.is_file():
17
+ return {}
18
+ from yaml import safe_load
19
+ with path.open("r", encoding="utf-8") as fh:
20
+ return safe_load(fh) or {}
21
+
22
+
23
+ def _select_device(prefer_cuda: bool):
24
+ import torch
25
+ has_cuda = torch.cuda.is_available()
26
+ device = "cuda:0" if (prefer_cuda and has_cuda) else "cpu"
27
+ return device, torch, has_cuda
28
+
29
+
30
+ def _select_dtype(torch_mod, want_bf16: bool, has_cuda: bool):
31
+ return torch_mod.bfloat16 if (want_bf16 and has_cuda) else torch_mod.float32
32
+
33
+
34
+ def _build_qwen(snapshot: Path, device: str, dtype: Any, attn: str):
35
+ from qwen_tts import Qwen3TTSModel
36
+ return Qwen3TTSModel.from_pretrained(
37
+ pretrained_model_name_or_path=str(snapshot),
38
+ device_map=device,
39
+ dtype=dtype,
40
+ attn_implementation=attn,
41
+ )
42
+
43
+
44
+ def _attn_order(prefer_flash: bool) -> tuple[str, ...]:
45
+ return ("flash_attention_2", "sdpa") if prefer_flash else ("sdpa",)
46
+
47
+
48
+ def _mono_pcm(arr: Any) -> np.ndarray:
49
+ wave = np.asarray(arr, dtype=np.float32)
50
+ return wave.mean(axis=1) if wave.ndim > 1 else wave
51
+
52
+
53
+ def _settings(snapshot: Path) -> dict[str, Any]:
54
+ raw = _load_yaml(snapshot / VOCENCE_CONFIG)
55
+ rt = raw.get("runtime") or {}
56
+ gen = raw.get("generation") or {}
57
+ lim = raw.get("limits") or {}
58
+ return {
59
+ "language": str(lim.get("default_language") or rt.get("default_language") or "English"),
60
+ "sample_rate": int(gen.get("sample_rate", 24000)),
61
+ "cap_instruct": int(lim.get("max_instruction_chars", 600)),
62
+ "cap_text": int(lim.get("max_text_chars", 2000)),
63
+ "prefer_cuda": str(rt.get("device_preference", "cuda")).lower() == "cuda",
64
+ "prefer_bf16": str(rt.get("dtype", "bfloat16")).lower() == "bfloat16",
65
+ "prefer_flash": bool(rt.get("use_flash_attention_2", False)),
66
+ }
67
+
68
+
69
+ class Miner:
70
+
71
+ def __init__(self, path_hf_repo: Path) -> None:
72
+ snapshot = Path(path_hf_repo).resolve()
73
+ if not (snapshot / QWEN_ANCHOR).is_file():
74
+ raise FileNotFoundError(f"snapshot missing {QWEN_ANCHOR}: {snapshot}")
75
+ self.snapshot = snapshot
76
+ self.cfg = _settings(snapshot)
77
+
78
+ device, torch_mod, has_cuda = _select_device(self.cfg["prefer_cuda"])
79
+ dtype = _select_dtype(torch_mod, self.cfg["prefer_bf16"], has_cuda)
80
+
81
+ last_err: BaseException | None = None
82
+ engine = None
83
+ for attn in _attn_order(self.cfg["prefer_flash"]):
84
+ try:
85
+ engine = _build_qwen(snapshot, device, dtype, attn)
86
+ tag = "bf16" if self.cfg["prefer_bf16"] and has_cuda else "fp32"
87
+ print(f"[Miner] qwen3-tts ready: device={device} dtype={tag} attn={attn}")
88
+ break
89
+ except Exception as exc:
90
+ last_err = exc
91
+ if engine is None:
92
+ raise RuntimeError(f"qwen3-tts load failed: {last_err!r}")
93
+ self.engine = engine
94
+
95
+ def __repr__(self) -> str:
96
+ return f"<Miner snapshot={self.snapshot.name} lang={self.cfg['language']!r}>"
97
+
98
+ def warmup(self) -> None:
99
+ with ThreadPoolExecutor(max_workers=1) as pool:
100
+ future = pool.submit(self.generate_wav, "Neutral voice.", "Warmup phrase.")
101
+ try:
102
+ future.result(timeout=WARMUP_SECONDS)
103
+ except FutureTimeout:
104
+ raise RuntimeError(f"Miner warmup exceeded {WARMUP_SECONDS}s")
105
+
106
+ def generate_wav(self, instruction: str, text: str) -> tuple[np.ndarray, int]:
107
+ cap_i = self.cfg["cap_instruct"]
108
+ cap_t = self.cfg["cap_text"]
109
+ prompt = instruction[:cap_i] if cap_i > 0 else instruction
110
+ body = text[:cap_t] if cap_t > 0 else text
111
+ wavs, sr = self.engine.generate_voice_design(
112
+ text=body,
113
+ instruct=prompt,
114
+ language=self.cfg["language"],
115
+ )
116
+ if not wavs or wavs[0] is None:
117
+ raise ValueError("qwen3-tts returned no audio")
118
+ return _mono_pcm(wavs[0]), int(sr)
model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b5ce24aab3afac1df69bc2b2807140ff90a082d98782c3d3126ab3a4ed5bc888
3
+ size 3833402520
preprocessor_config.json ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ {
2
+ "padding_side": "left",
3
+ "padding_value": 0.0,
4
+ "processor_class": "Qwen3TTSProcessor",
5
+ "return_attention_mask": true
6
+ }
speech_tokenizer/config.json ADDED
@@ -0,0 +1,94 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architectures": [
3
+ "Qwen3TTSTokenizerV2Model"
4
+ ],
5
+ "model_type": "qwen3_tts_tokenizer_12hz",
6
+ "encoder_valid_num_quantizers": 16,
7
+ "input_sample_rate": 24000,
8
+ "output_sample_rate": 24000,
9
+ "decode_upsample_rate": 1920,
10
+ "encode_downsample_rate": 1920,
11
+ "decoder_config": {
12
+ "attention_bias": false,
13
+ "attention_dropout": 0.0,
14
+ "latent_dim": 1024,
15
+ "codebook_dim": 512,
16
+ "codebook_size": 2048,
17
+ "decoder_dim": 1536,
18
+ "hidden_act": "silu",
19
+ "hidden_size": 512,
20
+ "intermediate_size": 1024,
21
+ "layer_scale_initial_scale": 0.01,
22
+ "max_position_embeddings": 8000,
23
+ "head_dim": 64,
24
+ "num_attention_heads": 16,
25
+ "num_hidden_layers": 8,
26
+ "num_key_value_heads": 16,
27
+ "num_quantizers": 16,
28
+ "num_semantic_quantizers": 1,
29
+ "rms_norm_eps": 1e-05,
30
+ "rope_theta": 10000,
31
+ "semantic_codebook_size": 4096,
32
+ "sliding_window": 72,
33
+ "upsample_rates": [
34
+ 8,
35
+ 5,
36
+ 4,
37
+ 3
38
+ ],
39
+ "upsampling_ratios": [
40
+ 2,
41
+ 2
42
+ ],
43
+ "vector_quantization_hidden_dimension": 512
44
+ },
45
+ "encoder_config": {
46
+ "_frame_rate": 12.5,
47
+ "attention_bias": false,
48
+ "attention_dropout": 0.0,
49
+ "audio_channels": 1,
50
+ "codebook_dim": 256,
51
+ "codebook_size": 2048,
52
+ "compress": 2,
53
+ "dilation_growth_rate": 2,
54
+ "dtype": "float32",
55
+ "head_dim": 64,
56
+ "hidden_act": "gelu",
57
+ "hidden_size": 512,
58
+ "initializer_range": 0.02,
59
+ "intermediate_size": 2048,
60
+ "kernel_size": 7,
61
+ "last_kernel_size": 3,
62
+ "layer_scale_initial_scale": 0.01,
63
+ "max_position_embeddings": 8000,
64
+ "norm_eps": 1e-05,
65
+ "normalize": false,
66
+ "num_attention_heads": 8,
67
+ "num_filters": 64,
68
+ "num_hidden_layers": 8,
69
+ "num_key_value_heads": 8,
70
+ "num_quantizers": 32,
71
+ "num_residual_layers": 1,
72
+ "num_semantic_quantizers": 1,
73
+ "pad_mode": "constant",
74
+ "residual_kernel_size": 3,
75
+ "rope_theta": 10000.0,
76
+ "sampling_rate": 24000,
77
+ "sliding_window": 250,
78
+ "transformers_version": "4.57.0.dev0",
79
+ "trim_right_ratio": 1.0,
80
+ "upsample_groups": 512,
81
+ "upsampling_ratios": [
82
+ 8,
83
+ 6,
84
+ 5,
85
+ 4
86
+ ],
87
+ "use_cache": false,
88
+ "use_causal_conv": true,
89
+ "use_conv_shortcut": false,
90
+ "use_streaming": false,
91
+ "vector_quantization_hidden_dimension": 256
92
+ },
93
+ "transformers_version": "4.57.3"
94
+ }
speech_tokenizer/configuration.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"framework": "pytorch", "task": "feature-extraction", "allow_remote": true}
speech_tokenizer/model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:836b7b357f5ea43e889936a3709af68dfe3751881acefe4ecf0dbd30ba571258
3
+ size 682293092
speech_tokenizer/preprocessor_config.json ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "chunk_length_s": null,
3
+ "feature_extractor_type": "EncodecFeatureExtractor",
4
+ "feature_size": 1,
5
+ "overlap": null,
6
+ "padding_side": "right",
7
+ "padding_value": 0.0,
8
+ "return_attention_mask": true,
9
+ "sampling_rate": 24000
10
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,316 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<tool_response>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "151666": {
190
+ "content": "</tool_response>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": false
196
+ },
197
+ "151667": {
198
+ "content": "<think>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": false
204
+ },
205
+ "151668": {
206
+ "content": "</think>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": false
212
+ },
213
+ "151669": {
214
+ "content": "<|audio_start|>",
215
+ "lstrip": false,
216
+ "normalized": false,
217
+ "rstrip": false,
218
+ "single_word": false,
219
+ "special": true
220
+ },
221
+ "151670": {
222
+ "content": "<|audio_end|>",
223
+ "lstrip": false,
224
+ "normalized": false,
225
+ "rstrip": false,
226
+ "single_word": false,
227
+ "special": true
228
+ },
229
+ "151671": {
230
+ "content": "<tts_pad>",
231
+ "lstrip": false,
232
+ "normalized": false,
233
+ "rstrip": false,
234
+ "single_word": false,
235
+ "special": true
236
+ },
237
+ "151672": {
238
+ "content": "<tts_text_bos>",
239
+ "lstrip": false,
240
+ "normalized": false,
241
+ "rstrip": false,
242
+ "single_word": false,
243
+ "special": true
244
+ },
245
+ "151673": {
246
+ "content": "<tts_text_eod>",
247
+ "lstrip": false,
248
+ "normalized": false,
249
+ "rstrip": false,
250
+ "single_word": false,
251
+ "special": true
252
+ },
253
+ "151674": {
254
+ "content": "<tts_text_bos_single>",
255
+ "lstrip": false,
256
+ "normalized": false,
257
+ "rstrip": false,
258
+ "single_word": false,
259
+ "special": true
260
+ },
261
+ "151675": {
262
+ "content": "<|audio_pad|>",
263
+ "lstrip": false,
264
+ "normalized": false,
265
+ "rstrip": false,
266
+ "single_word": false,
267
+ "special": true
268
+ }
269
+ },
270
+ "additional_special_tokens": [
271
+ "<|im_start|>",
272
+ "<|im_end|>",
273
+ "<|object_ref_start|>",
274
+ "<|object_ref_end|>",
275
+ "<|box_start|>",
276
+ "<|box_end|>",
277
+ "<|quad_start|>",
278
+ "<|quad_end|>",
279
+ "<|vision_start|>",
280
+ "<|vision_end|>",
281
+ "<|vision_pad|>",
282
+ "<|image_pad|>",
283
+ "<|video_pad|>",
284
+ "<|audio_start|>",
285
+ "<|audio_end|>",
286
+ "<tts_pad>",
287
+ "<tts_text_bos>",
288
+ "<tts_text_bos_single>",
289
+ "<|audio_pad|>"
290
+ ],
291
+ "extra_special_tokens": {
292
+ "image_token": "<|image_pad|>",
293
+ "audio_token": "<|audio_pad|>",
294
+ "video_token": "<|video_pad|>",
295
+ "vision_bos_token": "<|vision_start|>",
296
+ "vision_eos_token": "<|vision_end|>",
297
+ "audio_bos_token": "<|audio_start|>",
298
+ "audio_eos_token": "<|audio_end|>"
299
+ },
300
+ "bos_token": null,
301
+ "clean_up_tokenization_spaces": false,
302
+ "eos_token": "<|im_end|>",
303
+ "errors": "replace",
304
+ "model_max_length": 131072,
305
+ "pad_token": "<|endoftext|>",
306
+ "split_special_tokens": false,
307
+ "tokenizer_class": "Qwen2Tokenizer",
308
+ "unk_token": null,
309
+ "image_token": "<|image_pad|>",
310
+ "audio_token": "<|audio_pad|>",
311
+ "video_token": "<|video_pad|>",
312
+ "vision_bos_token": "<|vision_start|>",
313
+ "vision_eos_token": "<|vision_end|>",
314
+ "audio_bos_token": "<|audio_start|>",
315
+ "audio_eos_token": "<|audio_end|>"
316
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
vocence_config.yaml ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Miner + /health metadata. Weights live in this HF repo (no runtime model_id).
2
+ runtime:
3
+ adapter: "qwen3_tts_repo_snapshot"
4
+ device_preference: "cuda"
5
+ dtype: "bfloat16"
6
+ default_language: "English"
7
+ use_flash_attention_2: false
8
+
9
+ generation:
10
+ sample_rate: 24000
11
+ max_seconds: 30
12
+
13
+ limits:
14
+ max_text_chars: 2000
15
+ max_instruction_chars: 600
16
+ default_language: "English"