Upload folder using huggingface_hub
- README.md +88 -0
- __pycache__/miner.cpython-313.pyc +0 -0
- chute_config.yml +27 -0
- config.json +163 -0
- generation_config.json +12 -0
- merges.txt +0 -0
- miner.py +160 -0
- model.safetensors +3 -0
- preprocessor_config.json +6 -0
- speech_tokenizer/config.json +94 -0
- speech_tokenizer/configuration.json +1 -0
- speech_tokenizer/model.safetensors +3 -0
- speech_tokenizer/preprocessor_config.json +10 -0
- tokenizer_config.json +316 -0
- vocab.json +0 -0
- vocence_config.yaml +26 -0
README.md
ADDED
|
@@ -0,0 +1,88 @@
---
license: apache-2.0
pipeline_tag: text-to-speech
library_name: qwen-tts
tags:
- audio
- tts
- qwen
- multilingual
---

# Qwen3-TTS

<br>

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/qwen3_tts_logo.png" width="400"/>
</p>

<p align="center">
🤗 <a href="https://huggingface.co/collections/Qwen/qwen3-tts">Hugging Face</a> | 🤖 <a href="https://modelscope.cn/collections/Qwen/Qwen3-TTS">ModelScope</a> | 📑 <a href="https://qwen.ai/blog?id=qwen3tts-0115">Blog</a> | 📑 <a href="https://huggingface.co/papers/2601.15621">Paper</a> | 💻 <a href="https://github.com/QwenLM/Qwen3-TTS">GitHub</a>
</p>

We release **Qwen3-TTS**, a series of speech generation models developed by Qwen that support voice cloning, voice design, high-quality human-like speech generation, and natural-language voice control.

## Overview
Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) as well as multiple dialectal voice profiles. Key features:

* **Powerful Speech Representation**: Powered by the self-developed Qwen3-TTS-Tokenizer-12Hz, it achieves efficient acoustic compression and high-dimensional semantic modeling.
* **Universal End-to-End Architecture**: Uses a discrete multi-codebook LM architecture to bypass traditional information bottlenecks.
* **Extreme Low-Latency Streaming Generation**: Supports streaming generation with end-to-end synthesis latency as low as 97ms.
* **Intelligent Voice Control**: Supports speech generation driven by natural language instructions for flexible control over timbre, emotion, and prosody.

## Quickstart

### Environment Setup

Install the `qwen-tts` Python package from PyPI:

```bash
pip install -U qwen-tts
```

### Python Package Usage

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load the model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Custom Voice Generation
wavs, sr = model.generate_custom_voice(
    # English gloss: "Actually, I've really noticed that I'm someone who is
    # especially good at observing other people's emotions."
    text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",
    language="Chinese",
    speaker="Vivian",
    # English gloss: "Say it in a very angry tone."
    instruct="用特别愤怒的语气说",
)
sf.write("output.wav", wavs[0], sr)
```

## Evaluation

Zero-shot speech generation on the Seed-TTS test set (word error rate, WER; lower is better):

| Model | test-zh | test-en |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base | 0.77 | 1.24 |

## Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝:

```BibTeX
@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}
```
__pycache__/miner.cpython-313.pyc
ADDED
Binary file (8.3 kB).
chute_config.yml
ADDED
|
@@ -0,0 +1,27 @@
# Image + node + Chute for Vocence deploy. Required in the HF repo at build time.
# Keep deps minimal: anything imported by miner.py must not be in the validator's
# banned list (requests/urllib/httpx/aiohttp/socket/huggingface_hub/importlib/torch.hub).
# huggingface_hub is intentionally NOT installed so an accidental future import
# in miner.py fails fast at runtime instead of silently being available.

Image:
  from_base: parachutes/base-python:3.12.9
  run_command:
    - pip install torch torchaudio transformers==4.57.3 accelerate pyyaml soundfile
    - pip install -U qwen-tts
  set_workdir: /app

NodeSelector:
  gpu_count: 1
  min_vram_gb_per_gpu: 32
  include: ["pro_6000"]
  exclude: []

Chute:
  tagline: vocence qwen3-tts miner
  readme: vocence chute serving qwen3-tts via miner.py (weights pinned in repo)
  shutdown_after_seconds: 86400
  concurrency: 1
  max_instances: 1
  scaling_threshold: 0.5
  tee: true
|
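The header comment of chute_config.yml says that nothing imported by miner.py may appear on the validator's banned list. A minimal sketch of how such a check could be run locally is shown here. It is not the validator's actual implementation: the function name `find_banned_imports` is hypothetical, only the module names come from the comment above, and dynamic imports such as `__import__("socket")` are not detected.

```python
import ast

# Banned modules, copied from the comment in chute_config.yml.
BANNED = {
    "requests", "urllib", "httpx", "aiohttp", "socket",
    "huggingface_hub", "importlib", "torch.hub",
}


def find_banned_imports(source: str, banned: set[str] = BANNED) -> list[str]:
    """Return the sorted banned module names that `source` imports statically."""
    hits: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from torch import hub" should count as torch.hub as well.
            names = [node.module] + [f"{node.module}.{a.name}" for a in node.names]
        else:
            continue
        for name in names:
            parts = name.split(".")
            # Match the module itself or any banned prefix (e.g. urllib.request).
            for i in range(1, len(parts) + 1):
                prefix = ".".join(parts[:i])
                if prefix in banned:
                    hits.add(prefix)
    return sorted(hits)
```

Calling `find_banned_imports(Path("miner.py").read_text())` before uploading would give an early warning; it only inspects the file it is given, not the packages that file imports.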
config.json
ADDED
|
@@ -0,0 +1,163 @@
{
  "architectures": [
    "Qwen3TTSForConditionalGeneration"
  ],
  "assistant_token_id": 77091,
  "im_end_token_id": 151645,
  "im_start_token_id": 151644,
  "tts_bos_token_id": 151672,
  "tts_eos_token_id": 151673,
  "tts_pad_token_id": 151671,
  "model_type": "qwen3_tts",
  "tokenizer_type": "qwen3_tts_tokenizer_12hz",
  "tts_model_size": "1b7",
  "tts_model_type": "voice_design",
  "talker_config": {
    "attention_bias": false,
    "attention_dropout": 0,
    "code_predictor_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "attention_bias": false,
      "attention_dropout": 0,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "head_dim": 128,
      "hidden_act": "silu",
      "hidden_size": 1024,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "layer_types": [
        "full_attention",
        "full_attention",
        "full_attention",
        "full_attention",
        "full_attention"
      ],
      "length_penalty": 1.0,
      "max_length": 20,
      "max_position_embeddings": 65536,
      "max_window_layers": 28,
      "min_length": 0,
      "model_type": "qwen3_tts_talker_code_predictor",
      "no_repeat_ngram_size": 0,
      "num_attention_heads": 16,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_code_groups": 16,
      "num_hidden_layers": 5,
      "num_key_value_heads": 8,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "pruned_heads": {},
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "rms_norm_eps": 1e-06,
      "rope_scaling": null,
      "rope_theta": 1000000,
      "sep_token_id": null,
      "sliding_window": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": false,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "dtype": null,
      "torchscript": false,
      "typical_p": 1.0,
      "use_bfloat16": false,
      "use_cache": true,
      "use_sliding_window": false,
      "vocab_size": 2048
    },
    "codec_bos_id": 2149,
    "codec_eos_token_id": 2150,
    "codec_think_id": 2154,
    "codec_language_id": {
      "chinese": 2055,
      "english": 2050,
      "german": 2053,
      "italian": 2070,
      "portuguese": 2071,
      "spanish": 2054,
      "japanese": 2058,
      "korean": 2064,
      "french": 2061,
      "russian": 2069
    },
    "codec_nothink_id": 2155,
    "codec_pad_id": 2148,
    "codec_think_bos_id": 2156,
    "codec_think_eos_id": 2157,
    "spk_id": {},
    "spk_is_dialect": {},
    "head_dim": 128,
    "hidden_act": "silu",
    "hidden_size": 2048,
    "initializer_range": 0.02,
    "intermediate_size": 6144,
    "max_position_embeddings": 32768,
    "model_type": "qwen3_tts_talker",
    "num_attention_heads": 16,
    "num_code_groups": 16,
    "num_hidden_layers": 28,
    "num_key_value_heads": 8,
    "position_id_per_seconds": 13,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {
      "interleaved": true,
      "mrope_section": [
        24,
        20,
        20
      ],
      "rope_type": "default",
      "type": "default"
    },
    "rope_theta": 1000000,
    "sliding_window": null,
    "text_hidden_size": 2048,
    "text_vocab_size": 151936,
    "use_cache": true,
    "use_sliding_window": false,
    "vocab_size": 3072
  },
  "transformers_version": "4.57.3"
}
|
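The `codec_language_id` table in config.json maps lowercase language names to token ids inside the talker vocabulary. As an illustration only, a lookup over that table might look like the sketch below. The helper name `language_token` and the case-insensitive matching are assumptions, not the qwen-tts implementation, and reading ids at or above the 2048-entry codebook as control tokens is an inference from the numbers in the config rather than a documented fact.

```python
# Values copied from talker_config in config.json.
CODEC_LANGUAGE_ID = {
    "chinese": 2055, "english": 2050, "german": 2053, "italian": 2070,
    "portuguese": 2071, "spanish": 2054, "japanese": 2058, "korean": 2064,
    "french": 2061, "russian": 2069,
}
CODEBOOK_SIZE = 2048       # code_predictor_config.vocab_size
TALKER_VOCAB_SIZE = 3072   # talker_config.vocab_size


def language_token(language: str) -> int:
    """Map a language name such as "English" to its codec token id.

    Hypothetical helper: the lookup is case-insensitive here because miner.py
    passes "English" while the config keys are lowercase.
    """
    key = language.strip().lower()
    if key not in CODEC_LANGUAGE_ID:
        raise ValueError(f"unsupported language: {language!r}")
    return CODEC_LANGUAGE_ID[key]


def is_control_id(token_id: int) -> bool:
    """True when an id lies above the codebook range but inside the talker vocab."""
    return CODEBOOK_SIZE <= token_id < TALKER_VOCAB_SIZE
```

Under this reading, every language id and every `codec_*` special id in the config (2148 through 2157) falls in the control range.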
generation_config.json
ADDED
|
@@ -0,0 +1,12 @@
{
  "do_sample": true,
  "repetition_penalty": 1.05,
  "temperature": 0.9,
  "top_p": 1.0,
  "top_k": 50,
  "subtalker_dosample": true,
  "subtalker_temperature": 0.9,
  "subtalker_top_p": 1.0,
  "subtalker_top_k": 50,
  "max_new_tokens": 8192
}
|
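To make the sampling fields in generation_config.json concrete, here is a standard-library sketch of how `repetition_penalty`, `temperature`, `top_k`, and `top_p` are commonly combined when picking one token from a logit vector. It follows the usual order (penalty, temperature, top-k, then nucleus filtering) and is not taken from the qwen-tts code; the function name `sample_next` and its signature are hypothetical.

```python
import math
import random


def sample_next(logits, history, *, temperature=0.9, top_k=50, top_p=1.0,
                repetition_penalty=1.05, rng=random):
    """Sample one token index from `logits` using the generation_config defaults."""
    if not logits:
        raise ValueError("logits must not be empty")

    # Repetition penalty: make tokens that already appeared less likely.
    adjusted = list(logits)
    for tok in set(history):
        value = adjusted[tok]
        adjusted[tok] = value / repetition_penalty if value > 0 else value * repetition_penalty

    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [value / temperature for value in adjusted]

    # Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = order[:top_k] if top_k > 0 else order

    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[keep[0]]
    weights = [math.exp(scaled[i] - peak) for i in keep]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p < 1.0:
        cumulative = 0.0
        cut = len(keep)
        for n, p in enumerate(probs):
            cumulative += p
            if cumulative >= top_p:
                cut = n + 1
                break
        keep, probs = keep[:cut], probs[:cut]

    return rng.choices(keep, weights=probs, k=1)[0]
```

With the values in this file, `top_p` of 1.0 disables nucleus filtering, so only the top 50 candidates and the mild 1.05 repetition penalty shape the draw.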
merges.txt
ADDED
The diff for this file is too large to render.
miner.py
ADDED
|
@@ -0,0 +1,160 @@
"""
Vocence TTS engine: Qwen3 12Hz checkpoint in the HF repo snapshot.

The chute snapshot is the only weight source: nothing is pulled from an external
model id at inference time. Optional vocence_config.yaml tweaks device, dtype,
attention, and language defaults.

Model load: Miner.__init__ -> _instantiate_qwen() -> Qwen3TTSModel.from_pretrained(repo_path).

Contract (Vocence):
    Miner(path_hf_repo: Path)
    warmup() -> None
    generate_wav(instruction: str, text: str) -> tuple[np.ndarray, int]
"""
from __future__ import annotations

import threading
from pathlib import Path
from typing import Any, Mapping

import numpy as np

_CONFIG_NAME = "config.json"
_VOCENCE_YAML = "vocence_config.yaml"


def _merge_vocence_yaml(repo: Path) -> dict[str, Any]:
    path = repo / _VOCENCE_YAML
    if not path.is_file():
        return {}
    from yaml import safe_load

    with path.open("r", encoding="utf-8") as fh:
        data = safe_load(fh)
    return data if isinstance(data, Mapping) else {}


def _ensure_repo_checkpoint(repo: Path) -> Path:
    repo = repo.resolve()
    marker = repo / _CONFIG_NAME
    if not marker.is_file():
        raise FileNotFoundError(
            f"Model snapshot incomplete: {marker} missing. "
            "Host the full Qwen3-TTS weights (checkpoint + tokenizers) in this repository."
        )
    return repo


def _resolve_compute_device(prefer_cuda: bool) -> str:
    import torch

    if prefer_cuda and torch.cuda.is_available():
        return "cuda:0"
    return "cpu"


def _resolve_torch_dtype(torch, prefer_bf16: bool):
    if prefer_bf16 and torch.cuda.is_available():
        return torch.bfloat16
    return torch.float32


def _instantiate_qwen(checkpoint_dir: str, device_map: str, torch_dtype, use_flash2: bool):
    """Load Qwen3TTSModel weights from the local repo directory (HF snapshot path)."""
    from qwen_tts import Qwen3TTSModel

    attn = "flash_attention_2" if use_flash2 else "sdpa"
    common = dict(
        pretrained_model_name_or_path=checkpoint_dir,
        device_map=device_map,
        dtype=torch_dtype,
        attn_implementation=attn,
    )
    try:
        return Qwen3TTSModel.from_pretrained(**common)
    except Exception:
        common["attn_implementation"] = "sdpa"
        return Qwen3TTSModel.from_pretrained(**common)


def _to_mono_f32(segment: np.ndarray) -> np.ndarray:
    x = np.asarray(segment, dtype=np.float32)
    if x.ndim > 1:
        x = x.mean(axis=1)
    return x


class Miner:
    """
    Loads the checkpoint from the Hugging Face repo directory Chutes downloaded.
    Synthesis uses natural-language instruction + text (qwen-tts API).
    """

    def __init__(self, path_hf_repo: Path) -> None:
        self._root = _ensure_repo_checkpoint(Path(path_hf_repo))
        self._cfg = _merge_vocence_yaml(self._root)
        rt = self._cfg.get("runtime") or {}
        gen = self._cfg.get("generation") or {}
        lim = self._cfg.get("limits") or {}

        self._language = str(lim.get("default_language") or rt.get("default_language", "English"))
        self._output_sr = int(gen.get("sample_rate", 24000))
        self._cap_instruction = int(lim.get("max_instruction_chars", 600))
        self._cap_text = int(lim.get("max_text_chars", 2000))

        prefer_cuda = str(rt.get("device_preference", "cuda")).lower() == "cuda"
        want_bf16 = str(rt.get("dtype", "bfloat16")).lower() == "bfloat16"
        flash = bool(rt.get("use_flash_attention_2", False))

        import torch

        device_map = _resolve_compute_device(prefer_cuda)
        torch_dtype = _resolve_torch_dtype(torch, want_bf16)
        ckpt = str(self._root)

        self._tts = _instantiate_qwen(ckpt, device_map, torch_dtype, flash)
        # Qwen3TTSModel is a thin wrapper, not nn.Module, so there is no .eval()
        print("Qwen3-TTS checkpoint ready (loaded from repo snapshot).")

    def __repr__(self) -> str:
        return "Miner(qwen3-tts-local, local_snapshot=True)"

    def warmup(self) -> None:
        """Force one cheap synthesis on a background thread (startup SLAs)."""
        status: dict[str, object] = {"done": False, "error": None}

        def _once() -> None:
            try:
                self.generate_wav(
                    instruction="Clear, neutral delivery.",
                    text="Warmup.",
                )
                status["done"] = True
            except Exception as exc:  # noqa: BLE001 (surface to host)
                status["error"] = str(exc)

        worker = threading.Thread(target=_once, daemon=True)
        worker.start()
        worker.join(timeout=180.0)
        if not status["done"]:
            raise RuntimeError(status["error"] or "warmup exceeded 180s")

    def generate_wav(self, instruction: str, text: str) -> tuple[np.ndarray, int]:
        if self._cap_instruction > 0:
            instruction = instruction[: self._cap_instruction]
        if self._cap_text > 0:
            text = text[: self._cap_text]

        # Upstream qwen-tts method name (instruct + text -> waveform).
        waves, sr = self._tts.generate_voice_design(
            text=text,
            language=self._language,
            instruct=instruction,
        )
        if not waves:
            raise ValueError("TTS generation returned no audio")
        first = waves[0]
        if first is None:
            raise ValueError("TTS generation returned empty channel")
        return _to_mono_f32(first), int(sr)
|
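`Miner.warmup()` in miner.py above runs one synthesis on a daemon thread and turns either an exception or a timeout into a `RuntimeError`. The same pattern, pulled out into a standalone helper so it can be tried without a GPU or the model, might look like this. The name `run_with_timeout` is hypothetical and the helper is a restatement of the pattern, not code from the repository.

```python
import threading
import time


def run_with_timeout(fn, timeout_s: float) -> None:
    """Run `fn` on a daemon thread; raise RuntimeError on failure or timeout."""
    status = {"done": False, "error": None}

    def _once() -> None:
        try:
            fn()
            status["done"] = True
        except Exception as exc:  # report the failure to the caller's thread
            status["error"] = str(exc)

    worker = threading.Thread(target=_once, daemon=True)
    worker.start()
    worker.join(timeout=timeout_s)  # returns after timeout_s even if fn is still running
    if not status["done"]:
        raise RuntimeError(status["error"] or f"exceeded {timeout_s}s")


def slow_task() -> None:
    """Stand-in for a synthesis call that takes too long."""
    time.sleep(0.5)
```

One property of this design is worth knowing: after a timeout the worker thread is not cancelled, it keeps running in the background until it finishes or the process exits, which is why it is marked as a daemon.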
model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ea84fb35f7cd71ed436cdb410f0304fb4d7b5971db7d90c7ad6ba7b40a60fec9
size 3833402552
|
preprocessor_config.json
ADDED
|
@@ -0,0 +1,6 @@
{
  "padding_side": "left",
  "padding_value": 0.0,
  "processor_class": "Qwen3TTSProcessor",
  "return_attention_mask": true
}
|
speech_tokenizer/config.json
ADDED
|
@@ -0,0 +1,94 @@
{
  "architectures": [
    "Qwen3TTSTokenizerV2Model"
  ],
  "model_type": "qwen3_tts_tokenizer_12hz",
  "encoder_valid_num_quantizers": 16,
  "input_sample_rate": 24000,
  "output_sample_rate": 24000,
  "decode_upsample_rate": 1920,
  "encode_downsample_rate": 1920,
  "decoder_config": {
    "attention_bias": false,
    "attention_dropout": 0.0,
    "latent_dim": 1024,
    "codebook_dim": 512,
    "codebook_size": 2048,
    "decoder_dim": 1536,
    "hidden_act": "silu",
    "hidden_size": 512,
    "intermediate_size": 1024,
    "layer_scale_initial_scale": 0.01,
    "max_position_embeddings": 8000,
    "head_dim": 64,
    "num_attention_heads": 16,
    "num_hidden_layers": 8,
    "num_key_value_heads": 16,
    "num_quantizers": 16,
    "num_semantic_quantizers": 1,
    "rms_norm_eps": 1e-05,
    "rope_theta": 10000,
    "semantic_codebook_size": 4096,
    "sliding_window": 72,
    "upsample_rates": [
      8,
      5,
      4,
      3
    ],
    "upsampling_ratios": [
      2,
      2
    ],
    "vector_quantization_hidden_dimension": 512
  },
  "encoder_config": {
    "_frame_rate": 12.5,
    "attention_bias": false,
    "attention_dropout": 0.0,
    "audio_channels": 1,
    "codebook_dim": 256,
    "codebook_size": 2048,
    "compress": 2,
    "dilation_growth_rate": 2,
    "dtype": "float32",
    "head_dim": 64,
    "hidden_act": "gelu",
    "hidden_size": 512,
    "initializer_range": 0.02,
    "intermediate_size": 2048,
    "kernel_size": 7,
    "last_kernel_size": 3,
    "layer_scale_initial_scale": 0.01,
    "max_position_embeddings": 8000,
    "norm_eps": 1e-05,
    "normalize": false,
    "num_attention_heads": 8,
    "num_filters": 64,
    "num_hidden_layers": 8,
    "num_key_value_heads": 8,
    "num_quantizers": 32,
    "num_residual_layers": 1,
    "num_semantic_quantizers": 1,
    "pad_mode": "constant",
    "residual_kernel_size": 3,
    "rope_theta": 10000.0,
    "sampling_rate": 24000,
    "sliding_window": 250,
    "transformers_version": "4.57.0.dev0",
    "trim_right_ratio": 1.0,
    "upsample_groups": 512,
    "upsampling_ratios": [
      8,
      6,
      5,
      4
    ],
    "use_cache": false,
    "use_causal_conv": true,
    "use_conv_shortcut": false,
    "use_streaming": false,
    "vector_quantization_hidden_dimension": 256
  },
  "transformers_version": "4.57.3"
}
|
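The "12Hz" in the tokenizer name can be checked against the numbers in speech_tokenizer/config.json: 24000 samples per second divided by a hop of 1920 samples gives 12.5 codec frames per second, matching `_frame_rate`. The sketch below works through that arithmetic. The helper names are hypothetical; that the decoder reaches 1920 by chaining `upsample_rates` and `upsampling_ratios` is an assumption (their product simply matches), and the duration estimate assumes one generated step per codec frame.

```python
from math import prod

# Values copied from speech_tokenizer/config.json.
SAMPLE_RATE = 24_000
DECODE_UPSAMPLE_RATE = 1_920            # audio samples produced per codec frame
DECODER_UPSAMPLE_RATES = [8, 5, 4, 3]
DECODER_UPSAMPLING_RATIOS = [2, 2]


def frames_per_second(sample_rate: int = SAMPLE_RATE,
                      hop: int = DECODE_UPSAMPLE_RATE) -> float:
    """Codec frames per second of audio."""
    return sample_rate / hop


def samples_for_frames(n_frames: int, hop: int = DECODE_UPSAMPLE_RATE) -> int:
    """Number of audio samples decoded from n_frames codec frames."""
    return n_frames * hop


def seconds_for_frames(n_frames: int) -> float:
    """Audio duration in seconds for n_frames codec frames."""
    return samples_for_frames(n_frames) / SAMPLE_RATE


# 8 * 5 * 4 * 3 = 480, times 2 * 2 = 4, gives the 1920-sample hop.
TOTAL_UPSAMPLE = prod(DECODER_UPSAMPLE_RATES) * prod(DECODER_UPSAMPLING_RATIOS)
```

Under the one-step-per-frame assumption, the `max_new_tokens` value of 8192 in generation_config.json would cap a single generation at `seconds_for_frames(8192)`, roughly 655 seconds of audio.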
speech_tokenizer/configuration.json
ADDED
|
@@ -0,0 +1 @@
{"framework": "pytorch", "task": "feature-extraction", "allow_remote": true}
|
speech_tokenizer/model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3f1ba55a1bb6328b40da379ffa85705fe9d9e9bdf362c474d706fa5d039ee0fc
size 682293124
|
speech_tokenizer/preprocessor_config.json
ADDED
|
@@ -0,0 +1,10 @@
{
  "chunk_length_s": null,
  "feature_extractor_type": "EncodecFeatureExtractor",
  "feature_size": 1,
  "overlap": null,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 24000
}
|
tokenizer_config.json
ADDED
|
@@ -0,0 +1,316 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151646": {
      "content": "<|object_ref_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151647": {
      "content": "<|object_ref_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151648": {
      "content": "<|box_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151649": {
      "content": "<|box_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151650": {
      "content": "<|quad_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151651": {
      "content": "<|quad_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151652": {
      "content": "<|vision_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151653": {
      "content": "<|vision_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151654": {
      "content": "<|vision_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151655": {
      "content": "<|image_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151656": {
      "content": "<|video_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151657": {
      "content": "<tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151658": {
      "content": "</tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151659": {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151660": {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151661": {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151662": {
      "content": "<|fim_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151663": {
      "content": "<|repo_name|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151664": {
      "content": "<|file_sep|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151665": {
      "content": "<tool_response>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151666": {
      "content": "</tool_response>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151667": {
      "content": "<think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151668": {
      "content": "</think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151669": {
      "content": "<|audio_start|>",
|
| 215 |
+
"lstrip": false,
|
| 216 |
+
"normalized": false,
|
| 217 |
+
"rstrip": false,
|
| 218 |
+
"single_word": false,
|
| 219 |
+
"special": true
|
| 220 |
+
},
|
| 221 |
+
"151670": {
|
| 222 |
+
"content": "<|audio_end|>",
|
| 223 |
+
"lstrip": false,
|
| 224 |
+
"normalized": false,
|
| 225 |
+
"rstrip": false,
|
| 226 |
+
"single_word": false,
|
| 227 |
+
"special": true
|
| 228 |
+
},
|
| 229 |
+
"151671": {
|
| 230 |
+
"content": "<tts_pad>",
|
| 231 |
+
"lstrip": false,
|
| 232 |
+
"normalized": false,
|
| 233 |
+
"rstrip": false,
|
| 234 |
+
"single_word": false,
|
| 235 |
+
"special": true
|
| 236 |
+
},
|
| 237 |
+
"151672": {
|
| 238 |
+
"content": "<tts_text_bos>",
|
| 239 |
+
"lstrip": false,
|
| 240 |
+
"normalized": false,
|
| 241 |
+
"rstrip": false,
|
| 242 |
+
"single_word": false,
|
| 243 |
+
"special": true
|
| 244 |
+
},
|
| 245 |
+
"151673": {
|
| 246 |
+
"content": "<tts_text_eod>",
|
| 247 |
+
"lstrip": false,
|
| 248 |
+
"normalized": false,
|
| 249 |
+
"rstrip": false,
|
| 250 |
+
"single_word": false,
|
| 251 |
+
"special": true
|
| 252 |
+
},
|
| 253 |
+
"151674": {
|
| 254 |
+
"content": "<tts_text_bos_single>",
|
| 255 |
+
"lstrip": false,
|
| 256 |
+
"normalized": false,
|
| 257 |
+
"rstrip": false,
|
| 258 |
+
"single_word": false,
|
| 259 |
+
"special": true
|
| 260 |
+
},
|
| 261 |
+
"151675": {
|
| 262 |
+
"content": "<|audio_pad|>",
|
| 263 |
+
"lstrip": false,
|
| 264 |
+
"normalized": false,
|
| 265 |
+
"rstrip": false,
|
| 266 |
+
"single_word": false,
|
| 267 |
+
"special": true
|
| 268 |
+
}
|
| 269 |
+
},
|
| 270 |
+
"additional_special_tokens": [
|
| 271 |
+
"<|im_start|>",
|
| 272 |
+
"<|im_end|>",
|
| 273 |
+
"<|object_ref_start|>",
|
| 274 |
+
"<|object_ref_end|>",
|
| 275 |
+
"<|box_start|>",
|
| 276 |
+
"<|box_end|>",
|
| 277 |
+
"<|quad_start|>",
|
| 278 |
+
"<|quad_end|>",
|
| 279 |
+
"<|vision_start|>",
|
| 280 |
+
"<|vision_end|>",
|
| 281 |
+
"<|vision_pad|>",
|
| 282 |
+
"<|image_pad|>",
|
| 283 |
+
"<|video_pad|>",
|
| 284 |
+
"<|audio_start|>",
|
| 285 |
+
"<|audio_end|>",
|
| 286 |
+
"<tts_pad>",
|
| 287 |
+
"<tts_text_bos>",
|
| 288 |
+
"<tts_text_bos_single>",
|
| 289 |
+
"<|audio_pad|>"
|
| 290 |
+
],
|
| 291 |
+
"extra_special_tokens": {
|
| 292 |
+
"image_token": "<|image_pad|>",
|
| 293 |
+
"audio_token": "<|audio_pad|>",
|
| 294 |
+
"video_token": "<|video_pad|>",
|
| 295 |
+
"vision_bos_token": "<|vision_start|>",
|
| 296 |
+
"vision_eos_token": "<|vision_end|>",
|
| 297 |
+
"audio_bos_token": "<|audio_start|>",
|
| 298 |
+
"audio_eos_token": "<|audio_end|>"
|
| 299 |
+
},
|
| 300 |
+
"bos_token": null,
|
| 301 |
+
"clean_up_tokenization_spaces": false,
|
| 302 |
+
"eos_token": "<|im_end|>",
|
| 303 |
+
"errors": "replace",
|
| 304 |
+
"model_max_length": 131072,
|
| 305 |
+
"pad_token": "<|endoftext|>",
|
| 306 |
+
"split_special_tokens": false,
|
| 307 |
+
"tokenizer_class": "Qwen2Tokenizer",
|
| 308 |
+
"unk_token": null,
|
| 309 |
+
"image_token": "<|image_pad|>",
|
| 310 |
+
"audio_token": "<|audio_pad|>",
|
| 311 |
+
"video_token": "<|video_pad|>",
|
| 312 |
+
"vision_bos_token": "<|vision_start|>",
|
| 313 |
+
"vision_eos_token": "<|vision_end|>",
|
| 314 |
+
"audio_bos_token": "<|audio_start|>",
|
| 315 |
+
"audio_eos_token": "<|audio_end|>"
|
| 316 |
+
}
|
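One way to read the token table above is to compare the entries flagged `"special": true` in `added_tokens_decoder` against the `additional_special_tokens` list. The sketch below is illustrative only: it uses a trimmed copy of ids 151671 to 151675 (the `lstrip`, `normalized`, `rstrip`, and `single_word` fields are dropped), and the helper name `special_tokens_not_listed` is hypothetical, not part of any library. Whether a token missing from the list matters depends on how the loading code uses that list, which this diff does not show.

```python
import json

# Trimmed excerpt of the tokenizer_config.json diff above (ids 151671-151675).
# Only the fields needed for the comparison are kept.
CONFIG_EXCERPT = json.loads("""
{
  "added_tokens_decoder": {
    "151671": {"content": "<tts_pad>", "special": true},
    "151672": {"content": "<tts_text_bos>", "special": true},
    "151673": {"content": "<tts_text_eod>", "special": true},
    "151674": {"content": "<tts_text_bos_single>", "special": true},
    "151675": {"content": "<|audio_pad|>", "special": true}
  },
  "additional_special_tokens": [
    "<tts_pad>", "<tts_text_bos>", "<tts_text_bos_single>", "<|audio_pad|>"
  ]
}
""")


def special_tokens_not_listed(config: dict) -> list[str]:
    """Hypothetical helper: tokens flagged special in `added_tokens_decoder`
    that do not appear in `additional_special_tokens`, ordered by token id."""
    listed = set(config.get("additional_special_tokens", []))
    decoder = config.get("added_tokens_decoder", {})
    return [
        entry["content"]
        for _, entry in sorted(decoder.items(), key=lambda kv: int(kv[0]))
        if entry.get("special") and entry["content"] not in listed
    ]


print(special_tokens_not_listed(CONFIG_EXCERPT))  # ['<tts_text_eod>']
```

Running the same helper on the full file, loaded with `json.load`, would also report any earlier special tokens that sit outside this excerpt.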
vocab.json
ADDED
|
The diff for this file is too large to render.
|
|
|
vocence_config.yaml
ADDED
|
@@ -0,0 +1,26 @@
|
| 1 |
+
# PromptTTS settings read by the canonical wrapper and your miner.py.
|
| 2 |
+
# Validator gate 4h.ii requires:
|
| 3 |
+
# - this file exists at the pinned revision
|
| 4 |
+
# - parses as YAML and is a mapping
|
| 5 |
+
# - has a non-empty `model_name` string
|
| 6 |
+
# - `model_name` equals the on-chain `model_name` you commit
|
| 7 |
+
model_name: "might2901/king_05"
|
| 8 |
+
|
| 9 |
+
runtime:
|
| 10 |
+
adapter: "qwen3-tts"
|
| 11 |
+
device_preference: "cuda"
|
| 12 |
+
dtype: "bfloat16"
|
| 13 |
+
use_flash_attention_2: false
|
| 14 |
+
default_language: "English"
|
| 15 |
+
|
| 16 |
+
generation:
|
| 17 |
+
sample_rate: 24000
|
| 18 |
+
max_seconds: 20
|
| 19 |
+
guidance_scale: 1.0
|
| 20 |
+
|
| 21 |
+
io:
|
| 22 |
+
output_format: "wav"
|
| 23 |
+
|
| 24 |
+
limits:
|
| 25 |
+
max_text_chars: 2000
|
| 26 |
+
max_instruction_chars: 600
|
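The comments at the top of `vocence_config.yaml` list what validator gate 4h.ii requires. A minimal sketch of the last three checks follows, assuming the file has already been fetched at the pinned revision and parsed (for example with PyYAML's `yaml.safe_load`). The function name `check_vocence_config` and its return shape are assumptions made for illustration; the validator's actual implementation is not shown in this commit.

```python
from typing import Any


def check_vocence_config(config: Any, on_chain_model_name: str) -> list[str]:
    """Hypothetical check mirroring the gate 4h.ii comments.

    `config` is the already-parsed YAML document. Returns a list of
    human-readable problems; an empty list means all checks passed.
    The "file exists at the pinned revision" check is out of scope here.
    """
    # The document must parse to a mapping, not a list or a scalar.
    if not isinstance(config, dict):
        return ["config is not a mapping"]

    problems: list[str] = []
    model_name = config.get("model_name")

    # `model_name` must be a non-empty string.
    if not isinstance(model_name, str) or not model_name.strip():
        problems.append("model_name must be a non-empty string")
    # ...and it must equal the model name committed on-chain.
    elif model_name != on_chain_model_name:
        problems.append(
            f"model_name {model_name!r} does not match on-chain {on_chain_model_name!r}"
        )
    return problems


# Values taken from the file above; the on-chain name is assumed to match.
parsed = {"model_name": "might2901/king_05", "runtime": {"adapter": "qwen3-tts"}}
print(check_vocence_config(parsed, "might2901/king_05"))  # []
```

Returning a list of problems instead of raising on the first failure is a design choice for this sketch: it lets a miner see every issue in one pass before committing.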