---
license: cc-by-nc-sa-4.0
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
pipeline_tag: text-to-speech
library_name: transformers
language:
- en
tags:
- tts
- qwen3-tts
- voice-design
- prompttts
- vocence
- bittensor
---
Inference uses **`qwen_tts.Qwen3TTSModel`**, loaded from the repository root with `Qwen3TTSModel.from_pretrained(...)` (a local path or a Hugging Face repo id).
## Layout
| Path | Role |
|------|------|
| `config.json`, weights, tokenizer, codec dirs | Qwen3-TTS snapshot (as shipped by the upstream model card) |
| `miner.py` | Vocence engine: `Miner`, `warmup()`, `generate_wav(instruction, text)` |
| `vocence_config.yaml` | Device, dtype, caps, language |
| `chute_config.yml` | Chutes image / GPU / scaling / TEE |
| `demo.py` | Optional local smoke test (if present) |
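Before pushing, it can help to confirm that the Vocence-specific files from the table are present. This is a minimal sketch that assumes the file names above are the required set. `check_layout` is a hypothetical helper, not part of `miner.py`, and it does not validate the Qwen3-TTS snapshot beyond `config.json`.

```python
from pathlib import Path

# Files from the Layout table; demo.py is optional, so it is not required here.
REQUIRED_FILES = ("config.json", "miner.py", "vocence_config.yaml", "chute_config.yml")


def check_layout(root: Path) -> list[str]:
    """Return the required files missing from `root` (an empty list means OK)."""
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = check_layout(Path("."))
    print("layout OK" if not missing else f"missing: {', '.join(missing)}")
```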
## Vocence API
Validators call your deployed chute with JSON shaped like:
```json
{
"text": "Words to speak.",
"instruction": "gender: male | pitch: mid | speed: normal | age_group: adult | emotion: neutral | tone: casual | accent: us"
}
```
The miner forwards **`text`** → `generate_voice_design(..., text=...)` and **`instruction`** → `instruct=...`. It takes the language from **`default_language`** in `vocence_config.yaml` (English by default).
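The forwarding rule can be sketched as a pure function from the validator payload to `generate_voice_design` keyword arguments. This is an illustrative sketch, not the code in `miner.py`. `format_instruction` and `build_generate_kwargs` are hypothetical names. The `"english"` fallback mirrors the `language="english"` value used in the direct-load example further down and is an assumption.

```python
from typing import Any, Mapping


def format_instruction(attrs: Mapping[str, str]) -> str:
    """Join voice attributes into the 'key: value | key: value' shape of the payload."""
    return " | ".join(f"{key}: {value}" for key, value in attrs.items())


def build_generate_kwargs(payload: Mapping[str, Any], config: Mapping[str, Any]) -> dict:
    """Map a validator request onto generate_voice_design(...) keyword arguments.

    text -> text, instruction -> instruct; language comes from the miner config
    (hypothetical fallback to "english" when default_language is unset).
    """
    return {
        "text": str(payload["text"]),
        "instruct": str(payload["instruction"]),
        "language": config.get("default_language", "english"),
    }


if __name__ == "__main__":
    instruction = format_instruction({
        "gender": "male", "pitch": "mid", "speed": "normal", "age_group": "adult",
        "emotion": "neutral", "tone": "casual", "accent": "us",
    })
    kwargs = build_generate_kwargs({"text": "Words to speak.", "instruction": instruction}, {})
    print(kwargs)
    # model.generate_voice_design(**kwargs) would then synthesize the audio.
```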
## Configure (`vocence_config.yaml`)
| Area | Keys |
|------|------|
| Runtime | `device_preference` (`cuda` / `cpu`), `dtype` (`bfloat16` / `float32`), `use_flash_attention_2`, `default_language` |
| Generation | `sample_rate` (Hz, e.g. 24000), `max_seconds` |
| Limits | `max_text_chars`, `max_instruction_chars` |
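The limit and generation keys suggest a request guard like the one below. This is a minimal sketch. It assumes the config has already been loaded into a dict (for example with `yaml.safe_load`) and that over-long inputs are rejected rather than truncated. The real `miner.py` may handle them differently. `validate_request` and `max_output_samples` are hypothetical helpers, and the numbers in the usage block are illustrative only.

```python
from typing import Any, Mapping


def validate_request(text: str, instruction: str, config: Mapping[str, Any]) -> None:
    """Raise ValueError if text is empty or either input exceeds its configured cap."""
    if not text.strip():
        raise ValueError("text is empty")
    if len(text) > int(config["max_text_chars"]):
        raise ValueError(f"text exceeds max_text_chars={config['max_text_chars']}")
    if len(instruction) > int(config["max_instruction_chars"]):
        raise ValueError(
            f"instruction exceeds max_instruction_chars={config['max_instruction_chars']}"
        )


def max_output_samples(config: Mapping[str, Any]) -> int:
    """Upper bound on output length in samples: sample_rate (Hz) * max_seconds."""
    return int(config["sample_rate"] * config["max_seconds"])


if __name__ == "__main__":
    # Illustrative values only; take the real ones from vocence_config.yaml.
    cfg = {"sample_rate": 24000, "max_seconds": 30,
           "max_text_chars": 500, "max_instruction_chars": 300}
    validate_request("Hello there.", "A calm narrator.", cfg)
    print(max_output_samples(cfg))  # 24000 * 30 = 720000
```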
Warmup runs one short `generate_voice_design` call with a **180 s** timeout.
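One way to enforce a timeout like this is to run warmup on a worker thread and stop waiting after the deadline. This is a hedged sketch using the standard-library `concurrent.futures` module, not necessarily how `miner.py` does it. `run_with_timeout` is a hypothetical name. Python cannot kill the worker thread, so a stuck generation keeps running in the background. The caller only stops waiting for it.

```python
import concurrent.futures
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_timeout(fn: Callable[[], T], timeout_s: float = 180.0) -> T:
    """Run fn() on a worker thread; raise concurrent.futures.TimeoutError after timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn)
        return future.result(timeout=timeout_s)
    finally:
        # Don't block on a call that is still running; the thread cannot be killed.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # Stand-in for miner.warmup(); replace with the real call when the model is loaded.
    print(run_with_timeout(lambda: "warmup finished", timeout_s=180.0))
```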
## Local quick test
Install PyTorch (CUDA if available), then:
```bash
pip install "qwen-tts" pyyaml soundfile numpy
```
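```python
from pathlib import Path

from miner import Miner

# Repo root: the folder holding miner.py, vocence_config.yaml, and the model snapshot.
miner = Miner(Path("."))
miner.warmup()  # one short generation, bounded by the 180 s warmup timeout
# wave: generated audio samples; sr: sample rate in Hz
wave, sr = miner.generate_wav(
    instruction="A calm, clear narrator, neutral US accent.",
    text="Hello — this is a short synthesis check.",
)
```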
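To listen to or sanity-check the `wave, sr` pair returned by `generate_wav`, you can write it to disk. This is a minimal sketch. It assumes `generate_wav` returns a mono float waveform in [-1, 1] (array-like) and an integer sample rate. To keep the check dependency-light, it uses the standard-library `wave` module for 16-bit PCM output instead of `soundfile`. `save_pcm16` is a hypothetical helper.

```python
import wave as wav_io  # stdlib module; aliased so it does not clash with the `wave` variable

import numpy as np


def save_pcm16(samples, sr: int, path: str) -> float:
    """Write a mono float waveform as 16-bit PCM WAV; return the duration in seconds."""
    x = np.asarray(samples, dtype=np.float32).reshape(-1)
    if x.size == 0 or not np.all(np.isfinite(x)):
        raise ValueError("waveform is empty or contains NaN/inf")
    # Clip before scaling so out-of-range samples saturate instead of wrapping.
    pcm = (np.clip(x, -1.0, 1.0) * 32767.0).astype("<i2")  # little-endian int16
    with wav_io.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit
        f.setframerate(int(sr))
        f.writeframes(pcm.tobytes())
    return x.size / float(sr)


# Usage with the quick-test output:
# duration = save_pcm16(wave, sr, "check.wav")
# print(f"{duration:.2f} s of audio at {sr} Hz")
```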
Alternatively, skip `Miner` and load the model class directly from the repo's transformers-style layout:
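```python
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(".")  # or your HF repo id
# generate_voice_design returns a list of waveforms plus the sample rate.
wavs, sr = model.generate_voice_design(
    text="Hello fellas.",
    instruct="Cute voice.",
    language="english",
)
sf.write("voice_design.wav", wavs[0], sr)
```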
Replace `"."` with your HF repo id after upload, e.g. `"your-org/your-repo"`.
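The runtime keys in `vocence_config.yaml` map naturally onto the loading arguments used in the upstream Qwen3-TTS examples (`device_map`, `dtype`, `attn_implementation`). The sketch below assumes that mapping. It is not taken from `miner.py`. `pretrained_kwargs` is hypothetical. The helper returns dtype as a name so it does not need to import torch. Falling back to `float32` on CPU is an assumption, not a documented rule.

```python
from typing import Any, Mapping


def pretrained_kwargs(config: Mapping[str, Any], cuda_available: bool) -> dict:
    """Translate vocence_config.yaml runtime keys into from_pretrained(...) kwargs.

    dtype is returned as a string (e.g. "bfloat16"); resolve it with
    getattr(torch, name) before passing it on.
    """
    use_cuda = config.get("device_preference", "cuda") == "cuda" and cuda_available
    kwargs: dict = {
        "device_map": "cuda:0" if use_cuda else "cpu",
        # Assumption: the configured dtype is only used on GPU; CPU falls back to float32.
        "dtype": config.get("dtype", "bfloat16") if use_cuda else "float32",
    }
    if use_cuda and config.get("use_flash_attention_2", False):
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


# Usage (requires torch and qwen_tts):
# kw = pretrained_kwargs(cfg, torch.cuda.is_available())
# kw["dtype"] = getattr(torch, kw["dtype"])
# model = Qwen3TTSModel.from_pretrained(".", **kw)
```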
## Chutes / Vocence deploy
1. Push this layout to a Hugging Face **model** repo; pin a **commit SHA** for `VOCENCE_REVISION`.
2. Render the canonical Vocence chute script with `VOCENCE_REPO`, `VOCENCE_REVISION`, `VOCENCE_CHUTES_USER`, `VOCENCE_CHUTE_ID`.
3. `chutes build … --wait` then `chutes deploy … --accept-fee`.
4. Commit on chain: `model_name`, `model_revision` (HF SHA), `chute_id` (UUID from Chutes).
Chute **name** must contain **`vocence`** (case-insensitive). See **`miner_sample/MINER_GUIDE.md`** in the Vocence repo.
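Before building and committing, you can check the values from steps 1 to 4 locally. This is a minimal sketch under stated assumptions: the revision is pinned as a full 40-character hexadecimal Git commit SHA, the chute ID is a standard UUID string, and the name rule is the case-insensitive `vocence` substring described above. `check_deploy_values` is hypothetical and not part of the Vocence or Chutes tooling.

```python
import re
import uuid

_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def check_deploy_values(chute_name: str, revision: str, chute_id: str) -> list[str]:
    """Return a list of problems (an empty list means the values look deployable)."""
    problems = []
    if "vocence" not in chute_name.lower():
        problems.append("chute name must contain 'vocence' (case-insensitive)")
    if not _FULL_SHA.fullmatch(revision.lower()):
        problems.append("revision should be a full 40-character commit SHA, not a branch or short SHA")
    try:
        uuid.UUID(chute_id)
    except ValueError:
        problems.append("chute_id is not a valid UUID")
    return problems


if __name__ == "__main__":
    # A branch name and a malformed ID, so two problems are reported.
    print(check_deploy_values("my-Vocence-miner", "main", "not-a-uuid"))
```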
## Training / fine-tuning
Fine-tuning is done **outside** Chutes on your own GPU; export a full snapshot compatible with **`Qwen3TTSModel.from_pretrained(...)`**, then replace weights in this repo layout and push a new revision.
## License
**CC BY-NC-SA 4.0** — see the license file in this repo. Respect upstream Qwen / Alibaba terms for the base checkpoint.