---
license: cc-by-nc-sa-4.0
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
pipeline_tag: text-to-speech
library_name: transformers
language:
- en
tags:
- tts
- prompttts
- qwen3-tts
- voice-design
- vocence
---

# vocence_miner_v3

A reliability-and-naturalness pass over the prompt-driven Qwen3-TTS-12Hz-1.7B-VoiceDesign backbone. v3 ships two changes that matter at inference time:

**1. Full-sentence generation.** Earlier checkpoints would sometimes render only the first clause of a longer input: the rest of the sentence would be cut off, dropped, or replaced with silence. v3 generates the entire input from start to end, including longer sentences with intermediate clauses, em-dashes, and parenthetical asides.

**2. More natural delivery.** Across the same prompt set, v3 produces audibly smoother prosody: fewer flat reads on neutral prompts, less "narrated" surface on short utterances, and more believable breath placement on persona reads.

Everything else stays the same: free-form English `instruct`, 24 kHz mono output, single-call inference, no reference audio.

---

## Use it

```bash
pip install qwen-tts transformers torch soundfile
```

```python
from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load the merged checkpoint from the Hub.
m = Qwen3TTSModel.from_pretrained("magma90909/vocence_miner_v3")

# One call: text plus a free-form English voice description, no reference audio.
wavs, sr = m.generate_voice_design(
    text="When I got home, the lights were on, the back door was wide open, and somebody had left tea brewing on the kitchen counter.",
    instruct="A nervous middle-aged man recounting the moment, slightly hushed, slightly fast.",
    language="english",
)

# Write the first (and only) waveform to disk.
sf.write("out.wav", wavs[0], sr)
```

The example deliberately uses a long, multi-clause sentence, the kind that earlier checkpoints would clip mid-read.
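
One rough way to spot a clipped read on your own prompts is to compare the clip duration against the word count of the input. The sketch below is a heuristic and not part of this repo: `clip_seconds`, `looks_truncated`, and the `max_words_per_sec` threshold are hypothetical names and values chosen for illustration. It assumes you pass in the sample count of `wavs[0]` and the returned `sr`.

```python
def clip_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a mono clip in seconds."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def looks_truncated(text: str, num_samples: int, sample_rate: int,
                    max_words_per_sec: float = 5.0) -> bool:
    """Flag a clip too short to plausibly contain every word of `text`.

    `max_words_per_sec` is an assumed upper bound on speaking rate
    (a hypothetical default, tune it for your prompts).
    """
    words = len(text.split())
    min_seconds = words / max_words_per_sec
    return clip_seconds(num_samples, sample_rate) < min_seconds


# Usage with the outputs of a generate_voice_design call,
# where `text` holds the input string you passed in:
#   if looks_truncated(text, len(wavs[0]), sr):
#       print("clip is suspiciously short, possible clipped read")
```

A duration check only catches reads that stop early; it will not notice a clip that is padded with silence, so listening to a sample of outputs is still worthwhile.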

---

## What `instruct` understands

| Axis | Working values |
|------|----------------|
| Gender | male, female |
| Pitch | deep, low, medium, high, thin |
| Pace | slow, halting, moderate, brisk, fast |
| Affect | neutral, happy, sad, angry, fearful, urgent, calm, projected, whispered, sarcastic |
| Persona | bedtime storyteller, news anchor, sports announcer, stern parent, weary narrator |

Lead with gender on emotion-heavy prompts to avoid timbre drift.

---

## Caveats

- English only: other languages were not part of this checkpoint's adaptation set.
- Strongly expressive reads (drawn-out sad reads, projected announcer reads) may score slightly lower on automatic transcription accuracy than the base model. The trade-off was made deliberately in favor of delivery character.
- CC BY-NC-SA 4.0: research and non-commercial use only.

---

## What's in the repo

- `model.safetensors`: merged Talker weights
- `speech_tokenizer/`: Qwen3 12 Hz audio codec
- `tokenizer.json`, `vocab.json`, `merges.txt`, configs: text-side assets
- `miner.py`, `chute_config.yml`, `vocence_config.yaml`: Vocence engine glue (TEE / pro_6000)
- `demo.py`: quick smoke test

The Vocence files make this repo deployable on **Bittensor SN78 (Vocence)** via the canonical Vocence/Chutes wrapper without modification.
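
Referring back to the `instruct` axes table, prompts can also be assembled programmatically. The helper below is a minimal sketch, not something shipped in this repo: `AXES`, `build_instruct`, and the sentence template are hypothetical, and the model accepts any free-form English, so this exact wording is only one option. It puts gender first, following the advice about timbre drift.

```python
# Working values copied from the `instruct` axes table.
AXES = {
    "gender": {"male", "female"},
    "pitch": {"deep", "low", "medium", "high", "thin"},
    "pace": {"slow", "halting", "moderate", "brisk", "fast"},
    "affect": {"neutral", "happy", "sad", "angry", "fearful", "urgent",
               "calm", "projected", "whispered", "sarcastic"},
    "persona": {"bedtime storyteller", "news anchor", "sports announcer",
                "stern parent", "weary narrator"},
}


def build_instruct(gender, pitch=None, pace=None, affect=None, persona=None):
    """Compose an `instruct` string that leads with gender (hypothetical template)."""
    if gender not in AXES["gender"]:
        raise ValueError(f"{gender!r} is not a listed gender value")
    optional = {"pitch": pitch, "pace": pace, "affect": affect, "persona": persona}
    for axis, value in optional.items():
        if value is not None and value not in AXES[axis]:
            raise ValueError(f"{value!r} is not a listed {axis} value")

    # Gender comes first, then persona (or a generic noun), then the rest.
    parts = [f"A {gender} {persona or 'speaker'}"]
    if pitch:
        parts.append(f"{pitch} pitch")
    if pace:
        parts.append(f"{pace} pace")
    if affect:
        parts.append(f"{affect} delivery")
    return ", ".join(parts) + "."


print(build_instruct("female", pitch="low", pace="slow",
                     affect="calm", persona="bedtime storyteller"))
# prints: A female bedtime storyteller, low pitch, slow pace, calm delivery.
```

The returned string can be passed as the `instruct` argument of `generate_voice_design`.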