---
license: cc-by-nc-sa-4.0
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
pipeline_tag: text-to-speech
library_name: transformers
language:
  - en
tags:
  - tts
  - prompttts
  - qwen3-tts
  - voice-design
  - vocence
---

# vocence_miner_v3

A reliability-and-naturalness pass over the prompt-driven Qwen3-TTS-12Hz-1.7B-VoiceDesign backbone. v3 ships two changes that matter at inference time:

**1. Full-sentence generation.** Earlier checkpoints would sometimes render only the first clause of a longer input, with the rest of the sentence cut off, dropped, or replaced with silence. v3 generates the entire input from start to end, including longer sentences with intermediate clauses, em-dashes, and parenthetical asides.

**2. More natural delivery.** Across the same prompt set, v3 produces audibly smoother prosody: fewer flat reads on neutral prompts, less "narrated" surface on short utterances, and more believable breath placement on persona reads.

Everything else stays the same: free-form English `instruct`, 24 kHz mono output, single-call inference, no reference audio.

---

## Use it

```bash
pip install qwen-tts transformers torch soundfile
```

```python
from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load the v3 checkpoint by its repo id.
m = Qwen3TTSModel.from_pretrained("magma90909/vocence_miner_v3")

# Single call: `text` is what to say, `instruct` describes how to say it.
wavs, sr = m.generate_voice_design(
    text="When I got home, the lights were on, the back door was wide open, and somebody had left tea brewing on the kitchen counter.",
    instruct="A nervous middle-aged man recounting the moment, slightly hushed, slightly fast.",
    language="english",
)
# Save the first returned waveform at the returned sample rate.
sf.write("out.wav", wavs[0], sr)
```

The example deliberately uses a long, multi-clause sentence, the kind that earlier checkpoints would clip mid-read.
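
To check for clipping without listening to every file, one rough option is to compare the clip length against the word count of the input. The sketch below is a heuristic, not part of this repo: `looks_truncated` and its `max_words_per_second` threshold are hypothetical, and the threshold is an assumed upper bound on speaking rate that you would tune for your own prompts.

```python
def duration_seconds(wav, sample_rate):
    """Length of a mono waveform (any sequence of samples) in seconds."""
    return len(wav) / float(sample_rate)


def looks_truncated(text, wav, sample_rate, max_words_per_second=5.0):
    """Flag audio that is too short to plausibly contain every word.

    max_words_per_second is an assumed ceiling on speaking rate,
    not a value taken from this model card.
    """
    n_words = len(text.split())
    min_expected = n_words / max_words_per_second
    return duration_seconds(wav, sample_rate) < min_expected
```

With the output of the snippet above, this would be called as `looks_truncated(text, wavs[0], sr)`. It only catches gross truncation; a clip that drops one short clause can still pass.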

---

## What `instruct` understands

| Axis | Working values |
|------|----------------|
| Gender | male, female |
| Pitch | deep, low, medium, high, thin |
| Pace | slow, halting, moderate, brisk, fast |
| Affect | neutral, happy, sad, angry, fearful, urgent, calm, projected, whispered, sarcastic |
| Persona | bedtime storyteller, news anchor, sports announcer, stern parent, weary narrator |

Lead with gender on emotion-heavy prompts to avoid timbre drift.
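
If you generate prompts programmatically, a small helper can keep gender first and restrict each axis to the values in the table. This is a minimal sketch: `build_instruct` and its phrasing template are hypothetical, and since `instruct` is free-form English, other wordings may work equally well.

```python
# Working values copied from the table above.
AXES = {
    "gender": ("male", "female"),
    "pitch": ("deep", "low", "medium", "high", "thin"),
    "pace": ("slow", "halting", "moderate", "brisk", "fast"),
    "affect": ("neutral", "happy", "sad", "angry", "fearful", "urgent",
               "calm", "projected", "whispered", "sarcastic"),
    "persona": ("bedtime storyteller", "news anchor", "sports announcer",
                "stern parent", "weary narrator"),
}


def build_instruct(gender=None, pitch=None, pace=None, affect=None, persona=None):
    """Compose an instruct string with gender first (hypothetical template)."""
    chosen = {"gender": gender, "pitch": pitch, "pace": pace,
              "affect": affect, "persona": persona}
    for axis, value in chosen.items():
        if value is not None and value not in AXES[axis]:
            raise ValueError(f"unknown {axis}: {value!r}")

    parts = []
    if gender:
        parts.append(f"{gender} voice")  # gender leads, per the note above
    if pitch:
        parts.append(f"{pitch} pitch")
    if pace:
        parts.append(f"{pace} pace")
    if affect:
        parts.append(f"{affect} tone")
    if persona:
        parts.append(f"in the style of a {persona}")
    if not parts:
        raise ValueError("at least one axis must be set")
    return ", ".join(parts).capitalize() + "."
```

For example, `build_instruct(gender="female", pace="slow", affect="sad")` yields a string that can be passed directly as `instruct`.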

---

## Caveats

- English only: other languages were not part of this checkpoint's adaptation set.
- Strongly expressive reads (drawn-out sad reads, projected announcer reads) may transcribe slightly less accurately under automatic speech recognition than the base model. The trade-off was made deliberately for delivery character.
- CC BY-NC-SA 4.0: research and non-commercial use only.
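
To put a number on the transcription caveat for your own prompts, one common choice is word error rate (WER) between the input text and an ASR transcript of the output. This card does not state which metric or ASR system was used, so treat the sketch below as an assumption; `word_error_rate` is a hypothetical helper, and the transcript has to come from an ASR system of your choosing.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Both strings are lowercased and split on whitespace; punctuation
    is not stripped, so normalise the inputs first if that matters.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Running the same prompts through this checkpoint and the base model, then comparing the two WER values, gives a rough view of the trade-off described above.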

---

## What's in the repo

- `model.safetensors`: merged Talker weights
- `speech_tokenizer/`: Qwen3 12 Hz audio codec
- `tokenizer.json`, `vocab.json`, `merges.txt`, configs: text-side assets
- `miner.py`, `chute_config.yml`, `vocence_config.yaml`: Vocence engine glue (TEE / pro_6000)
- `demo.py`: quick smoke test

The Vocence files make this repo deployable on **Bittensor SN78 (Vocence)** via the canonical Vocence/Chutes wrapper without modification.