	---
	license: cc-by-nc-sa-4.0
	base_model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
	pipeline_tag: text-to-speech
	library_name: transformers
	language:
	- en
	tags:
	- tts
	- prompttts
	- qwen3-tts
	- voice-design
	- vocence
	---

	# vocence_miner_v3

	A reliability-and-naturalness pass over the prompt-driven Qwen3-TTS-12Hz-1.7B-VoiceDesign backbone. v3 ships two changes that matter at inference time:

	1. Full-sentence generation. Earlier checkpoints would sometimes render only the first clause of a longer input — the rest of the sentence would be cut off, dropped, or replaced with silence. v3 generates the entire input from start to end, including longer sentences with intermediate clauses, em-dashes, and parenthetical asides.

	2. More natural delivery. Compared with earlier checkpoints on the same prompt set, v3 produces audibly smoother prosody: fewer flat reads on neutral prompts, a less "narrated" feel on short utterances, and more believable breath placement on persona reads.

	Everything else stays the same: free-form English `instruct`, 24 kHz mono output, single-call inference, no reference audio.

	---

	## Use it

	```bash
	pip install qwen-tts transformers torch soundfile
	```

	```python
	from qwen_tts import Qwen3TTSModel
	import soundfile as sf

	# Load the checkpoint by its Hub ID.
	m = Qwen3TTSModel.from_pretrained("magma90909/vocence_miner_v3")

	# Single call: the text to speak plus a free-form English voice description.
	wavs, sr = m.generate_voice_design(
	    text="When I got home, the lights were on, the back door was wide open, and somebody had left tea brewing on the kitchen counter.",
	    instruct="A nervous middle-aged man recounting the moment, slightly hushed, slightly fast.",
	    language="english",
	)
	# wavs is a list of waveforms; write the first one to disk at sample rate sr.
	sf.write("out.wav", wavs[0], sr)
	```

	The example deliberately uses a long, multi-clause sentence — the kind that earlier checkpoints would clip mid-read.
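One way to spot-check that a read was not clipped is to compare the clip's duration against a rough lower bound derived from the word count, and to measure how long the clip stays near-silent at the end. The sketch below is a heuristic only, not part of the model or the `qwen_tts` package: the helper name `looks_truncated`, the 5 words-per-second ceiling, and both silence thresholds are assumptions that should be tuned by ear.

```python
def looks_truncated(samples, sr, text, max_words_per_sec=5.0,
                    silence_threshold=1e-3, max_tail_silence_sec=1.0):
    """Heuristic check (hypothetical helper): True if the clip seems clipped.

    samples: 1-D sequence of floats in [-1, 1], e.g. wavs[0]
    sr:      sample rate in Hz
    text:    the input text that was synthesized
    """
    n_words = len(text.split())
    duration = len(samples) / sr

    # 1) Too short: assume speech rarely exceeds max_words_per_sec,
    #    so a shorter clip has probably dropped words.
    if duration < n_words / max_words_per_sec:
        return True

    # 2) Long silent tail: walk back from the end until a sample
    #    reaches the amplitude threshold.
    last_loud = len(samples)
    while last_loud > 0 and abs(samples[last_loud - 1]) < silence_threshold:
        last_loud -= 1
    tail_silence = (len(samples) - last_loud) / sr
    return tail_silence > max_tail_silence_sec
```

With the example above, this would be called as `looks_truncated(wavs[0], sr, text)`, where `text` is the same string passed to `generate_voice_design`. A `True` result is only a prompt to listen to the clip, not proof of truncation.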

	---

	## What `instruct` understands

	| Axis | Working values |
	|------|----------------|
	| Gender | male, female |
	| Pitch | deep, low, medium, high, thin |
	| Pace | slow, halting, moderate, brisk, fast |
	| Affect | neutral, happy, sad, angry, fearful, urgent, calm, projected, whispered, sarcastic |
	| Persona | bedtime storyteller, news anchor, sports announcer, stern parent, weary narrator |

	On emotion-heavy prompts, state the gender first to avoid timbre drift.
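As an illustration of the gender-first advice, here is a minimal sketch that assembles an `instruct` string from the axes in the table. The helper `build_instruct` and its phrasing template are hypothetical. The model takes free-form English, so this is one possible wording rather than a required format.

```python
def build_instruct(gender, pitch=None, pace=None, affect=None, persona=None):
    """Compose a free-form instruct string that always names gender first.

    Hypothetical helper: the model only ever sees the returned plain text.
    """
    parts = [f"{gender.capitalize()} voice"]  # gender leads the prompt
    if pitch:
        parts.append(f"{pitch} pitch")
    if pace:
        parts.append(f"{pace} pace")
    if affect:
        parts.append(f"{affect} delivery")
    if persona:
        parts.append(f"in the style of a {persona}")
    return ", ".join(parts) + "."


print(build_instruct("male", pitch="deep", pace="slow", affect="sad"))
# Male voice, deep pitch, slow pace, sad delivery.
```

The returned string can be passed directly as the `instruct` argument of `generate_voice_design`.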

	---

	## Caveats

	- English only — other languages were not part of this checkpoint's adaptation set.
	- Strongly expressive reads (drawn-out sad reads, projected announcer reads) may be transcribed slightly less accurately by automatic speech recognition than the base model's output. This trade-off was made deliberately in favor of delivery character.
	- CC BY-NC-SA 4.0 — research and non-commercial use only.
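To put a number on the transcription caveat for a given prompt, one option is to compare the input text with a transcript of the generated audio using word error rate (WER). The sketch below assumes the transcript comes from a speech recognizer of your choice, which is not shown here. The card does not say which metric or recognizer was used when making the trade-off, so WER is an assumption, and `word_error_rate` is a hypothetical helper.

```python
import re


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    def normalize(s):
        # Lowercase and drop punctuation so "Home," matches "home".
        return re.sub(r"[^a-z0-9' ]+", " ", s.lower()).split()

    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming Levenshtein distance over words,
    # keeping only the previous row of the table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Running the same text and `instruct` through both this checkpoint and the base model, then comparing the two WER values, gives a rough per-prompt view of the trade-off.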

	---

	## What's in the repo

	- `model.safetensors` — merged Talker weights
	- `speech_tokenizer/` — Qwen3 12 Hz audio codec
	- `tokenizer.json`, `vocab.json`, `merges.txt`, configs — text-side assets
	- `miner.py`, `chute_config.yml`, `vocence_config.yaml` — Vocence engine glue (TEE / pro_6000)
	- `demo.py` — quick smoke test

	The Vocence files make this repo deployable on Bittensor SN78 (Vocence) via the canonical Vocence/Chutes wrapper without modification.
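After downloading the repo, a quick sanity check is to confirm that the files listed above are present. This is a minimal sketch: `missing_repo_files` is a hypothetical helper, and because the list above does not enumerate the config files, those are not checked.

```python
from pathlib import Path

# Names taken from the file list above; unnamed "configs" are skipped.
EXPECTED = [
    "model.safetensors",
    "speech_tokenizer",  # directory
    "tokenizer.json",
    "vocab.json",
    "merges.txt",
    "miner.py",
    "chute_config.yml",
    "vocence_config.yaml",
    "demo.py",
]


def missing_repo_files(repo_dir):
    """Return the expected entries that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```

An empty list means every listed entry was found. The check only looks at names, so it says nothing about whether the weights themselves are intact.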