Enhanced Vocence Miner - Time-Budget Management Edition

Built on top of magma90909/vocence_miner_v3 (current top performer), this enhanced version adds adaptive time-budget management and quality-based candidate selection for improved performance on the Vocence network.

Key Enhancements

1. Time-Budget Management. Intelligently adapts generation strategy based on inference speed:

  • Fast path (<40s): Generates all 6 candidates + scores each → returns best
  • Moderate path (40-65s): Generates 2 candidates + scores → returns best
  • Slow path (>65s): Returns first candidate immediately (no scoring overhead)
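The three paths above can be sketched as a simple threshold function. This is a minimal illustration, not the miner's actual code: the function name `plan_generation` and the choice of "seconds taken by the first candidate" as the timing signal are assumptions, since the card does not say exactly what is measured. The 40 s and 65 s thresholds and the candidate counts come from the list above.

```python
def plan_generation(first_candidate_seconds: float) -> dict:
    """Choose a generation strategy from an observed inference time.

    Assumption: the timing signal is how long the first candidate took.
    """
    if first_candidate_seconds < 40:
        # Fast path: enough budget to generate and score all 6 candidates.
        return {"path": "fast", "candidates": 6, "score": True}
    if first_candidate_seconds <= 65:
        # Moderate path: generate and score only 2 candidates.
        return {"path": "moderate", "candidates": 2, "score": True}
    # Slow path: return the first candidate, skipping scoring entirely.
    return {"path": "slow", "candidates": 1, "score": False}


if __name__ == "__main__":
    for seconds in (12.0, 50.0, 80.0):
        print(seconds, plan_generation(seconds))
```

Treating exactly 40 s and exactly 65 s as part of the moderate path is one reading of the "40-65s" range; the boundaries could reasonably be drawn the other way.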

2. Quality-Based Selection. When multiple candidates are generated, uses composite scoring:

  • UTMOSv2 (35%): Naturalness and perceptual quality
  • Whisper WER (65%): Script accuracy and transcription alignment

3. Validation Filtering. Pre-filters candidates before scoring:

  • Duration checks (2-29.5 seconds)
  • Audio quality checks (RMS, peak amplitude)
  • Waveform validity checks
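The pre-filter above might look like the following sketch. Only the 2-29.5 second duration window comes from the card; the RMS floor and peak ceiling (`MIN_RMS`, `MAX_PEAK`) are hypothetical placeholder values, and "waveform validity" is interpreted here as "non-empty and free of NaN or infinite samples". The function uses only the standard library and accepts any sequence of float samples.

```python
import math

MIN_SECONDS = 2.0    # from the card
MAX_SECONDS = 29.5   # from the card
MIN_RMS = 0.005      # hypothetical: reject near-silent audio
MAX_PEAK = 1.0       # hypothetical: reject samples outside [-1, 1]


def is_valid_candidate(samples, sample_rate: int) -> bool:
    """Return True if a mono waveform passes all pre-scoring checks."""
    samples = [float(s) for s in samples]
    # Waveform validity: must be non-empty with only finite values.
    if sample_rate <= 0 or not samples:
        return False
    if not all(math.isfinite(s) for s in samples):
        return False
    # Duration check.
    duration = len(samples) / sample_rate
    if not (MIN_SECONDS <= duration <= MAX_SECONDS):
        return False
    # Audio quality checks: RMS level and peak amplitude.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return rms >= MIN_RMS and peak <= MAX_PEAK


if __name__ == "__main__":
    sr = 100
    tone = [0.5 * math.sin(2 * math.pi * 5 * i / sr) for i in range(3 * sr)]
    print(is_valid_candidate(tone, sr))
```

Running the cheap checks first means a broken candidate never reaches the more expensive UTMOSv2 and Whisper scoring step.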

Expected Performance

  • Pass rate: 90-95% (higher than base model alone)
  • Average score: 0.93-0.95 composite
  • Latency: Adaptive 40-130s (depends on text complexity)

Base Model Features

The underlying model (v3) provides:

1. Full-sentence generation. Earlier checkpoints would sometimes render only the first clause of a longer input: the rest of the sentence would be cut off, dropped, or replaced with silence. v3 generates the entire input from start to end, including longer sentences with intermediate clauses, em-dashes, and parenthetical asides.

2. More natural delivery. Across the same prompt set, v3 produces audibly smoother prosody: fewer flat reads on neutral prompts, less "narrated" surface on short utterances, and more believable breath placement on persona reads.


Use it

pip install qwen-tts transformers torch soundfile

from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load the v3 base checkpoint from the Hugging Face Hub.
m = Qwen3TTSModel.from_pretrained("magma90909/vocence_miner_v3")

# Generate speech for the text, styled by the free-text instruct description.
wavs, sr = m.generate_voice_design(
    text="When I got home, the lights were on, the back door was wide open, and somebody had left tea brewing on the kitchen counter.",
    instruct="A nervous middle-aged man recounting the moment, slightly hushed, slightly fast.",
    language="english",
)
# Write the first generated waveform to disk.
sf.write("out.wav", wavs[0], sr)

The example deliberately uses a long, multi-clause sentence, the kind that earlier checkpoints would clip mid-read.


What instruct understands

Axis      Working values
Gender    male, female
Pitch     deep, low, medium, high, thin
Pace      slow, halting, moderate, brisk, fast
Affect    neutral, happy, sad, angry, fearful, urgent, calm, projected, whispered, sarcastic
Persona   bedtime storyteller, news anchor, sports announcer, stern parent, weary narrator

Lead with gender on emotion-heavy prompts to avoid timbre drift.
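As a small illustration of the gender-first advice, the helper below assembles an instruct string from the axes in the table. The helper name `build_instruct` and the exact phrasing it emits are assumptions; instruct is free text, and the card does not prescribe a template.

```python
def build_instruct(gender, affect, pitch=None, pace=None, persona=None):
    """Compose an instruct string that always leads with gender.

    Hypothetical helper: the wording is illustrative, not a required format.
    """
    parts = [f"{gender} voice"]  # gender first, to limit timbre drift
    if persona:
        parts.append(persona)
    if pitch:
        parts.append(f"{pitch} pitch")
    if pace:
        parts.append(f"{pace} pace")
    parts.append(f"{affect} delivery")
    text = ", ".join(parts)
    return text[0].upper() + text[1:] + "."


if __name__ == "__main__":
    print(build_instruct("female", "sad", pitch="low", pace="slow"))
```

The resulting string can be passed as the instruct argument of generate_voice_design in place of a hand-written description.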


Caveats

  • English only: other languages were not part of this checkpoint's adaptation set.
  • Strongly expressive reads (drawn-out sad reads, projected announcer reads) may run slightly less precise on automatic transcription than the base. The trade-off was made deliberately for delivery character.
  • CC BY-NC-SA 4.0: research and non-commercial use only.

What's in the repo

  • model.safetensors: merged Talker weights
  • speech_tokenizer/: Qwen3 12 Hz audio codec
  • tokenizer.json, vocab.json, merges.txt, configs: text-side assets
  • miner.py, chute_config.yml, vocence_config.yaml: Vocence engine glue (TEE / pro_6000)
  • demo.py: quick smoke test

The Vocence files make this repo deployable on Bittensor SN78 (Vocence) via the canonical Vocence/Chutes wrapper without modification.

Model size: 2B params (Safetensors, BF16 tensors)