
Vocence PromptTTS Miner - Base Parler-TTS

This repository contains a Vocence-compliant miner using the base Parler-TTS Mini v1 model.

Model Details

  • Model: parler-tts/parler-tts-mini-v1 (880M parameters)
  • Source: HuggingFace (loaded at runtime)
  • Sample Rate: 44.1kHz
  • Instruction Format: Vocence-compliant voice trait descriptions
  • No fine-tuning: Uses pretrained base model

Voice Traits Supported

The model follows Vocence instruction format:

  • Gender: male, female, neutral
  • Pitch: low, mid, high
  • Speed: slow, normal, fast
  • Age: child, young_adult, adult, senior
  • Emotion: neutral, happy, sad, angry, calm, excited, serious, fearful
  • Tone: warm, cold, friendly, formal, casual, authoritative
  • Accent: us, uk, au, in, neutral, other
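
The trait vocabulary above can be checked before an instruction is built. A minimal sketch; the `SUPPORTED_TRAITS` table and `validate_traits` helper are illustrative, not part of the miner API:

```python
# Illustrative sketch: the supported trait vocabulary as plain sets,
# with a small validator. Names here are hypothetical, not miner API.
SUPPORTED_TRAITS = {
    "gender": {"male", "female", "neutral"},
    "pitch": {"low", "mid", "high"},
    "speed": {"slow", "normal", "fast"},
    "age": {"child", "young_adult", "adult", "senior"},
    "emotion": {"neutral", "happy", "sad", "angry", "calm", "excited", "serious", "fearful"},
    "tone": {"warm", "cold", "friendly", "formal", "casual", "authoritative"},
    "accent": {"us", "uk", "au", "in", "neutral", "other"},
}

def validate_traits(traits: dict[str, str]) -> None:
    """Raise ValueError if any trait name or value is outside the supported set."""
    for name, value in traits.items():
        allowed = SUPPORTED_TRAITS.get(name)
        if allowed is None:
            raise ValueError(f"unknown trait: {name!r}")
        if value not in allowed:
            raise ValueError(f"unsupported value {value!r} for trait {name!r}")
```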

Instruction Format

A {emotion} {age_group} {gender} voice with {pitch} pitch, speaking at {speed} speed in a {tone} manner with a {accent} accent.

Example

A calm adult male voice with mid pitch, speaking at normal speed in a casual manner with a US accent.
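
The template can also be filled programmatically from individual trait values. A minimal sketch; `build_instruction` is a hypothetical helper, not part of the miner contract:

```python
def build_instruction(emotion: str, age_group: str, gender: str,
                      pitch: str, speed: str, tone: str, accent: str) -> str:
    """Fill the Vocence instruction template with concrete trait values.
    The miner itself only consumes the final string."""
    return (
        f"A {emotion} {age_group} {gender} voice with {pitch} pitch, "
        f"speaking at {speed} speed in a {tone} manner with a {accent} accent."
    )
```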

Usage

The miner implements the Vocence contract:

```python
from pathlib import Path
from miner import Miner

# Initialize (model downloads from HuggingFace on first run)
miner = Miner(Path("./"))

# Warmup
miner.warmup()

# Generate
instruction = "A calm adult male voice with mid pitch, speaking at normal speed in a casual manner with a neutral accent."
text = "Hello world, this is a test of the Vocence PromptTTS system."
waveform, sample_rate = miner.generate_wav(instruction, text)
```
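
Assuming `generate_wav` returns a mono float32 array in [-1, 1], the result can be written to disk with the standard-library `wave` module. `save_wav` is a hypothetical helper, not part of the miner contract:

```python
import wave

import numpy as np

def save_wav(path: str, waveform: np.ndarray, sample_rate: int) -> None:
    """Write a mono float32 waveform in [-1, 1] to disk as 16-bit PCM WAV."""
    pcm = np.clip(waveform, -1.0, 1.0)          # guard against out-of-range samples
    pcm = (pcm * 32767.0).astype(np.int16)      # scale to 16-bit integer range
    with wave.open(path, "wb") as f:
        f.setnchannels(1)                       # mono
        f.setsampwidth(2)                       # 2 bytes per sample (16-bit)
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```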

Deployment

Requirements

  • GPU with 16GB+ VRAM
  • Python 3.12+
  • Dependencies: torch, transformers, parler-tts, etc. (see chute_config.yml)

Quick Deploy to Chutes

  1. Upload this repo to HuggingFace
  2. Render the Vocence canonical wrapper template with your repo details
  3. Build and deploy to Chutes
  4. Register on Vocence subnet

See DEPLOYMENT_GUIDE.md for detailed steps.

Local A/B Evaluation

Use eval_ab.py to compare prompt-conditioning strategies and quickly tune decoding:

  • python eval_ab.py --mode raw
  • python eval_ab.py --mode conditioned
  • python eval_ab.py --mode both

Outputs are written to eval_outputs/:

  • per-case WAV files for listening tests
  • metrics.csv with duration, RMS, clipping ratio, and silence ratio

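The metrics in metrics.csv can be reproduced from a raw waveform. A sketch of one plausible definition of each metric; the thresholds are illustrative defaults, not necessarily the values eval_ab.py uses:

```python
import numpy as np

def basic_audio_metrics(waveform: np.ndarray, sample_rate: int,
                        clip_thresh: float = 0.999,
                        silence_thresh: float = 0.01) -> dict:
    """Duration, RMS, clipping ratio, and silence ratio for a mono float waveform.
    Thresholds are illustrative, not taken from eval_ab.py."""
    abs_wav = np.abs(waveform)
    return {
        "duration_s": len(waveform) / sample_rate,
        # Root-mean-square level of the signal
        "rms": float(np.sqrt(np.mean(waveform ** 2))),
        # Fraction of samples at or above the clipping threshold
        "clipping_ratio": float(np.mean(abs_wav >= clip_thresh)),
        # Fraction of samples below the silence threshold
        "silence_ratio": float(np.mean(abs_wav < silence_thresh)),
    }
```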
Performance Characteristics

Strengths

  • High quality: inherits the audio quality of the Parler-TTS base model
  • Proven reliability: the base model is widely used and well tested
  • Fast deployment: no training required; the model is pulled from HuggingFace at runtime
  • Script accuracy: output speech follows the input text closely
  • Naturalness: human-like prosody and timbre

Limitations

  • ⚠️ No fine-tuning: Not optimized for specific voice characteristics
  • ⚠️ Generic traits: May not hit all trait combinations perfectly
  • ⚠️ Baseline performance: Competitive but not cutting-edge

Expected Vocence Score

Estimated: 70-80%

  • Script accuracy (30%): ~80-85% ⭐⭐⭐⭐
  • Naturalness (15%): ~85-90% ⭐⭐⭐⭐
  • Trait control (55%): ~65-75% ⭐⭐⭐

To reach 90%+ score, consider:

  1. Fine-tuning on multi-speaker datasets
  2. Adding voice trait classification training
  3. Optimizing for edge cases

API Contract

The miner provides:

  • __init__(path_hf_repo: Path) - Initialize with repo path
  • warmup() - Optional warmup call
  • generate_wav(instruction: str, text: str) -> tuple[np.ndarray, int] - Generate audio

Output format:

  • Mono float32 numpy array
  • 44.1kHz sample rate
  • Values in range [-1, 1]
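
A quick way to sanity-check a miner against this output contract; `check_output_contract` is a hypothetical helper, not part of the contract itself:

```python
import numpy as np

def check_output_contract(waveform: np.ndarray, sample_rate: int) -> None:
    """Assert that miner output matches the documented contract:
    mono float32 numpy array, 44.1 kHz, values in [-1, 1]."""
    assert isinstance(waveform, np.ndarray) and waveform.ndim == 1, "expected a mono 1-D array"
    assert waveform.dtype == np.float32, "expected float32 samples"
    assert sample_rate == 44100, "expected a 44.1 kHz sample rate"
    assert float(np.abs(waveform).max(initial=0.0)) <= 1.0, "samples must lie in [-1, 1]"
```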

License

Based on Parler-TTS (Apache 2.0 License)

Support

For issues or questions, please open an issue in this repository.
