
Vocence PromptTTS Miner - Base Parler-TTS

This repository contains a Vocence-compliant miner using the base Parler-TTS Mini v1 model.

Model Details

  • Model: parler-tts/parler-tts-mini-v1 (880M parameters)
  • Source: HuggingFace (loaded at runtime)
  • Sample Rate: 44.1kHz
  • Instruction Format: Vocence-compliant voice trait descriptions
  • No fine-tuning: Uses pretrained base model

Voice Traits Supported

The model follows Vocence instruction format:

  • Gender: male, female, neutral
  • Pitch: low, mid, high
  • Speed: slow, normal, fast
  • Age: child, young_adult, adult, senior
  • Emotion: neutral, happy, sad, angry, calm, excited, serious, fearful
  • Tone: warm, cold, friendly, formal, casual, authoritative
  • Accent: us, uk, au, in, neutral, other
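
The trait vocabulary above can be checked before an instruction is built. A minimal sketch; the `SUPPORTED_TRAITS` table and `validate_traits` helper are illustrative, not part of the miner API:

```python
# Illustrative sketch: the supported trait vocabulary as plain sets,
# with a small validator. Names here are hypothetical, not miner API.
SUPPORTED_TRAITS = {
    "gender": {"male", "female", "neutral"},
    "pitch": {"low", "mid", "high"},
    "speed": {"slow", "normal", "fast"},
    "age": {"child", "young_adult", "adult", "senior"},
    "emotion": {"neutral", "happy", "sad", "angry", "calm", "excited", "serious", "fearful"},
    "tone": {"warm", "cold", "friendly", "formal", "casual", "authoritative"},
    "accent": {"us", "uk", "au", "in", "neutral", "other"},
}

def validate_traits(traits: dict[str, str]) -> None:
    """Raise ValueError if any trait name or value is outside the supported set."""
    for name, value in traits.items():
        allowed = SUPPORTED_TRAITS.get(name)
        if allowed is None:
            raise ValueError(f"unknown trait: {name!r}")
        if value not in allowed:
            raise ValueError(f"unsupported value {value!r} for trait {name!r}")
```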

Instruction Format

A {emotion} {age_group} {gender} voice with {pitch} pitch, speaking at {speed} speed in a {tone} manner with a {accent} accent.

Example

A calm adult male voice with mid pitch, speaking at normal speed in a casual manner with a US accent.
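
The template can also be filled programmatically from individual trait values. A minimal sketch; `build_instruction` is a hypothetical helper, not part of the miner contract:

```python
def build_instruction(emotion: str, age_group: str, gender: str,
                      pitch: str, speed: str, tone: str, accent: str) -> str:
    """Fill the Vocence instruction template with concrete trait values.
    The miner itself only consumes the final string."""
    return (
        f"A {emotion} {age_group} {gender} voice with {pitch} pitch, "
        f"speaking at {speed} speed in a {tone} manner with a {accent} accent."
    )
```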

Usage

The miner implements the Vocence contract:

```python
from pathlib import Path
from miner import Miner

# Initialize (model downloads from HuggingFace on first run)
miner = Miner(Path("./"))

# Warmup
miner.warmup()

# Generate
instruction = "A calm adult male voice with mid pitch, speaking at normal speed in a casual manner with a neutral accent."
text = "Hello world, this is a test of the Vocence PromptTTS system."
waveform, sample_rate = miner.generate_wav(instruction, text)
```
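
Assuming `generate_wav` returns a mono float32 array in [-1, 1], the result can be written to disk with the standard-library `wave` module. `save_wav` is a hypothetical helper, not part of the miner contract:

```python
import wave

import numpy as np

def save_wav(path: str, waveform: np.ndarray, sample_rate: int) -> None:
    """Write a mono float32 waveform in [-1, 1] to disk as 16-bit PCM WAV."""
    pcm = np.clip(waveform, -1.0, 1.0)          # guard against out-of-range samples
    pcm = (pcm * 32767.0).astype(np.int16)      # scale to 16-bit integer range
    with wave.open(path, "wb") as f:
        f.setnchannels(1)                       # mono
        f.setsampwidth(2)                       # 2 bytes per sample (16-bit)
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```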

Deployment

Requirements

  • GPU with 16GB+ VRAM
  • Python 3.12+
  • Dependencies: torch, transformers, parler-tts, etc. (see chute_config.yml)

Quick Deploy to Chutes

  1. Upload this repo to HuggingFace
  2. Render the Vocence canonical wrapper template with your repo details
  3. Build and deploy to Chutes
  4. Register on Vocence subnet

See DEPLOYMENT_GUIDE.md for detailed steps.

Local A/B Evaluation

Use eval_ab.py to compare prompt-conditioning strategies and quickly tune decoding:

  • python eval_ab.py --mode raw
  • python eval_ab.py --mode conditioned
  • python eval_ab.py --mode both

Outputs are written to eval_outputs/:

  • per-case WAV files for listening tests
  • metrics.csv with duration, RMS, clipping ratio, and silence ratio

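The metrics in metrics.csv can be reproduced from a raw waveform. A sketch of one plausible definition of each metric; the thresholds are illustrative defaults, not necessarily the values eval_ab.py uses:

```python
import numpy as np

def basic_audio_metrics(waveform: np.ndarray, sample_rate: int,
                        clip_thresh: float = 0.999,
                        silence_thresh: float = 0.01) -> dict:
    """Duration, RMS, clipping ratio, and silence ratio for a mono float waveform.
    Thresholds are illustrative, not taken from eval_ab.py."""
    abs_wav = np.abs(waveform)
    return {
        "duration_s": len(waveform) / sample_rate,
        # Root-mean-square level of the signal
        "rms": float(np.sqrt(np.mean(waveform ** 2))),
        # Fraction of samples at or above the clipping threshold
        "clipping_ratio": float(np.mean(abs_wav >= clip_thresh)),
        # Fraction of samples below the silence threshold
        "silence_ratio": float(np.mean(abs_wav < silence_thresh)),
    }
```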
Performance Characteristics

Strengths

  • High quality: inherits the audio quality of the Parler-TTS base model
  • Proven reliability: the base model is widely used and well tested
  • Fast deployment: no training required; the model is pulled from HuggingFace at runtime
  • Script accuracy: output speech follows the input text closely
  • Naturalness: human-like prosody and timbre

Limitations

  • ⚠️ No fine-tuning: Not optimized for specific voice characteristics
  • ⚠️ Generic traits: May not hit all trait combinations perfectly
  • ⚠️ Baseline performance: Competitive but not cutting-edge

Expected Vocence Score

Estimated: 70-80%

  • Script accuracy (30%): ~80-85% ⭐⭐⭐⭐
  • Naturalness (15%): ~85-90% ⭐⭐⭐⭐
  • Trait control (55%): ~65-75% ⭐⭐⭐

To reach 90%+ score, consider:

  1. Fine-tuning on multi-speaker datasets
  2. Adding voice trait classification training
  3. Optimizing for edge cases

API Contract

The miner provides:

  • __init__(path_hf_repo: Path) - Initialize with repo path
  • warmup() - Optional warmup call
  • generate_wav(instruction: str, text: str) -> tuple[np.ndarray, int] - Generate audio

Output format:

  • Mono float32 numpy array
  • 44.1kHz sample rate
  • Values in range [-1, 1]
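
A quick way to sanity-check a miner against this output contract; `check_output_contract` is a hypothetical helper, not part of the contract itself:

```python
import numpy as np

def check_output_contract(waveform: np.ndarray, sample_rate: int) -> None:
    """Assert that miner output matches the documented contract:
    mono float32 numpy array, 44.1 kHz, values in [-1, 1]."""
    assert isinstance(waveform, np.ndarray) and waveform.ndim == 1, "expected a mono 1-D array"
    assert waveform.dtype == np.float32, "expected float32 samples"
    assert sample_rate == 44100, "expected a 44.1 kHz sample rate"
    assert float(np.abs(waveform).max(initial=0.0)) <= 1.0, "samples must lie in [-1, 1]"
```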

License

Based on Parler-TTS (Apache 2.0 License)

Support

For issues or questions, please open an issue in this repository.
