Voice Design Guide (Voice Design README)

Overview

This guide provides instructions for creating high-quality voice descriptions (voice_prompt) to generate voices that meet specific requirements. The voice description serves as the blueprint for voice design and directly determines the quality of the generated voice.


Technical Constraints

| Item | Description |
| --- | --- |
| Length Limit | Each voice_prompt ≤ 200 characters |
| Supported Languages | Chinese only. English is not supported in the current version and will be added in future updates |
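The two constraints above can be checked before a prompt is submitted. The following is a minimal sketch, not part of VoiceSculptor itself: the function names are hypothetical, "characters" is assumed to mean Unicode code points (what Python's `len` counts), and the Chinese check is only a heuristic that looks for at least one character in the basic CJK Unified Ideographs block.

```python
from typing import List

MAX_PROMPT_CHARS = 200  # length limit stated in this guide


def within_length_limit(voice_prompt: str) -> bool:
    """Return True if the prompt has at most MAX_PROMPT_CHARS code points."""
    return len(voice_prompt) <= MAX_PROMPT_CHARS


def contains_chinese(voice_prompt: str) -> bool:
    """Heuristic (assumption): True if any character lies in U+4E00..U+9FFF."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in voice_prompt)


def check_constraints(voice_prompt: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if not within_length_limit(voice_prompt):
        problems.append("prompt is longer than 200 characters")
    if not contains_chinese(voice_prompt):
        problems.append("prompt contains no Chinese characters")
    return problems
```

A prompt that passes both checks returns an empty list; anything else returns one message per violated constraint.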

Five Core Principles

1️⃣ Be Specific, Not Vague

✅ Recommended: Use perceptible and concrete voice attributes

  • Pitch: low, high, bright, rich
  • Speaking rate: fast, slow, rapid, steady
  • Timbre: magnetic, husky, smooth, clear

❌ Avoid: “Nice”, “normal”, “good” (too subjective and uninformative)


2️⃣ Multi-Dimensional, Not Single-Attribute

✅ Recommended: Combine at least 3–4 dimensions to create a vivid voice profile

  • Persona (usage scenario) + gender + age + pitch + speaking rate + volume + timbre + emotion

❌ Avoid: Only “female voice” or only “low-pitched” (too generic, lacks distinctiveness)


3️⃣ Objective, Not Subjective

✅ Recommended: Describe physical and acoustic characteristics

  • “Slightly high-pitched with energetic delivery”
  • “Slow speaking rate with clear articulation”

❌ Avoid: “My favorite voice”, “This voice sounds great”


4️⃣ Original, Not Imitative

⚠️ Copyright Notice: Descriptions such as “sounds like XX celebrity” or “imitates XX actor” are prohibited

✅ Recommended: Describe voice characteristics directly rather than referencing specific individuals


5️⃣ Concise, Not Redundant

✅ Recommended: Ensure every word conveys meaningful information

❌ Avoid: “Very, very good voice”, “Extremely, extremely gentle”


Reference Dimensions for Voice Description

Based on high-quality examples, we recommend composing voice prompts using the following dimensions:

| Dimension | Example Options |
| --- | --- |
| Persona (Usage Scenario) | News broadcasting, advertising voice-over, audiobooks, animated characters, documentary narration |
| Gender | Male, Female |
| Age | Child (~8 years), Young adult (20–30), Middle-aged (40–50), Elderly |
| Personality Traits | Lively, calm, gentle, intellectual, cute, serious |
| Speaking Rate & Rhythm | Fast, slow, moderate, urgent, steady |
| Intonation Style | Rising, neutral, passionate, relaxed |
| Timbre | Deep and magnetic, crisp and bright, husky and warm, youthful |
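One way to keep a prompt multi-dimensional is to fill in the dimensions as separate fields and join them afterwards. The sketch below is illustrative only: `VoiceProfile` and its methods are hypothetical names, not part of any VoiceSculptor API, and the values are written in English to mirror the examples in this guide, although actual prompts would be written in Chinese given the language constraint stated earlier.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class VoiceProfile:
    """Hypothetical container with one optional field per dimension in this guide."""
    persona: Optional[str] = None
    gender: Optional[str] = None
    age: Optional[str] = None
    personality: Optional[str] = None
    speaking_rate: Optional[str] = None
    intonation: Optional[str] = None
    timbre: Optional[str] = None

    def filled(self) -> List[str]:
        """Return the values of all dimensions that were provided, in field order."""
        values = (getattr(self, f.name) for f in fields(self))
        return [v for v in values if v]

    def to_prompt(self, separator: str = ", ") -> str:
        """Join the provided dimensions into a single description string."""
        return separator.join(self.filled())


profile = VoiceProfile(
    persona="documentary narration",
    gender="male",
    age="middle-aged",
    timbre="deep and magnetic",
)
print(len(profile.filled()))  # number of dimensions covered
print(profile.to_prompt())
```

Counting the filled fields gives a quick check against the recommendation to combine at least three or four dimensions.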

High-Quality Examples

✅ Recommended Templates

Example 1: Poetry Recitation

“A male modern poetry reciter with a deep, magnetic low voice, delivering poetry with strong rhythmic pauses, powerful volume, and intense emotional expression.”

Example 2: News Style

“A female news anchor speaking standard Mandarin with a clear and bright mid-to-high pitch, steady professional pacing, strong volume, and a neutral, objective tone.”

Example 3: Advertising Voice-Over

“A male voice for liquor brand advertising, featuring a rich and weathered timbre, slow and bold speaking rate, strong volume, conveying a sense of history and masculinity.”


Common Mistakes and Improvements

| Type | ❌ Not Recommended | ✅ Improved Version |
| --- | --- | --- |
| Too Generic | “Female voice, nice” | “Young female voice with a clear pitch and moderate speaking rate” |
| Subjective Evaluation | “A great-sounding voice” | “Bright timbre with strong expressiveness” |
| Single Dimension | “Low-pitched male voice” | “Middle-aged male with a low pitch, slow pacing, suitable for documentaries” |
| Redundant Wording | “Very, very gentle voice” | “Gentle and intellectual female voice” |
| Imitation Request | “Sounds like XX celebrity” | Prohibited: describe objective voice traits instead |

Quick Checklist

Before submitting a voice_prompt, make sure that:

  • Length ≤ 200 characters
  • At least 3 different descriptive dimensions are included
  • No subjective evaluation words (e.g., “nice”, “great”, “favorite”)
  • No references to real individuals or imitation requests
  • No repetitive or exaggerated wording
  • Usage scenario is clearly defined
  • All descriptors are perceptible and concrete
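Some items on this checklist can be screened mechanically. The following is a rough sketch under stated assumptions: `lint_voice_prompt` is a hypothetical helper, the word and phrase lists are taken from the English examples in this guide and would need Chinese equivalents for real prompts, and the repetition check only catches a word immediately repeated after a space or comma. Checks such as "usage scenario is clearly defined" still need human review.

```python
import re
from typing import List

MAX_PROMPT_CHARS = 200  # length limit stated in this guide
# Assumption: lists built from this guide's English examples only.
SUBJECTIVE_WORDS = ("nice", "great", "favorite", "good", "normal")
IMITATION_PHRASES = ("sounds like", "imitates")
# Matches a word immediately repeated, e.g. "Very, very".
REPEATED_WORD = re.compile(r"\b(\w+)\b[\s,]+\1\b", re.IGNORECASE)


def lint_voice_prompt(voice_prompt: str) -> List[str]:
    """Return a list of warnings; an empty list means no issue was detected."""
    warnings = []
    lowered = voice_prompt.lower()
    if len(voice_prompt) > MAX_PROMPT_CHARS:
        warnings.append("length exceeds 200 characters")
    for word in SUBJECTIVE_WORDS:
        if re.search(rf"\b{re.escape(word)}\b", lowered):
            warnings.append(f"subjective word: {word}")
    for phrase in IMITATION_PHRASES:
        if phrase in lowered:
            warnings.append(f"imitation request: {phrase}")
    if REPEATED_WORD.search(voice_prompt):
        warnings.append("repeated wording")
    return warnings


for prompt in ("Female voice, nice", "Very, very gentle voice"):
    print(prompt, "->", lint_voice_prompt(prompt))
```

An empty result does not mean the prompt is good, only that none of these simple patterns matched.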