Voice-Taxonomy-57

57-dimension voice taxonomy classifier that analyzes speech audio across temporal dynamics, prosody, voice quality, resonance, style, and paralinguistic features.

Each dimension is predicted as a value on a 0–6 scale with a human-readable tag. Averaged across all 57 dimensions, the model achieves 76.2% within-one-class (±1) accuracy and 47.6% exact accuracy.
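To make the two reported metrics concrete, here is a minimal sketch of how exact and within-one-class (±1) accuracy can be computed for one dimension. The function names and the toy labels are illustrative and not part of the released code.

```python
from typing import Sequence


def exact_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that hit the labelled bucket exactly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions at most one bucket away from the label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels on the 0-6 scale (made up for illustration).
labels = [0, 3, 3, 5, 6]
preds = [1, 3, 5, 4, 6]
print(exact_accuracy(labels, preds))       # 0.4
print(within_one_accuracy(labels, preds))  # 0.8
```

Because the buckets are ordinal, the ±1 metric counts a neighbouring bucket as a hit, which is why it is always at least as high as exact accuracy.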

Quick Start

from inference import VoiceTaxonomy57

# Load model
model = VoiceTaxonomy57.from_pretrained("laion/Voice-Taxonomy-57")

# Predict from audio file
result = model.predict("speech.wav")

# Print all 57 dimension predictions
for dim, pred in sorted(result.items()):
    print(f"{dim}: {pred['value']} — {pred['tag_short']} ({pred['confidence']:.0%})")

# Get compact tag string
print(model.format_tags(result, format="short"))
# → "peak adult vigor, baseline alert present, completely static unchanging, ..."

# Batch inference
results = model.predict(["a.wav", "b.mp3", "c.flac"], batch_size=16)
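Since accuracy varies a lot between dimensions, it can be useful to keep only predictions above a confidence threshold. The sketch below works on a dictionary shaped like the `result` printed above (keys `value`, `tag_short`, `confidence`). The helper name, the threshold, the use of dimension codes as keys, and the example values are assumptions made for illustration.

```python
from typing import Any, Dict

Prediction = Dict[str, Any]


def confident_only(
    result: Dict[str, Prediction], min_confidence: float = 0.5
) -> Dict[str, Prediction]:
    """Keep only the dimensions whose confidence reaches the threshold."""
    return {
        dim: pred
        for dim, pred in result.items()
        if pred["confidence"] >= min_confidence
    }


# Hand-written stand-in for a model.predict(...) result; the numbers are made up.
example = {
    "TEMP": {"value": 3, "tag_short": "standard conversational", "confidence": 0.62},
    "ARSH": {"value": 3, "tag_short": "completely static unchanging", "confidence": 0.41},
}

kept = confident_only(example, min_confidence=0.5)
print(sorted(kept))  # ['TEMP']
```

Whether the reported confidence is well calibrated is not stated, so any threshold should be checked against labelled data before relying on it.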

CLI Usage

# JSON output
python inference.py --input audio_folder/ --output results.json --batch-size 16

# Human-readable tags
python inference.py --input speech.wav --format tags-short

# CPU inference
python inference.py --input speech.wav --device cpu --fp32

Architecture

Audio (any format ffmpeg can decode) → ffmpeg decode → 16 kHz mono
  → WhisperFeatureExtractor → mel spectrogram
  → BUD-E-Whisper V1.0 encoder → [B, 1500, 768]
  → BUD-E-Whisper V1.1 encoder → [B, 1500, 768]
  → Duration-aware frame truncation
  → Split first half / second half
  → Mean-pool each half per encoder
  → Concatenate: [V1.1_first, V1.1_second, V1.0_first, V1.0_second] → [B, 3072]
  → Per-dimension PCA(96) → [B, 96]
  → Per-dimension MLP(96 → 96 → n_classes) → prediction (0–6)

Encoder: Two fine-tuned Whisper-small encoders (BUD-E-Whisper V1.0 and V1.1) produce complementary voice representations.

Feature extraction: Each encoder's hidden states are split at the temporal midpoint and mean-pooled, yielding 4 vectors of 768 dims each (3072 total). This captures both early and late voice characteristics.
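The truncate, split, and pool steps can be sketched in plain Python as follows. The 50 frames per second figure follows from 1500 encoder frames covering Whisper's 30-second window. The exact truncation rule (round up, cap at 1500, keep at least two frames) is an assumption, since only "duration-aware frame truncation" is documented. A real pipeline would do this on tensors rather than lists.

```python
import math
from statistics import fmean
from typing import List

FRAMES_PER_SECOND = 50  # 1500 encoder frames cover a 30 s window
MAX_FRAMES = 1500

Vector = List[float]


def valid_frames(duration_s: float) -> int:
    """Assumed truncation rule: keep only frames that overlap real audio."""
    n = math.ceil(duration_s * FRAMES_PER_SECOND)
    return max(2, min(MAX_FRAMES, n))  # at least 2 so both halves are non-empty


def mean_pool(frames: List[Vector]) -> Vector:
    """Average a list of frame vectors element-wise."""
    return [fmean(column) for column in zip(*frames)]


def pool_halves(hidden: List[Vector], duration_s: float) -> Vector:
    """Truncate to valid frames, split at the midpoint, mean-pool each half."""
    frames = hidden[: valid_frames(duration_s)]
    mid = len(frames) // 2
    return mean_pool(frames[:mid]) + mean_pool(frames[mid:])


def build_features(
    v11_hidden: List[Vector], v10_hidden: List[Vector], duration_s: float
) -> Vector:
    """Concatenate in the documented order: V1.1 halves, then V1.0 halves."""
    return pool_halves(v11_hidden, duration_s) + pool_halves(v10_hidden, duration_s)


# Toy hidden states: 1500 frames with 2 channels instead of 768.
hidden = [[float(i), 1.0] for i in range(MAX_FRAMES)]
print(pool_halves(hidden, duration_s=0.5))               # [5.5, 1.0, 18.0, 1.0]
print(len(build_features(hidden, hidden, duration_s=0.5)))  # 8
```

With 768 channels per encoder instead of 2, the same concatenation yields 4 × 768 = 3072 values per clip.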

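Classification: For each dimension, PCA reduces the 3072-dim feature vector to 96 dims, then a small MLP (96 → 96 → n_classes, with ReLU after the hidden layer) produces class logits. Each classifier has roughly 10K parameters, so the 57 classifiers total about 57 × 10K ≈ 570K trainable parameters.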
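As a rough illustration of one classification head and its size, the sketch below runs a PCA projection followed by a one-hidden-layer MLP on toy weights, then counts parameters for the real layer sizes. Taking the softmax maximum as the confidence, and counting biases in the parameter total, are assumptions. The 52/5 split between 7-class and 6-class dimensions comes from the Classes column of the per-dimension table.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def pca_transform(x, mean, components) -> List[float]:
    """Center the features, then project onto the principal components."""
    return matvec(components, [xi - mi for xi, mi in zip(x, mean)])


def mlp_logits(z, w1, b1, w2, b2) -> List[float]:
    """Hidden layer with ReLU, followed by a linear output layer."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(w1, z), b1)]
    return [o + b for o, b in zip(matvec(w2, hidden), b2)]


def predict(x, mean, components, w1, b1, w2, b2) -> Tuple[int, float]:
    """Return (bucket, confidence); confidence is the softmax max (assumed)."""
    logits = mlp_logits(pca_transform(x, mean, components), w1, b1, w2, b2)
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    value = max(range(len(probs)), key=probs.__getitem__)
    return value, probs[value]


def mlp_params(n_in: int, n_hidden: int, n_classes: int) -> int:
    """Weights plus biases of a one-hidden-layer MLP."""
    return (n_in * n_hidden + n_hidden) + (n_hidden * n_classes + n_classes)


# Toy head: 3 input features, 2 PCA components, 2 hidden units, 3 classes.
value, confidence = predict(
    x=[1.0, 2.0, 3.0],
    mean=[1.0, 1.0, 1.0],
    components=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
    w2=[[0.0, 0.0], [0.0, 2.0], [0.0, -1.0]], b2=[0.0, 0.0, 0.0],
)
print(value, round(confidence, 3))  # 1 0.844

# Real layer sizes: 52 dimensions with 7 classes, 5 dimensions with 6 classes.
total_params = 52 * mlp_params(96, 96, 7) + 5 * mlp_params(96, 96, 6)
print(mlp_params(96, 96, 7), total_params)  # 9991 569002
```

The count of 569,002 matches the stated figure of about 570K. The PCA projections are not included in it: 57 matrices of 3072 × 96 values come to roughly 16.8 million numbers, which, if stored as 32-bit floats, would account for most of the ~68 MB weights file.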

Integration with Empathic-Insight-Voice-Plus

Both pipelines share the BUD-E-Whisper V1.0 encoder. To avoid running it twice:

import torch
from transformers import WhisperModel, WhisperFeatureExtractor
from inference import VoiceTaxonomy57

# Load taxonomy model without auto-loading encoders
taxonomy = VoiceTaxonomy57.from_pretrained("laion/Voice-Taxonomy-57", load_encoders=False)

# Load both encoders once (V1.0 is the one shared with Empathic-Insight-Voice-Plus)
v10 = WhisperModel.from_pretrained("laion/BUD-E-Whisper").encoder.cuda().eval()
v11 = WhisperModel.from_pretrained("laion/BUD-E-Whisper_V1.1").encoder.cuda().eval()

# Run encoders once
# `waveforms` is assumed to be a list of 1-D float arrays, already decoded to 16 kHz mono
fe = WhisperFeatureExtractor.from_pretrained("laion/BUD-E-Whisper")
mel = fe(waveforms, sampling_rate=16000, return_tensors="pt").input_features.cuda()
with torch.no_grad():
    v10_hidden = v10(mel).last_hidden_state
    v11_hidden = v11(mel).last_hidden_state

# Feed V1.0 hidden states to Empathic-Insight-Voice-Plus emotion pipeline
# (55 emotion MLPs + 4 quality MLPs + whisper decoder for captions)
# ... emotion_results = empathic_model.predict_from_encoder(v10_hidden) ...

# Feed both to taxonomy pipeline (no re-encoding needed)
taxonomy_results = taxonomy.predict_from_encoder_outputs(
    v10_hidden_states=v10_hidden,
    v11_hidden_states=v11_hidden,
    durations=[len(wf) / 16000 for wf in waveforms],
)
# Combined: 55 emotions + 4 quality + 57 taxonomy = 116 voice annotations

Performance

Overall

Metric	Value
Dimensions	57
Mean exact accuracy	47.6%
Mean ±1 accuracy	76.2%
Tier A (±1 ≥ 85%)	15 dims
Tier B (70% ≤ ±1 < 85%)	26 dims
Tier C (55% ≤ ±1 < 70%)	14 dims
Tier D (±1 < 55%)	2 dims
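The tier boundaries above can be written as a small helper. The function name is illustrative; the four sample accuracies are taken from the per-dimension table.

```python
from collections import Counter


def tier(within_one_pct: float) -> str:
    """Map a ±1 accuracy (in percent) to its tier letter."""
    if within_one_pct >= 85.0:
        return "A"
    if within_one_pct >= 70.0:
        return "B"
    if within_one_pct >= 55.0:
        return "C"
    return "D"


# ±1 accuracies of four dimensions from the per-dimension table.
sample = {"ATCK": 97.1, "S_CONV": 84.8, "CHNK": 69.7, "RCQL": 48.6}
tiers = {dim: tier(acc) for dim, acc in sample.items()}
print(tiers)  # {'ATCK': 'A', 'S_CONV': 'B', 'CHNK': 'C', 'RCQL': 'D'}
print(Counter(tiers.values()))
```

Each tier is bounded above by the next threshold, so a dimension at 84.8% falls in Tier B and one at 69.7% in Tier C.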

Per-Dimension Accuracy

Dim	Name	Classes	Exact	±1	Tier
AGEV	Age of Voice	7	25.7%	68.6%	C
AROU	Arousal	7	54.3%	82.9%	B
ARSH	Arousal Shift	7	62.9%	88.6%	A
ATCK	Attack	7	60.0%	97.1%	A
BKGN	Background Noise	7	37.1%	68.6%	C
BRGT	Brightness	6	48.3%	79.3%	B
CHNK	Chunking	7	42.4%	69.7%	C
CLRT	Articulation	7	40.0%	71.4%	B
COGL	Cognitive Load	7	40.0%	60.0%	C
DARC	Dynamic Arc	7	55.9%	67.6%	C
DFLU	Disfluency	7	51.4%	82.9%	B
EMPH	Emphasis	7	51.4%	74.3%	B
ESTH	Aesthetic Quality	7	48.6%	85.7%	A
EXPL	Explicitness	7	28.6%	77.1%	B
FOCS	Focus/Engagement	7	42.9%	80.0%	B
FULL	Fullness/Body	7	23.1%	61.5%	C
GEND	Gender Presentation	7	42.9%	71.4%	B
HARM	Harmonicity	7	54.8%	71.0%	B
METL	Metallicness	7	52.9%	82.4%	B
RANG	Pitch Range	6	46.7%	76.7%	B
RCQL	Recording Quality	7	22.9%	48.6%	D
REGS	Register	7	65.7%	88.6%	A
RESP	Respiratory Audibility	7	57.1%	85.7%	A
ROUG	Roughness	6	51.7%	75.9%	B
R_CHST	Chest Resonance	7	67.6%	88.2%	A
R_HEAD	Head Resonance	7	29.4%	73.5%	B
R_MASK	Mask Resonance	7	68.8%	90.6%	A
R_MIXD	Mixed Resonance	6	36.7%	66.7%	C
R_NASL	Nasal Resonance	6	45.8%	75.0%	B
R_ORAL	Oral Resonance	7	37.5%	56.2%	C
R_THRT	Throat Resonance	7	48.6%	77.1%	B
SMTH	Smoothness	7	57.1%	88.6%	A
STNC	Stance	7	42.9%	74.3%	B
STRU	Structure	7	48.4%	90.3%	A
S_ASMR	ASMR Style	7	56.2%	75.0%	B
S_AUTH	Authoritative Style	7	64.7%	97.1%	A
S_CART	Cartoon Style	7	45.7%	85.7%	A
S_CASU	Casual Style	7	31.0%	58.6%	C
S_CONV	Conversational Style	7	48.5%	84.8%	B
S_DRAM	Dramatic Style	7	25.7%	42.9%	D
S_FORM	Formality	7	60.6%	90.9%	A
S_MONO	Monologue Style	7	62.9%	80.0%	B
S_NARR	Narration Style	7	41.9%	80.6%	B
S_NEWS	Newscaster Style	7	42.9%	77.1%	B
S_PLAY	Playfulness	7	31.4%	60.0%	C
S_RANT	Rant Style	7	54.3%	85.7%	A
S_STRY	Storytelling Style	7	60.6%	84.8%	B
S_TECH	Technical Style	7	58.1%	83.9%	B
S_WHIS	Whisper Style	7	44.1%	61.8%	C
TEMP	Tempo	7	42.4%	63.6%	C
TENS	Tension	7	56.2%	75.0%	B
VALN	Valence	7	60.0%	77.1%	B
VALS	Valence Shift	7	28.6%	62.9%	C
VFLX	Velocity Flex	7	57.1%	94.3%	A
VOLT	Volatility	7	57.1%	77.1%	B
VULN	Vulnerability	7	31.4%	57.1%	C
WARM	Warmth	7	60.7%	89.3%	A

All 57 Dimensions

Temporal Dynamics (6 dimensions)

TEMP — Tempo (±1: 63.6%) Mechanical speed of word/syllable production.

Value	Short Tag	Description
0	glacially slow	Glacially slow, syllables stretched to breaking point
1	heavily deliberate	Unusually deliberate and labored word production
2	relaxed unhurried	Relaxed, slightly below conversational average
3	standard conversational	Standard everyday speech tempo
4	brisk elevated	Brisk, high-engagement forward push
5	noticeably compressed fast	Words compressed, high-density acoustic stream
6	blistering hyper-accelerated	Absolute human limit of linguistic speed

CHNK — Chunking (±1: 69.7%) Breath unit grouping and pause frequency.

Value	Short Tag	Description
0	severely fragmented syllables	Single syllables broken by massive gaps
1	very choppy stop-go	Short choppy two-word bursts
2	consistently shorter pedantic	Cautious, highly separated word groupings
3	naturally medium balanced	Natural sentence-sized breath groups
4	noticeably extended sweeping	Extended multi-sentence breath units
5	very long dense	Very long dense streams with brief inhalations
6	massive continuous unbroken	Continuous wall of words, no pauses

SMTH — Smoothness (±1: 88.6%) Timing regularity and transition fluidity.

Value	Short Tag	Description
0	completely chaotic spasmodic	Chaotic, spasmodic timing with random glitches
1	sharply detached staccato	Sharp staccato machine-gun-like delivery
2	uneven bumpy clumsy	Uneven, bumpy with micro-hesitations
3	standard naturally flexible	Standard natural rhythm with organic flexibility
4	noticeably consistent practiced	Consistent, practiced, professional timing
5	flowing silky legato	Silky legato with seamless transitions
6	mathematically perfect metronomic	Mathematically perfect, metronomic timing

VFLX — Velocity Flex (±1: 94.3%) Change in speech speed across the clip.

Value	Short Tag	Description
0	massive deceleration grinding	Massive deceleration, grinding to a halt
1	heavy sustained slowdown	Heavy, sustained slowdown
2	subtle natural easing	Subtle natural easing of pace
3	perfectly locked steady	Perfectly steady, locked tempo
4	subtle forward lean	Subtle forward acceleration
5	clear sustained acceleration	Clear sustained acceleration
6	extreme acceleration explosion	Extreme acceleration explosion

ARSH — Arousal Shift (±1: 88.6%) Change in autonomic energy across the clip.

Value	Short Tag	Description
0	total catastrophic collapse	From high activation to complete collapse
1	massive rapid de-escalation	Significant drop in energy
2	subtle gentle settling	Comforting settling of tension
3	completely static unchanging	Constant arousal level
4	subtle engaging perking	Light rising alertness
5	clear aggressive escalation	Aggressive energy escalation
6	violent explosive detonation	0-to-100 explosion into panic

DARC — Dynamic Arc (±1: 67.6%) Loudness trajectory across the clip.

Value	Short Tag	Description
0	total catastrophic fade	Fade from loud to inaudible
1	clear pronounced diminuendo	Clear diminuendo
2	gentle natural softening	Gentle natural softening
3	absolutely flatlined constant	Perfectly constant volume
4	controlled satisfying bell	Controlled bell-curve shape
5	clear aggressive crescendo	Aggressive crescendo
6	impossibly extreme violent	Extreme volume explosion

Prosody & Pitch (4 dimensions)

RANG — Pitch Range (±1: 76.7%) Vertical movement of fundamental frequency.

Value	Short Tag	Description
0	perfectly flat monotone	Zero pitch variation, pure drone
1	severely suppressed	Microscopic pitch movement
2	tightly restrained narrow	Narrow, controlled pitch window
3	naturally balanced	Standard conversational intonation
4	expressively wide	Wide, colorful pitch movement
5	highly melodic sweeping	Sweeping dramatic pitch jumps
6	wildly operatic extreme	Wild operatic pitch extremes

EMPH — Emphasis (±1: 74.3%) Word stress and informational hierarchy.

Value	Short Tag	Description
0	flat monotone	Every word identical weight
1	barely stressed	Nearly undetectable stress shifts
2	softly highlighted	Light, polite word highlighting
3	naturally stressed	Clear natural emphasis pattern
4	strongly marked	Strong acoustic hierarchy
5	aggressively punched	Aggressive percussive stress
6	violently explosive	Violent explosive emphasis

REGS — Register (±1: 88.6%) Vocal register from bass to soprano.

Value	Short Tag	Description
0	basso profondo extreme	Extreme low bass register
1	baritone grounded warm	Warm grounded baritone
2	tenor bright lifted	Bright lifted tenor
3	bridged contralto countertenor	Bridged middle register
4	mezzo-soprano balanced	Balanced mezzo-soprano
5	soprano bright brilliant	Bright brilliant soprano
6	coloratura whistle extreme	Extreme high whistle register

VOLT — Volatility (±1: 77.1%) Stability/instability of vocal parameters.

Value	Short Tag	Description
0	absolutely frozen static	Frozen, perfectly static voice
1	highly steady unshakeable	Rock-steady, unshakeable
2	stable minor shifts	Stable with minor shifts
3	natural organic breathing	Natural organic variation
4	minor emotional flickering	Emotional flickering
5	highly unstable jarring	Highly unstable and jarring
6	completely chaotic cycling	Completely chaotic cycling

Articulation & Fluency (4 dimensions)

CLRT — Articulation (±1: 71.4%) Consonant/vowel clarity and precision.

Value	Short Tag	Description
0	indecipherable blurry hum	Indecipherable blurry hum
1	severely swallowed mumbled	Severely mumbled
2	consistently soft relaxed	Soft, relaxed articulation
3	neutral standard clear	Standard clear speech
4	crisp distinct professional	Crisp professional clarity
5	incredibly precise deliberate	Incredibly precise diction
6	hyper-articulated exaggerated	Hyper-articulated, exaggerated

DFLU — Disfluency (±1: 82.9%) Fillers, false starts, self-corrections.

Value	Short Tag	Description
0	pristine flawless perfect	Pristine, zero hesitations
1	highly polished professional	Highly polished delivery
2	highly fluent organic	Fluent with tiny organic imperfections
3	standard natural baseline	Standard natural filler rate
4	noticeably hesitant staggered	Noticeably hesitant and staggered
5	overwhelmingly messy chaotic	Overwhelmingly messy delivery
6	entirely shattered incoherent	Shattered, incoherent speech

ATCK — Attack (±1: 97.1%) Onset quality of phonation.

Value	Short Tag	Description
0	ghostly imperceptible fade	Ghostly fade-in onset
1	breathy diffused onset	Breathy, diffused onset
2	soft polite gentle	Soft, gentle onset
3	neutral standard balanced	Neutral balanced onset
4	hard clear square	Hard, square onset
5	violent percussive bark	Violent percussive bark
6	explosive glottal slam	Explosive glottal slam

COGL — Cognitive Load (±1: 60.0%) Mental processing effort audible in delivery.

Value	Short Tag	Description
0	perfectly fluid effortless	Perfectly fluid, zero effort
1	highly articulate efficient	Highly efficient processing
2	standard healthy natural	Standard healthy delivery
3	noticeable active searching	Noticeable word-searching
4	heavily burdened struggling	Heavily burdened, struggling
5	severely overloaded fracturing	Severely overloaded
6	catastrophically overwhelmed breakdown	Complete cognitive breakdown

Voice Quality (7 dimensions)

ROUG — Roughness (±1: 75.9%) Vocal fold irregularity and texture.

Value	Short Tag	Description
0	impossibly smooth pure	Impossibly smooth, pure tone
1	exceptionally velvety consistent	Velvety, consistent texture
2	standard healthy texture	Standard healthy vocal texture
3	distinctly grainy modulated	Distinctly grainy
4	heavily raspy weathered	Heavily raspy and weathered
5	aggressively harsh growling	Aggressively harsh growling
6	violently shredding chaotic	Violently shredding, chaotic

TENS — Tension (±1: 75.0%) Muscular tension in the vocal apparatus.

Value	Short Tag	Description
0	completely relaxed floppy	Completely relaxed, floppy
1	loose comfortable warmth	Loose, comfortable warmth
2	neutral conversational firmness	Neutral conversational firmness
3	mild muscular edge	Mild muscular edge
4	pressed highly restricted	Pressed, highly restricted
5	heavily strained grinding	Heavily strained, grinding
6	rigidly locked strangled	Rigidly locked, strangled

BRGT — Brightness (±1: 79.3%) High-frequency spectral energy.

Value	Short Tag	Description
0	completely muffled dark	Completely muffled, dark
1	severely reduced woolly	Severely reduced, woolly
2	gently attenuated warm	Gently warm, attenuated highs
3	perfectly neutral balanced	Perfectly balanced spectrum
4	well-defined crisp modern	Crisp, well-defined presence
5	heavily emphasized brilliant	Brilliant, heavily emphasized highs
6	overwhelmingly harsh piercing	Overwhelmingly harsh, piercing

WARM — Warmth (±1: 89.3%) Low-mid frequency richness and body.

Value	Short Tag	Description
0	surgically sterile cold	Surgically cold, sterile
1	tinny hollow top-heavy	Tinny, hollow
2	neutral functional cool	Neutral, functional
3	perfectly balanced baseline	Perfectly balanced
4	pleasant woody cozy	Pleasant, woody warmth
5	rich velvety late-night	Rich velvety late-night tone
6	overwhelmingly melting enveloping	Overwhelmingly enveloping warmth

FULL — Fullness/Body (±1: 61.5%) Spectral width and harmonic density.

Value	Short Tag	Description
0	paper-thin sliver	Paper-thin sliver of sound
1	highly restricted narrow	Highly restricted, narrow
2	slender lightweight	Slender, lightweight
3	naturally healthy	Natural healthy body
4	incredibly rich wide	Incredibly rich and wide
5	massive commanding	Massive, commanding
6	imax overwhelming	IMAX-like overwhelming body

HARM — Harmonicity (±1: 71.0%) Harmonic-to-noise ratio and tonal purity.

Value	Short Tag	Description
0	pure white noise	Pure noise, no harmonics
1	ghostly whisper	Ghostly whisper
2	breathy diffused	Breathy, diffused harmonics
3	naturally balanced	Naturally balanced
4	highly resonant focused	Highly resonant, focused
5	bell-like pure tonal	Bell-like pure tone
6	digitally synthetic	Digitally synthetic purity

METL — Metallicness (±1: 82.4%) Metallic/ringing spectral quality.

Value	Short Tag	Description
0	organic impossibly soft	Organic, impossibly soft
1	earthy negligible ping	Earthy, negligible ring
2	standard organic	Standard organic texture
3	subtly firm metallic	Subtle metallic edge
4	clearly clanging ring	Clearly metallic ring
5	fiercely piercing clanging	Fiercely piercing, clanging
6	pure robotic steel	Pure robotic steel

Resonance (7 dimensions)

Speaker Attributes (5 dimensions)

Delivery & Structure (5 dimensions)

Style Dimensions (15 dimensions)

Recording & Content (4 dimensions)

Training Details

Data source: Audio samples from voice taxonomy bucket reports, validated by Qwen3-Omni audio model
Encoder features: BUD-E-Whisper V1.0 + V1.1 mean-pooled temporal halves → 3072-dim
Dimensionality reduction: Per-dimension PCA from 3072 → 96 components
Classifier: 1-hidden-layer MLP with ReLU (96 → 96 → n_classes)
Training: Adam optimizer, lr=1e-3, max 200 epochs with early stopping (patience=20)
Loss: Cross-entropy with inverse-frequency class weighting
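Validation: Stratified holdout (5 samples per class, so at most 35 validation samples for a 7-class dimension)
Total parameters: ~570K trainable parameters across all 57 classifiers, plus 57 fitted PCA projections (3072 → 96 each)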
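Two of the training details, the stratified holdout and the inverse-frequency class weights, can be sketched as below. The function names, the fixed seed, and the normalization `N / (K * n_c)` are assumptions. The source states only "5 per class" and "inverse-frequency class weighting".

```python
import random
from collections import Counter, defaultdict
from typing import List, Sequence, Tuple


def stratified_holdout(
    labels: Sequence[int], per_class: int = 5, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Hold out up to `per_class` random samples of every class for validation."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    val: List[int] = []
    for label in sorted(by_class):
        idxs = by_class[label][:]
        rng.shuffle(idxs)
        val.extend(idxs[:per_class])
    held_out = set(val)
    train = [i for i in range(len(labels)) if i not in held_out]
    return train, sorted(val)


def inverse_frequency_weights(labels: Sequence[int], n_classes: int) -> List[float]:
    """Weight each class by N / (K * n_c); classes with no samples get 0."""
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (n_classes * counts[c]) if counts[c] else 0.0
        for c in range(n_classes)
    ]


# Toy, imbalanced label set with three classes.
labels = [0] * 40 + [1] * 10 + [2] * 20
train_idx, val_idx = stratified_holdout(labels, per_class=5)
train_labels = [labels[i] for i in train_idx]

weights = inverse_frequency_weights(train_labels, n_classes=3)
print(len(train_idx), len(val_idx))       # 55 15
print([round(w, 3) for w in weights])     # [0.524, 3.667, 1.222]
```

With this normalization the rarest class gets the largest weight and the weighted sample count equals the real one. Weights of this kind are typically passed to the cross-entropy loss so that frequent buckets do not dominate training.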

File Structure

Voice-Taxonomy-57/
├── README.md                      # This file
├── inference.py                   # Standalone inference pipeline
├── taxonomy_classifiers.pkl       # 57 MLP classifiers + PCA weights (~68 MB)
├── taxonomy_tags_short.json       # Short 2-3 word tags per bucket
├── taxonomy_tags_sentences.json   # 5-10 word sentence tags per bucket
├── taxonomy_dimensions.json       # Full dimension descriptions
├── config.json                    # Model configuration
└── requirements.txt               # Dependencies

Related Models

laion/BUD-E-Whisper — V1.0 encoder
laion/BUD-E-Whisper_V1.1 — V1.1 encoder
laion/Empathic-Insight-Voice-Plus — 55 emotion + 4 quality dimensions (shares V1.0 encoder)

Citation

@misc{voice-taxonomy-57,
  title={Voice-Taxonomy-57: 57-Dimension Voice Taxonomy Classifier},
  author={LAION},
  year={2025},
  url={https://huggingface.co/laion/Voice-Taxonomy-57}
}

License

Apache 2.0
