Voice-Taxonomy-57

57-dimension voice taxonomy classifier that analyzes speech audio across temporal dynamics, prosody, voice quality, resonance, style, and paralinguistic features.

Each dimension predicts a value on a 0–6 scale and maps it to human-readable tags. Averaged over all 57 dimensions, the model reaches 76.2% within-1-class (±1) accuracy, meaning the prediction is at most one step away from the reference label.

Quick Start

from inference import VoiceTaxonomy57

# Load model
model = VoiceTaxonomy57.from_pretrained("laion/Voice-Taxonomy-57")

# Predict from audio file
result = model.predict("speech.wav")

# Print all 57 dimension predictions
for dim, pred in sorted(result.items()):
    print(f"{dim}: {pred['value']} - {pred['tag_short']} ({pred['confidence']:.0%})")

# Get compact tag string
print(model.format_tags(result, format="short"))
# → "peak adult vigor, baseline alert present, completely static unchanging, ..."

# Batch inference
results = model.predict(["a.wav", "b.mp3", "c.flac"], batch_size=16)
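The dictionary returned by `predict` can be filtered with plain Python. The sketch below assumes each entry carries the `value`, `tag_short`, and `confidence` keys used in the loop above; the helper `confident_tags` and the sample values are hypothetical and not part of `inference.py`.

```python
def confident_tags(result, min_confidence=0.5):
    """Return (dimension, value, short tag) triples at or above a confidence threshold.

    `result` is assumed to map dimension codes to dicts with
    'value', 'tag_short' and 'confidence' keys, as in the loop above.
    """
    return [
        (dim, pred["value"], pred["tag_short"])
        for dim, pred in sorted(result.items())
        if pred["confidence"] >= min_confidence
    ]


# Hypothetical predictions for three dimensions (illustrative values only).
example = {
    "TEMP": {"value": 3, "tag_short": "standard conversational", "confidence": 0.62},
    "AROU": {"value": 3, "tag_short": "baseline alert", "confidence": 0.41},
    "AGEV": {"value": 3, "tag_short": "peak adult", "confidence": 0.77},
}

for dim, value, tag in confident_tags(example, min_confidence=0.5):
    print(f"{dim}: {value} ({tag})")
```

A threshold like this is one way to keep only the tags the classifier is relatively sure about before building a caption or search index.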

CLI Usage

# JSON output
python inference.py --input audio_folder/ --output results.json --batch-size 16

# Human-readable tags
python inference.py --input speech.wav --format tags-short

# CPU inference
python inference.py --input speech.wav --device cpu --fp32

Architecture

Audio (any format) → ffmpeg decode → 16kHz mono
  → WhisperFeatureExtractor → mel spectrogram
  → BUD-E-Whisper V1.0 encoder → [B, 1500, 768]
  → BUD-E-Whisper V1.1 encoder → [B, 1500, 768]
  → Duration-aware frame truncation
  → Split first half / second half
  → Mean-pool each half per encoder
  → Concatenate: [V1.1_first, V1.1_second, V1.0_first, V1.0_second] → [B, 3072]
  → Per-dimension PCA(96) → [B, 96]
  → Per-dimension MLP(96 → 96 → n_classes) → prediction (0–6)
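The first step of the diagram, decoding arbitrary audio to 16 kHz mono, might look roughly like the following. This is a sketch and not the code in `inference.py`: the function names are hypothetical, and it assumes `ffmpeg` is on the PATH and that raw 32-bit float output is acceptable.

```python
import subprocess

import numpy as np

SAMPLE_RATE = 16000  # target rate from the diagram above


def ffmpeg_decode_command(path, sample_rate=SAMPLE_RATE):
    """Build an ffmpeg command that writes mono float32 PCM to stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", path,
        "-f", "f32le",            # raw little-endian 32-bit float samples
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the target rate
        "-",                      # write to stdout instead of a file
    ]


def pcm_bytes_to_waveform(raw):
    """Interpret raw f32le bytes as a 1-D float32 waveform."""
    return np.frombuffer(raw, dtype="<f4")


def decode_audio(path, sample_rate=SAMPLE_RATE):
    """Decode any ffmpeg-readable file to a mono waveform (needs ffmpeg installed)."""
    completed = subprocess.run(
        ffmpeg_decode_command(path, sample_rate),
        capture_output=True,
        check=True,
    )
    return pcm_bytes_to_waveform(completed.stdout)
```

With a waveform from `decode_audio`, the clip duration in seconds is simply `len(waveform) / SAMPLE_RATE`.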

Encoder: Two fine-tuned Whisper-small encoders (BUD-E-Whisper V1.0 and V1.1) produce complementary voice representations.

Feature extraction: Each encoder's hidden states are split at the temporal midpoint and mean-pooled, yielding 4 vectors of 768 dims each (3072 total). This captures both early and late voice characteristics.
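The split-and-pool step can be sketched in NumPy as follows. The function names are hypothetical, and the rule for how many frames count as valid (50 encoder frames per second of audio, since Whisper encoders emit 1500 frames for a 30-second window) is an assumption about what "duration-aware truncation" means here, not something taken from `inference.py`.

```python
import math

import numpy as np

FRAMES_PER_SECOND = 50  # 1500 encoder frames cover a 30 s Whisper window


def pool_halves(hidden, duration_s):
    """Mean-pool the first and second temporal halves of one clip.

    hidden: array of shape [1500, 768] from one encoder.
    duration_s: clip length in seconds, used to drop padded frames.
    The frame-count rule below is an assumption for illustration.
    """
    n_valid = math.ceil(duration_s * FRAMES_PER_SECOND)
    n_valid = max(2, min(hidden.shape[0], n_valid))  # keep both halves non-empty
    valid = hidden[:n_valid]
    mid = n_valid // 2
    return valid[:mid].mean(axis=0), valid[mid:].mean(axis=0)


def build_features(v11_hidden, v10_hidden, duration_s):
    """Concatenate [V1.1_first, V1.1_second, V1.0_first, V1.0_second] -> [3072]."""
    v11_first, v11_second = pool_halves(v11_hidden, duration_s)
    v10_first, v10_second = pool_halves(v10_hidden, duration_s)
    return np.concatenate([v11_first, v11_second, v10_first, v10_second])


# Random stand-ins for the two encoder outputs of a 4-second clip.
rng = np.random.default_rng(0)
v10 = rng.standard_normal((1500, 768)).astype(np.float32)
v11 = rng.standard_normal((1500, 768)).astype(np.float32)
features = build_features(v11, v10, duration_s=4.0)
print(features.shape)  # (3072,)
```

Pooling the halves separately keeps a coarse notion of time, which is presumably what lets the shift dimensions (such as ARSH or VFLX) compare the start of a clip with its end.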

Classification: For each dimension, PCA reduces the 3072-dim feature vector to 96 dims, then a small MLP (96 → 96 with ReLU → n_classes) produces class logits. Total: ~57 × 10K ≈ 570K trainable parameters.
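To make the parameter estimate concrete, here is a small NumPy sketch of one classification head with random weights. It assumes the PCA step centers the features before projecting and that the reported confidence is the softmax probability of the predicted class; both are assumptions, and the function names are hypothetical.

```python
import numpy as np


def mlp_param_count(in_dim=96, hidden=96, n_classes=7):
    """Weights plus biases of a 96 -> 96 -> n_classes MLP."""
    return (in_dim * hidden + hidden) + (hidden * n_classes + n_classes)


def classify(features, pca_mean, pca_components, w1, b1, w2, b2):
    """Project a [3072] feature vector with PCA, then run the MLP head.

    Returns (predicted class, softmax probability of that class).
    """
    z = (features - pca_mean) @ pca_components.T  # [96]
    h = np.maximum(z @ w1 + b1, 0.0)              # hidden layer with ReLU
    logits = h @ w2 + b2                          # [n_classes]
    probs = np.exp(logits - logits.max())         # numerically stable softmax
    probs /= probs.sum()
    return int(probs.argmax()), float(probs.max())


# Random weights standing in for one trained dimension head.
rng = np.random.default_rng(0)
n_classes = 7
value, confidence = classify(
    features=rng.standard_normal(3072),
    pca_mean=np.zeros(3072),
    pca_components=rng.standard_normal((96, 3072)) / np.sqrt(3072),
    w1=rng.standard_normal((96, 96)) * 0.1,
    b1=np.zeros(96),
    w2=rng.standard_normal((96, n_classes)) * 0.1,
    b2=np.zeros(n_classes),
)
print(value, round(confidence, 3))
print(mlp_param_count())       # 9991 parameters for a 7-class head
print(57 * mlp_param_count())  # 569487, i.e. roughly 570K
```

The count covers only the MLP weights and biases; each 96 × 3072 PCA projection is fitted rather than trained by gradient descent.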

Integration with Empathic-Insight-Voice-Plus

Both pipelines share the BUD-E-Whisper V1.0 encoder. To avoid running it twice:

import torch
from transformers import WhisperModel, WhisperFeatureExtractor
from inference import VoiceTaxonomy57

# Load taxonomy model without auto-loading encoders
taxonomy = VoiceTaxonomy57.from_pretrained("laion/Voice-Taxonomy-57", load_encoders=False)

# Load the shared V1.0 encoder (used by both pipelines) and the V1.1 encoder (taxonomy only)
v10 = WhisperModel.from_pretrained("laion/BUD-E-Whisper").encoder.cuda().eval()
v11 = WhisperModel.from_pretrained("laion/BUD-E-Whisper_V1.1").encoder.cuda().eval()

# Run encoders once (`waveforms` is a list of 16 kHz mono float arrays, one per clip)
fe = WhisperFeatureExtractor.from_pretrained("laion/BUD-E-Whisper")
mel = fe(waveforms, sampling_rate=16000, return_tensors="pt").input_features.cuda()
with torch.no_grad():
    v10_hidden = v10(mel).last_hidden_state
    v11_hidden = v11(mel).last_hidden_state

# Feed V1.0 hidden states to Empathic-Insight-Voice-Plus emotion pipeline
# (55 emotion MLPs + 4 quality MLPs + whisper decoder for captions)
# ... emotion_results = empathic_model.predict_from_encoder(v10_hidden) ...

# Feed both to taxonomy pipeline (no re-encoding needed)
taxonomy_results = taxonomy.predict_from_encoder_outputs(
    v10_hidden_states=v10_hidden,
    v11_hidden_states=v11_hidden,
    durations=[len(wf) / 16000 for wf in waveforms],
)
# Combined: 55 emotions + 4 quality + 57 taxonomy = 116 voice annotations

Performance

Overall

| Metric | Value |
|---|---|
| Dimensions | 57 |
| Mean exact accuracy | 47.6% |
| Mean ±1 accuracy | 76.2% |
| Tier A (±1 ≥ 85%) | 15 dims |
| Tier B (±1 ≥ 70%) | 26 dims |
| Tier C (±1 ≥ 55%) | 14 dims |
| Tier D (±1 < 55%) | 2 dims |
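The two accuracy figures and the tier letters can be reproduced from label lists with a few lines of Python. This is a sketch with hypothetical function names and made-up labels; the tier thresholds are the ones listed in the table above.

```python
def exact_accuracy(y_true, y_pred):
    """Fraction of predictions equal to the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions at most one class away from the reference."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def tier(within_one):
    """Map a ±1 accuracy in [0, 1] to the tier letters used above."""
    if within_one >= 0.85:
        return "A"
    if within_one >= 0.70:
        return "B"
    if within_one >= 0.55:
        return "C"
    return "D"


# Made-up labels for one dimension on the 0-6 scale.
y_true = [0, 1, 2, 3, 4, 5, 6]
y_pred = [0, 2, 4, 3, 3, 5, 3]

exact = exact_accuracy(y_true, y_pred)
pm1 = within_one_accuracy(y_true, y_pred)
print(f"exact={exact:.1%}  ±1={pm1:.1%}  tier={tier(pm1)}")
# exact=42.9%  ±1=71.4%  tier=B
```

Because the classes are ordinal, the ±1 figure credits near misses that exact accuracy counts as errors.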

Per-Dimension Accuracy

| Dim | Name | Classes | Exact | ±1 | Tier |
|---|---|---|---|---|---|
| AGEV | Age of Voice | 7 | 25.7% | 68.6% | C |
| AROU | Arousal | 7 | 54.3% | 82.9% | B |
| ARSH | Arousal Shift | 7 | 62.9% | 88.6% | A |
| ATCK | Attack | 7 | 60.0% | 97.1% | A |
| BKGN | Background Noise | 7 | 37.1% | 68.6% | C |
| BRGT | Brightness | 6 | 48.3% | 79.3% | B |
| CHNK | Chunking | 7 | 42.4% | 69.7% | C |
| CLRT | Articulation | 7 | 40.0% | 71.4% | B |
| COGL | Cognitive Load | 7 | 40.0% | 60.0% | C |
| DARC | Dynamic Arc | 7 | 55.9% | 67.6% | C |
| DFLU | Disfluency | 7 | 51.4% | 82.9% | B |
| EMPH | Emphasis | 7 | 51.4% | 74.3% | B |
| ESTH | Aesthetic Quality | 7 | 48.6% | 85.7% | A |
| EXPL | Explicitness | 7 | 28.6% | 77.1% | B |
| FOCS | Focus/Engagement | 7 | 42.9% | 80.0% | B |
| FULL | Fullness/Body | 7 | 23.1% | 61.5% | C |
| GEND | Gender Presentation | 7 | 42.9% | 71.4% | B |
| HARM | Harmonicity | 7 | 54.8% | 71.0% | B |
| METL | Metallicness | 7 | 52.9% | 82.4% | B |
| RANG | Pitch Range | 6 | 46.7% | 76.7% | B |
| RCQL | Recording Quality | 7 | 22.9% | 48.6% | D |
| REGS | Register | 7 | 65.7% | 88.6% | A |
| RESP | Respiratory Audibility | 7 | 57.1% | 85.7% | A |
| ROUG | Roughness | 6 | 51.7% | 75.9% | B |
| R_CHST | Chest Resonance | 7 | 67.6% | 88.2% | A |
| R_HEAD | Head Resonance | 7 | 29.4% | 73.5% | B |
| R_MASK | Mask Resonance | 7 | 68.8% | 90.6% | A |
| R_MIXD | Mixed Resonance | 6 | 36.7% | 66.7% | C |
| R_NASL | Nasal Resonance | 6 | 45.8% | 75.0% | B |
| R_ORAL | Oral Resonance | 7 | 37.5% | 56.2% | C |
| R_THRT | Throat Resonance | 7 | 48.6% | 77.1% | B |
| SMTH | Smoothness | 7 | 57.1% | 88.6% | A |
| STNC | Stance | 7 | 42.9% | 74.3% | B |
| STRU | Structure | 7 | 48.4% | 90.3% | A |
| S_ASMR | ASMR Style | 7 | 56.2% | 75.0% | B |
| S_AUTH | Authoritative Style | 7 | 64.7% | 97.1% | A |
| S_CART | Cartoon Style | 7 | 45.7% | 85.7% | A |
| S_CASU | Casual Style | 7 | 31.0% | 58.6% | C |
| S_CONV | Conversational Style | 7 | 48.5% | 84.8% | B |
| S_DRAM | Dramatic Style | 7 | 25.7% | 42.9% | D |
| S_FORM | Formality | 7 | 60.6% | 90.9% | A |
| S_MONO | Monologue Style | 7 | 62.9% | 80.0% | B |
| S_NARR | Narration Style | 7 | 41.9% | 80.6% | B |
| S_NEWS | Newscaster Style | 7 | 42.9% | 77.1% | B |
| S_PLAY | Playfulness | 7 | 31.4% | 60.0% | C |
| S_RANT | Rant Style | 7 | 54.3% | 85.7% | A |
| S_STRY | Storytelling Style | 7 | 60.6% | 84.8% | B |
| S_TECH | Technical Style | 7 | 58.1% | 83.9% | B |
| S_WHIS | Whisper Style | 7 | 44.1% | 61.8% | C |
| TEMP | Tempo | 7 | 42.4% | 63.6% | C |
| TENS | Tension | 7 | 56.2% | 75.0% | B |
| VALN | Valence | 7 | 60.0% | 77.1% | B |
| VALS | Valence Shift | 7 | 28.6% | 62.9% | C |
| VFLX | Velocity Flex | 7 | 57.1% | 94.3% | A |
| VOLT | Volatility | 7 | 57.1% | 77.1% | B |
| VULN | Vulnerability | 7 | 31.4% | 57.1% | C |
| WARM | Warmth | 7 | 60.7% | 89.3% | A |

All 57 Dimensions

Temporal Dynamics (6 dimensions)

TEMP - Tempo (±1: 63.6%): Mechanical speed of word/syllable production.

| Value | Short Tag | Description |
|---|---|---|
| 0 | glacially slow | Glacially slow, syllables stretched to breaking point |
| 1 | heavily deliberate | Unusually deliberate and labored word production |
| 2 | relaxed unhurried | Relaxed, slightly below conversational average |
| 3 | standard conversational | Standard everyday speech tempo |
| 4 | brisk elevated | Brisk, high-engagement forward push |
| 5 | noticeably compressed fast | Words compressed, high-density acoustic stream |
| 6 | blistering hyper-accelerated | Absolute human limit of linguistic speed |

CHNK - Chunking (±1: 69.7%): Breath unit grouping and pause frequency.

| Value | Short Tag | Description |
|---|---|---|
| 0 | severely fragmented syllables | Single syllables broken by massive gaps |
| 1 | very choppy stop-go | Short choppy two-word bursts |
| 2 | consistently shorter pedantic | Cautious, highly separated word groupings |
| 3 | naturally medium balanced | Natural sentence-sized breath groups |
| 4 | noticeably extended sweeping | Extended multi-sentence breath units |
| 5 | very long dense | Very long dense streams with brief inhalations |
| 6 | massive continuous unbroken | Continuous wall of words, no pauses |

SMTH - Smoothness (±1: 88.6%): Timing regularity and transition fluidity.

| Value | Short Tag | Description |
|---|---|---|
| 0 | completely chaotic spasmodic | Chaotic, spasmodic timing with random glitches |
| 1 | sharply detached staccato | Sharp staccato machine-gun-like delivery |
| 2 | uneven bumpy clumsy | Uneven, bumpy with micro-hesitations |
| 3 | standard naturally flexible | Standard natural rhythm with organic flexibility |
| 4 | noticeably consistent practiced | Consistent, practiced, professional timing |
| 5 | flowing silky legato | Silky legato with seamless transitions |
| 6 | mathematically perfect metronomic | Mathematically perfect, metronomic timing |

VFLX - Velocity Flex (±1: 94.3%): Change in speech speed across the clip.

| Value | Short Tag | Description |
|---|---|---|
| 0 | massive deceleration grinding | Massive deceleration, grinding to a halt |
| 1 | heavy sustained slowdown | Heavy, sustained slowdown |
| 2 | subtle natural easing | Subtle natural easing of pace |
| 3 | perfectly locked steady | Perfectly steady, locked tempo |
| 4 | subtle forward lean | Subtle forward acceleration |
| 5 | clear sustained acceleration | Clear sustained acceleration |
| 6 | extreme acceleration explosion | Extreme acceleration explosion |

ARSH - Arousal Shift (±1: 88.6%): Change in autonomic energy across the clip.

| Value | Short Tag | Description |
|---|---|---|
| 0 | total catastrophic collapse | From high activation to complete collapse |
| 1 | massive rapid de-escalation | Significant drop in energy |
| 2 | subtle gentle settling | Comforting settling of tension |
| 3 | completely static unchanging | Constant arousal level |
| 4 | subtle engaging perking | Light rising alertness |
| 5 | clear aggressive escalation | Aggressive energy escalation |
| 6 | violent explosive detonation | 0-to-100 explosion into panic |

DARC - Dynamic Arc (±1: 67.6%): Loudness trajectory across the clip.

| Value | Short Tag | Description |
|---|---|---|
| 0 | total catastrophic fade | Fade from loud to inaudible |
| 1 | clear pronounced diminuendo | Clear diminuendo |
| 2 | gentle natural softening | Gentle natural softening |
| 3 | absolutely flatlined constant | Perfectly constant volume |
| 4 | controlled satisfying bell | Controlled bell-curve shape |
| 5 | clear aggressive crescendo | Aggressive crescendo |
| 6 | impossibly extreme violent | Extreme volume explosion |

Prosody & Pitch (4 dimensions)

RANG - Pitch Range (±1: 76.7%): Vertical movement of fundamental frequency.

| Value | Short Tag | Description |
|---|---|---|
| 0 | perfectly flat monotone | Zero pitch variation, pure drone |
| 1 | severely suppressed | Microscopic pitch movement |
| 2 | tightly restrained narrow | Narrow, controlled pitch window |
| 3 | naturally balanced | Standard conversational intonation |
| 4 | expressively wide | Wide, colorful pitch movement |
| 5 | highly melodic sweeping | Sweeping dramatic pitch jumps |
| 6 | wildly operatic extreme | Wild operatic pitch extremes |

EMPH - Emphasis (±1: 74.3%): Word stress and informational hierarchy.

| Value | Short Tag | Description |
|---|---|---|
| 0 | flat monotone | Every word identical weight |
| 1 | barely stressed | Nearly undetectable stress shifts |
| 2 | softly highlighted | Light, polite word highlighting |
| 3 | naturally stressed | Clear natural emphasis pattern |
| 4 | strongly marked | Strong acoustic hierarchy |
| 5 | aggressively punched | Aggressive percussive stress |
| 6 | violently explosive | Violent explosive emphasis |

REGS - Register (±1: 88.6%): Vocal register from bass to soprano.

| Value | Short Tag | Description |
|---|---|---|
| 0 | basso profondo extreme | Extreme low bass register |
| 1 | baritone grounded warm | Warm grounded baritone |
| 2 | tenor bright lifted | Bright lifted tenor |
| 3 | bridged contralto countertenor | Bridged middle register |
| 4 | mezzo-soprano balanced | Balanced mezzo-soprano |
| 5 | soprano bright brilliant | Bright brilliant soprano |
| 6 | coloratura whistle extreme | Extreme high whistle register |

VOLT - Volatility (±1: 77.1%): Stability/instability of vocal parameters.

| Value | Short Tag | Description |
|---|---|---|
| 0 | absolutely frozen static | Frozen, perfectly static voice |
| 1 | highly steady unshakeable | Rock-steady, unshakeable |
| 2 | stable minor shifts | Stable with minor shifts |
| 3 | natural organic breathing | Natural organic variation |
| 4 | minor emotional flickering | Emotional flickering |
| 5 | highly unstable jarring | Highly unstable and jarring |
| 6 | completely chaotic cycling | Completely chaotic cycling |

Articulation & Fluency (4 dimensions)

CLRT - Articulation (±1: 71.4%): Consonant/vowel clarity and precision.

| Value | Short Tag | Description |
|---|---|---|
| 0 | indecipherable blurry hum | Indecipherable blurry hum |
| 1 | severely swallowed mumbled | Severely mumbled |
| 2 | consistently soft relaxed | Soft, relaxed articulation |
| 3 | neutral standard clear | Standard clear speech |
| 4 | crisp distinct professional | Crisp professional clarity |
| 5 | incredibly precise deliberate | Incredibly precise diction |
| 6 | hyper-articulated exaggerated | Hyper-articulated, exaggerated |

DFLU - Disfluency (±1: 82.9%): Fillers, false starts, self-corrections.

| Value | Short Tag | Description |
|---|---|---|
| 0 | pristine flawless perfect | Pristine, zero hesitations |
| 1 | highly polished professional | Highly polished delivery |
| 2 | highly fluent organic | Fluent with tiny organic imperfections |
| 3 | standard natural baseline | Standard natural filler rate |
| 4 | noticeably hesitant staggered | Noticeably hesitant and staggered |
| 5 | overwhelmingly messy chaotic | Overwhelmingly messy delivery |
| 6 | entirely shattered incoherent | Shattered, incoherent speech |

ATCK - Attack (±1: 97.1%): Onset quality of phonation.

| Value | Short Tag | Description |
|---|---|---|
| 0 | ghostly imperceptible fade | Ghostly fade-in onset |
| 1 | breathy diffused onset | Breathy, diffused onset |
| 2 | soft polite gentle | Soft, gentle onset |
| 3 | neutral standard balanced | Neutral balanced onset |
| 4 | hard clear square | Hard, square onset |
| 5 | violent percussive bark | Violent percussive bark |
| 6 | explosive glottal slam | Explosive glottal slam |

COGL - Cognitive Load (±1: 60.0%): Mental processing effort audible in delivery.

| Value | Short Tag | Description |
|---|---|---|
| 0 | perfectly fluid effortless | Perfectly fluid, zero effort |
| 1 | highly articulate efficient | Highly efficient processing |
| 2 | standard healthy natural | Standard healthy delivery |
| 3 | noticeable active searching | Noticeable word-searching |
| 4 | heavily burdened struggling | Heavily burdened, struggling |
| 5 | severely overloaded fracturing | Severely overloaded |
| 6 | catastrophically overwhelmed breakdown | Complete cognitive breakdown |

Voice Quality (7 dimensions)

ROUG - Roughness (±1: 75.9%): Vocal fold irregularity and texture.

| Value | Short Tag | Description |
|---|---|---|
| 0 | impossibly smooth pure | Impossibly smooth, pure tone |
| 1 | exceptionally velvety consistent | Velvety, consistent texture |
| 2 | standard healthy texture | Standard healthy vocal texture |
| 3 | distinctly grainy modulated | Distinctly grainy |
| 4 | heavily raspy weathered | Heavily raspy and weathered |
| 5 | aggressively harsh growling | Aggressively harsh growling |
| 6 | violently shredding chaotic | Violently shredding, chaotic |

TENS - Tension (±1: 75.0%): Muscular tension in the vocal apparatus.

| Value | Short Tag | Description |
|---|---|---|
| 0 | completely relaxed floppy | Completely relaxed, floppy |
| 1 | loose comfortable warmth | Loose, comfortable warmth |
| 2 | neutral conversational firmness | Neutral conversational firmness |
| 3 | mild muscular edge | Mild muscular edge |
| 4 | pressed highly restricted | Pressed, highly restricted |
| 5 | heavily strained grinding | Heavily strained, grinding |
| 6 | rigidly locked strangled | Rigidly locked, strangled |

BRGT - Brightness (±1: 79.3%): High-frequency spectral energy.

| Value | Short Tag | Description |
|---|---|---|
| 0 | completely muffled dark | Completely muffled, dark |
| 1 | severely reduced woolly | Severely reduced, woolly |
| 2 | gently attenuated warm | Gently warm, attenuated highs |
| 3 | perfectly neutral balanced | Perfectly balanced spectrum |
| 4 | well-defined crisp modern | Crisp, well-defined presence |
| 5 | heavily emphasized brilliant | Brilliant, heavily emphasized highs |
| 6 | overwhelmingly harsh piercing | Overwhelmingly harsh, piercing |

WARM - Warmth (±1: 89.3%): Low-mid frequency richness and body.

| Value | Short Tag | Description |
|---|---|---|
| 0 | surgically sterile cold | Surgically cold, sterile |
| 1 | tinny hollow top-heavy | Tinny, hollow |
| 2 | neutral functional cool | Neutral, functional |
| 3 | perfectly balanced baseline | Perfectly balanced |
| 4 | pleasant woody cozy | Pleasant, woody warmth |
| 5 | rich velvety late-night | Rich velvety late-night tone |
| 6 | overwhelmingly melting enveloping | Overwhelmingly enveloping warmth |

FULL - Fullness/Body (±1: 61.5%): Spectral width and harmonic density.

| Value | Short Tag | Description |
|---|---|---|
| 0 | paper-thin sliver | Paper-thin sliver of sound |
| 1 | highly restricted narrow | Highly restricted, narrow |
| 2 | slender lightweight | Slender, lightweight |
| 3 | naturally healthy | Natural healthy body |
| 4 | incredibly rich wide | Incredibly rich and wide |
| 5 | massive commanding | Massive, commanding |
| 6 | imax overwhelming | IMAX-like overwhelming body |

HARM - Harmonicity (±1: 71.0%): Harmonic-to-noise ratio and tonal purity.

| Value | Short Tag | Description |
|---|---|---|
| 0 | pure white noise | Pure noise, no harmonics |
| 1 | ghostly whisper | Ghostly whisper |
| 2 | breathy diffused | Breathy, diffused harmonics |
| 3 | naturally balanced | Naturally balanced |
| 4 | highly resonant focused | Highly resonant, focused |
| 5 | bell-like pure tonal | Bell-like pure tone |
| 6 | digitally synthetic | Digitally synthetic purity |

METL - Metallicness (±1: 82.4%): Metallic/ringing spectral quality.

| Value | Short Tag | Description |
|---|---|---|
| 0 | organic impossibly soft | Organic, impossibly soft |
| 1 | earthy negligible ping | Earthy, negligible ring |
| 2 | standard organic | Standard organic texture |
| 3 | subtly firm metallic | Subtle metallic edge |
| 4 | clearly clanging ring | Clearly metallic ring |
| 5 | fiercely piercing clanging | Fiercely piercing, clanging |
| 6 | pure robotic steel | Pure robotic steel |

Resonance (7 dimensions)

R_CHST - Chest Resonance (±1: 88.2%) | 0: weightless | 1: faintest warmth | 2: mild chest | 3: balanced grounded | 4: pronounced chest | 5: massive booming | 6: floor-shaking sub-bass |

R_HEAD - Head Resonance (±1: 73.5%) | 0: body-locked | 1: microscopic lift | 2: gentle upper | 3: floating gentle | 4: airy lofty | 5: falsetto | 6: extreme whistle |

R_MASK - Mask Resonance (±1: 90.6%) | 0: dull dead | 1: microscopic focus | 2: present clarity | 3: pingy cutting | 4: metallic theatrical | 5: piercing drilled | 6: laser siren |

R_MIXD - Mixed Resonance (±1: 66.7%) | 0: unbalanced fractured | 1: struggling blend | 2: split un-integrated | 3: connected unified | 4: rich blended | 5: full-spectrum | 6: impossibly huge |

R_NASL - Nasal Resonance (±1: 75.0%) | 0: blocked denasal | 1: clean pure | 2: subtle twang | 3: obvious pinched | 4: deliberate pinched | 5: unpleasant piercing | 6: hypernasal bleating |

R_ORAL - Oral Resonance (±1: 56.2%) | 0: displaced non-oral | 1: vague unfocused | 2: mostly oral | 3: balanced neutral | 4: clear forward | 5: exaggerated open | 6: megaphone projection |

R_THRT - Throat Resonance (±1: 77.1%) | 0: open relaxed | 1: faint coloring | 2: pharyngeal pulled | 3: centered guttural | 4: pharyngeal choked | 5: strangled distorted | 6: imploded swallowed |

Speaker Attributes (5 dimensions)

AGEV - Age of Voice (±1: 68.6%) | 0: neonatal | 1: child | 2: adolescent | 3: peak adult | 4: matured | 5: middle aging | 6: extreme senescence |

GEND - Gender Presentation (±1: 71.4%) | 0: hyper-feminine soprano | 1: clearly feminine | 2: feminine alto | 3: androgynous | 4: masculine tenor | 5: standard masculine | 6: hyper-masculine bass |

AROU - Arousal (±1: 82.9%) | 0: comatose | 1: deeply sedate | 2: reserved controlled | 3: baseline alert | 4: elevated excited | 5: highly intense | 6: maximum screaming |

VALN - Valence (±1: 77.1%) | 0: pure suffering | 1: heavily saddened | 2: subtly negative | 3: neutral baseline | 4: politely warm | 5: cheerful happiness | 6: euphoric joy |

VALS - Valence Shift (±1: 62.9%) | 0: tragic plunge | 1: mood darkening | 2: subtle worry | 3: perfectly stable | 4: gentle brightening | 5: powerful uplift | 6: euphoric redemption |

Delivery & Structure (5 dimensions)

STRU - Structure (±1: 90.3%) | 0: scattered disconnected | 1: jumpy erratic | 2: loose rambling | 3: basic coherent | 4: linear organized | 5: organized signposted | 6: perfect masterclass |

STNC - Stance (±1: 74.3%) | 0: tiny submissive | 1: reserved cautious | 2: cooperative friendly | 3: neutral informative | 4: assertive confident | 5: authoritative commanding | 6: dominating aggressive |

FOCS - Focus/Engagement (±1: 80.0%) | 0: deeply dissociated | 1: emotionally detached | 2: distracted split | 3: politely engaged | 4: clearly engaged | 5: laser-focused | 6: obsessively hypnotic |

VULN - Vulnerability (±1: 57.1%) | 0: armored impenetrable | 1: guarded professional | 2: composed adult | 3: open accessible | 4: empathetic permeable | 5: completely raw | 6: profoundly naked |

RESP - Respiratory Audibility (±1: 85.7%) | 0: invisible | 1: controlled micro | 2: natural conversational | 3: noticeably elevated | 4: highly labored | 5: rapid gasping | 6: catastrophic failure |

Style Dimensions (15 dimensions)

S_AUTH - Authoritative (±1: 97.1%) | 0: trembling submissive | 1: weakly uncertain | 2: peer suggestion | 3: firm directive | 4: commanding forceful | 5: aggressive dominance | 6: terrifying assault |

S_FORM - Formality (±1: 90.9%) | 0: vulgar aggressive | 1: casual everyday | 2: service polite | 3: boardroom professional | 4: official rigid | 5: ceremonial grave | 6: royal imperial |

S_NARR - Narration (±1: 80.6%) | 0: stumbling confused | 1: child classroom | 2: dry academic | 3: warm engaging | 4: audiobook elite | 5: epic cinematic | 6: god voice authority |

S_STRY - Storytelling (±1: 84.8%) | 0: robotic data reader | 1: dry facts only | 2: casual bar story | 3: campfire storyteller | 4: professional voice actor | 5: epic fantasy bard | 6: hypnotic ancient bard |

S_DRAM - Dramatic (±1: 42.9%) | 0: deadpan bored | 1: bad flat acting | 2: cinematic subtle | 3: stage emotional | 4: shakespearean epic | 5: melodramatic soap | 6: operatic meltdown |

S_CART - Cartoon (±1: 85.7%) | 0: hyper-realistic | 1: completely natural | 2: party impression | 3: sitcom character | 4: wacky energetic | 5: zany chaotic | 6: looney tunes extreme |

S_NEWS - Newscaster (±1: 77.1%) | 0: rambling chaotic | 1: casual conversational | 2: youtube opinionated | 3: field reporter | 4: studio anchor | 5: gravitas authoritative | 6: 70s exaggerated |

S_TECH - Technical (±1: 83.9%) | 0: obscure jargon rush | 1: unclear rapid lecturing | 2: dry professor | 3: engaging tutorial host | 4: precise emergency manual | 5: condescending mansplain | 6: hyperactive preschool |

S_CONV - Conversational (±1: 84.8%) | 0: aggressive monologue | 1: detached uninteractive | 2: internal soliloquy | 3: balanced interactive | 4: warmly engaged | 5: eager checking | 6: manic extrovert |

S_CASU - Casual (±1: 58.6%) | 0: robotic rigid | 1: nervously stiff | 2: polite guarded | 3: relaxed friendly | 4: loose informal | 5: lazy mumbled | 6: drunk slacker |

S_RANT - Rant (±1: 85.7%) | 0: diplomatic soothing | 1: calm rational | 2: passive annoyed | 3: energetic venting | 4: heated aggressive | 5: political tirade | 6: manic rage |

S_PLAY - Playfulness (±1: 60.0%) | 0: grim funeral | 1: serious business | 2: deadpan ironic | 3: friendly warm | 4: joyful joking | 5: mischievous teasing | 6: whimsical ecstatic |

S_ASMR - ASMR (±1: 75.0%) | 0: stadium screaming | 1: public projection | 2: hushed intimate | 3: warm soothing | 4: pure whisper airy | 5: textural tingles | 6: hyper-intimate binaural |

S_WHIS - Whisper (±1: 61.8%) | 0: full-voiced projection | 1: polite indoor voice | 2: intimate breathy | 3: piercing stage whisper | 4: library whisper | 5: rasping batman growl | 6: ghostly ethereal |

S_MONO - Monologue (±1: 80.0%) | 0: interactive dialogue | 1: public formal address | 2: thinking aloud | 3: diary intimate | 4: theatrical tragic | 5: noir voiceover | 6: noir detective extreme |

Recording & Content (4 dimensions)

RCQL - Recording Quality (±1: 48.6%) | 0: severely corrupted | 1: amateur flawed | 2: below average | 3: standard consumer | 4: good prosumer | 5: broadcast pristine | 6: audiophile flawless |

BKGN - Background Noise (±1: 68.6%) | 0: absolute vacuum | 1: imperceptible quiet | 2: slight hiss | 3: noticeable ambience | 4: highly noisy | 5: chaotic overwhelming | 6: catastrophic masking |

ESTH - Aesthetic Quality (±1: 85.7%) | 0: viscerally repulsive | 1: deeply grating | 2: mundane boring | 3: acceptably standard | 4: genuinely pleasant | 5: star power radiant | 6: sublimely transcendent |

EXPL - Explicitness (±1: 77.1%) | 0: completely innocent | 1: mildly casual | 2: moderately mature | 3: distinctly restricted | 4: highly explicit | 5: extremely graphic | 6: deeply illegal |

Training Details

  • Data source: Audio samples from voice taxonomy bucket reports, validated by the Qwen3-Omni audio model
  • Encoder features: BUD-E-Whisper V1.0 + V1.1 mean-pooled temporal halves → 3072-dim
  • Dimensionality reduction: Per-dimension PCA from 3072 → 96 components
  • Classifier: 1-hidden-layer MLP with ReLU (96 → 96 → n_classes)
  • Training: Adam optimizer, lr=1e-3, max 200 epochs with early stopping (patience=20)
  • Loss: Cross-entropy with inverse-frequency class weighting
  • Validation: Stratified holdout (5 samples per class)
  • Total parameters: ~570K trainable parameters across all 57 classifiers, plus 57 PCA projection matrices
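Two of the training choices listed above, inverse-frequency class weighting and early stopping with a patience counter, can be sketched in plain Python. The exact weight normalization used for training is not stated in this card, so the "balanced" form below (total / (classes present × count)) is an assumption, and both names are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels, n_classes):
    """Per-class loss weights proportional to 1 / class frequency.

    Classes that never occur in `labels` get weight 0.0.
    The normalization is an assumption, not taken from the training code.
    """
    counts = Counter(labels)
    n_present = len(counts)
    total = len(labels)
    return [
        total / (n_present * counts[c]) if counts[c] else 0.0
        for c in range(n_classes)
    ]


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up training labels: class 3 dominates, classes 2 and 4 are rare.
weights = inverse_frequency_weights([3, 3, 3, 3, 2, 4], n_classes=7)
print(weights)  # [0.0, 0.0, 2.0, 0.5, 2.0, 0.0, 0.0]
```

In a PyTorch training loop, a list like `weights` could be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, with `EarlyStopping.step` called once per epoch on the validation loss.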

File Structure

Voice-Taxonomy-57/
├── README.md                      # This file
├── inference.py                   # Standalone inference pipeline
├── taxonomy_classifiers.pkl       # 57 MLP classifiers + PCA weights (~68 MB)
├── taxonomy_tags_short.json       # Short 2-3 word tags per bucket
├── taxonomy_tags_sentences.json   # 5-10 word sentence tags per bucket
├── taxonomy_dimensions.json       # Full dimension descriptions
├── config.json                    # Model configuration
└── requirements.txt               # Dependencies

Related Models

Citation

@misc{voice-taxonomy-57,
  title={Voice-Taxonomy-57: 57-Dimension Voice Taxonomy Classifier},
  author={LAION},
  year={2025},
  url={https://huggingface.co/laion/Voice-Taxonomy-57}
}

License

Apache 2.0
