Voice-Taxonomy-57
57-dimension voice taxonomy classifier that analyzes speech audio across temporal dynamics, prosody, voice quality, resonance, style, and paralinguistic features.
Each dimension predicts a value on a 0–6 scale with human-readable tags. The model achieves a mean within-1-class (±1) accuracy of 76.2% across all 57 dimensions.
Quick Start
from inference import VoiceTaxonomy57
# Load model
model = VoiceTaxonomy57.from_pretrained("laion/Voice-Taxonomy-57")
# Predict from audio file
result = model.predict("speech.wav")
# Print all 57 dimension predictions
for dim, pred in sorted(result.items()):
    print(f"{dim}: {pred['value']} - {pred['tag_short']} ({pred['confidence']:.0%})")
# Get compact tag string
print(model.format_tags(result, format="short"))
# → "peak adult vigor, baseline alert present, completely static unchanging, ..."
# Batch inference
results = model.predict(["a.wav", "b.mp3", "c.flac"], batch_size=16)
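Since every prediction carries a `confidence`, a caller may want to keep only the tags the model is reasonably sure about. The sketch below works on a hand-written dictionary that mimics the result structure used above (`value`, `tag_short`, `confidence`); the confidence numbers and the `confident_tags` helper are illustrative assumptions, not part of `inference.py`.

```python
# Mock of the per-dimension result structure: {dim: {"value", "tag_short", "confidence"}}.
# The confidence values are made up for illustration.
result = {
    "AGEV": {"value": 3, "tag_short": "peak adult vigor", "confidence": 0.62},
    "AROU": {"value": 3, "tag_short": "baseline alert present", "confidence": 0.48},
    "ARSH": {"value": 3, "tag_short": "completely static unchanging", "confidence": 0.91},
}


def confident_tags(result, min_confidence=0.5):
    """Return (dimension, short tag) pairs whose confidence is at least min_confidence."""
    return [
        (dim, pred["tag_short"])
        for dim, pred in sorted(result.items())
        if pred["confidence"] >= min_confidence
    ]


print(confident_tags(result))
```

The threshold of 0.5 is arbitrary; a sensible value depends on how the confidences are calibrated for a given dimension.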
CLI Usage
# JSON output
python inference.py --input audio_folder/ --output results.json --batch-size 16
# Human-readable tags
python inference.py --input speech.wav --format tags-short
# CPU inference
python inference.py --input speech.wav --device cpu --fp32
Architecture
Audio (any format) → ffmpeg decode → 16 kHz mono
→ WhisperFeatureExtractor → mel spectrogram
→ BUD-E-Whisper V1.0 encoder → [B, 1500, 768]
→ BUD-E-Whisper V1.1 encoder → [B, 1500, 768]
→ Duration-aware frame truncation
→ Split first half / second half
→ Mean-pool each half per encoder
→ Concatenate: [V1.1_first, V1.1_second, V1.0_first, V1.0_second] → [B, 3072]
→ Per-dimension PCA(96) → [B, 96]
→ Per-dimension MLP(96 → 96 → n_classes) → prediction (0–6)
Encoder: Two fine-tuned Whisper-small encoders (BUD-E-Whisper V1.0 and V1.1) produce complementary voice representations.
Feature extraction: Each encoder's hidden states are split at the temporal midpoint and mean-pooled, yielding 4 vectors of 768 dims each (3072 total). This captures both early and late voice characteristics.
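The pooling step can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the code in `inference.py`: it assumes that duration-aware truncation keeps roughly 50 encoder frames per second of audio (Whisper's encoder emits 1500 frames for its 30 s window), and the function names are hypothetical.

```python
import math

import numpy as np

FRAMES_PER_SECOND = 50  # Whisper encoder: 1500 frames for a 30 s window


def pooled_halves(hidden, duration_s):
    """hidden: [T, D] encoder output for one clip -> [2 * D] pooled vector."""
    # Assumed truncation rule: keep only frames that cover real audio (at least 2).
    n_valid = min(hidden.shape[0], max(2, math.ceil(duration_s * FRAMES_PER_SECOND)))
    valid = hidden[:n_valid]
    mid = n_valid // 2  # temporal midpoint
    return np.concatenate([valid[:mid].mean(axis=0), valid[mid:].mean(axis=0)])


def taxonomy_features(v11_hidden, v10_hidden, duration_s):
    """Concatenate [V1.1_first, V1.1_second, V1.0_first, V1.0_second]."""
    return np.concatenate([
        pooled_halves(v11_hidden, duration_s),
        pooled_halves(v10_hidden, duration_s),
    ])


# Random stand-ins for the two encoders' hidden states of a 4-second clip.
rng = np.random.default_rng(0)
v10 = rng.standard_normal((1500, 768)).astype(np.float32)
v11 = rng.standard_normal((1500, 768)).astype(np.float32)
features = taxonomy_features(v11, v10, duration_s=4.0)
print(features.shape)  # (3072,)
```

Truncating before pooling keeps the padded remainder of the 30 s window from diluting the averages for short clips.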
Classification: Per-dimension PCA reduces 3072 → 96 dims, then a small MLP (96 → 96 with ReLU → n_classes) produces class logits. Total: ~57 × 10K ≈ 570K trainable parameters.
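The parameter estimate can be checked with a few lines of arithmetic. The sketch assumes every head has 7 output classes (a handful of dimensions list 6 in the accuracy table, which lowers the total slightly) and counts only the PCA projection matrices, not any stored means.

```python
N_DIMS = 57


def mlp_params(n_in=96, n_hidden=96, n_classes=7):
    """Weights plus biases of a one-hidden-layer MLP."""
    return (n_in * n_hidden + n_hidden) + (n_hidden * n_classes + n_classes)


per_head = mlp_params()
total_mlp = N_DIMS * per_head
pca_weights = N_DIMS * 3072 * 96  # projection matrices, fitted rather than trained by gradient descent

print(per_head)     # 9991
print(total_mlp)    # 569487
print(pca_weights)  # 16809984
```

If the PCA matrices are stored as 32-bit floats (an assumption), they come to about 67 MB, which would account for most of the size of `taxonomy_classifiers.pkl`.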
Integration with Empathic-Insight-Voice-Plus
Both pipelines share the BUD-E-Whisper V1.0 encoder. To avoid running it twice:
import torch
from transformers import WhisperModel, WhisperFeatureExtractor
from inference import VoiceTaxonomy57

# Load taxonomy model without auto-loading encoders
taxonomy = VoiceTaxonomy57.from_pretrained("laion/Voice-Taxonomy-57", load_encoders=False)

# Load both encoders; V1.0 is the one shared with Empathic-Insight-Voice-Plus
v10 = WhisperModel.from_pretrained("laion/BUD-E-Whisper").encoder.cuda().eval()
v11 = WhisperModel.from_pretrained("laion/BUD-E-Whisper_V1.1").encoder.cuda().eval()

# Run encoders once
# `waveforms` is assumed to be a list of 1-D 16 kHz mono float arrays loaded elsewhere
fe = WhisperFeatureExtractor.from_pretrained("laion/BUD-E-Whisper")
mel = fe(waveforms, sampling_rate=16000, return_tensors="pt").input_features.cuda()
with torch.no_grad():
    v10_hidden = v10(mel).last_hidden_state
    v11_hidden = v11(mel).last_hidden_state

# Feed V1.0 hidden states to Empathic-Insight-Voice-Plus emotion pipeline
# (55 emotion MLPs + 4 quality MLPs + whisper decoder for captions)
# ... emotion_results = empathic_model.predict_from_encoder(v10_hidden) ...

# Feed both to taxonomy pipeline (no re-encoding needed)
taxonomy_results = taxonomy.predict_from_encoder_outputs(
    v10_hidden_states=v10_hidden,
    v11_hidden_states=v11_hidden,
    durations=[len(wf) / 16000 for wf in waveforms],  # clip lengths in seconds
)
# Combined: 55 emotions + 4 quality + 57 taxonomy = 116 voice annotations
Performance
Overall
| Metric | Value |
|---|---|
| Dimensions | 57 |
| Mean exact accuracy | 47.6% |
| Mean ±1 accuracy | 76.2% |
| Tier A (±1 ≥ 85%) | 15 dims |
| Tier B (±1 ≥ 70%) | 26 dims |
| Tier C (±1 ≥ 55%) | 14 dims |
| Tier D (±1 < 55%) | 2 dims |
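For reference, the two accuracy metrics and the tier thresholds in the table can be written out as below. The labels and predictions are toy values chosen for illustration; only the thresholds come from the table.

```python
def exact_accuracy(y_true, y_pred):
    """Fraction of predictions that hit the labelled class exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions at most one class away from the label (the ±1 metric)."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def tier(within_one):
    """Map a ±1 accuracy (0..1) to the tier letters used in the table."""
    if within_one >= 0.85:
        return "A"
    if within_one >= 0.70:
        return "B"
    if within_one >= 0.55:
        return "C"
    return "D"


# Toy labels and predictions on the 0-6 scale.
y_true = [0, 1, 2, 3, 4, 5, 6, 3]
y_pred = [0, 2, 2, 5, 4, 4, 6, 3]

print(exact_accuracy(y_true, y_pred))             # 0.625
print(within_one_accuracy(y_true, y_pred))        # 0.875
print(tier(within_one_accuracy(y_true, y_pred)))  # A
```

Because the classes are ordinal, the ±1 metric credits near misses that exact accuracy counts as errors.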
Per-Dimension Accuracy
| Dim | Name | Classes | Exact | ±1 | Tier |
|---|---|---|---|---|---|
| AGEV | Age of Voice | 7 | 25.7% | 68.6% | C |
| AROU | Arousal | 7 | 54.3% | 82.9% | B |
| ARSH | Arousal Shift | 7 | 62.9% | 88.6% | A |
| ATCK | Attack | 7 | 60.0% | 97.1% | A |
| BKGN | Background Noise | 7 | 37.1% | 68.6% | C |
| BRGT | Brightness | 6 | 48.3% | 79.3% | B |
| CHNK | Chunking | 7 | 42.4% | 69.7% | C |
| CLRT | Articulation | 7 | 40.0% | 71.4% | B |
| COGL | Cognitive Load | 7 | 40.0% | 60.0% | C |
| DARC | Dynamic Arc | 7 | 55.9% | 67.6% | C |
| DFLU | Disfluency | 7 | 51.4% | 82.9% | B |
| EMPH | Emphasis | 7 | 51.4% | 74.3% | B |
| ESTH | Aesthetic Quality | 7 | 48.6% | 85.7% | A |
| EXPL | Explicitness | 7 | 28.6% | 77.1% | B |
| FOCS | Focus/Engagement | 7 | 42.9% | 80.0% | B |
| FULL | Fullness/Body | 7 | 23.1% | 61.5% | C |
| GEND | Gender Presentation | 7 | 42.9% | 71.4% | B |
| HARM | Harmonicity | 7 | 54.8% | 71.0% | B |
| METL | Metallicness | 7 | 52.9% | 82.4% | B |
| RANG | Pitch Range | 6 | 46.7% | 76.7% | B |
| RCQL | Recording Quality | 7 | 22.9% | 48.6% | D |
| REGS | Register | 7 | 65.7% | 88.6% | A |
| RESP | Respiratory Audibility | 7 | 57.1% | 85.7% | A |
| ROUG | Roughness | 6 | 51.7% | 75.9% | B |
| R_CHST | Chest Resonance | 7 | 67.6% | 88.2% | A |
| R_HEAD | Head Resonance | 7 | 29.4% | 73.5% | B |
| R_MASK | Mask Resonance | 7 | 68.8% | 90.6% | A |
| R_MIXD | Mixed Resonance | 6 | 36.7% | 66.7% | C |
| R_NASL | Nasal Resonance | 6 | 45.8% | 75.0% | B |
| R_ORAL | Oral Resonance | 7 | 37.5% | 56.2% | C |
| R_THRT | Throat Resonance | 7 | 48.6% | 77.1% | B |
| SMTH | Smoothness | 7 | 57.1% | 88.6% | A |
| STNC | Stance | 7 | 42.9% | 74.3% | B |
| STRU | Structure | 7 | 48.4% | 90.3% | A |
| S_ASMR | ASMR Style | 7 | 56.2% | 75.0% | B |
| S_AUTH | Authoritative Style | 7 | 64.7% | 97.1% | A |
| S_CART | Cartoon Style | 7 | 45.7% | 85.7% | A |
| S_CASU | Casual Style | 7 | 31.0% | 58.6% | C |
| S_CONV | Conversational Style | 7 | 48.5% | 84.8% | B |
| S_DRAM | Dramatic Style | 7 | 25.7% | 42.9% | D |
| S_FORM | Formality | 7 | 60.6% | 90.9% | A |
| S_MONO | Monologue Style | 7 | 62.9% | 80.0% | B |
| S_NARR | Narration Style | 7 | 41.9% | 80.6% | B |
| S_NEWS | Newscaster Style | 7 | 42.9% | 77.1% | B |
| S_PLAY | Playfulness | 7 | 31.4% | 60.0% | C |
| S_RANT | Rant Style | 7 | 54.3% | 85.7% | A |
| S_STRY | Storytelling Style | 7 | 60.6% | 84.8% | B |
| S_TECH | Technical Style | 7 | 58.1% | 83.9% | B |
| S_WHIS | Whisper Style | 7 | 44.1% | 61.8% | C |
| TEMP | Tempo | 7 | 42.4% | 63.6% | C |
| TENS | Tension | 7 | 56.2% | 75.0% | B |
| VALN | Valence | 7 | 60.0% | 77.1% | B |
| VALS | Valence Shift | 7 | 28.6% | 62.9% | C |
| VFLX | Velocity Flex | 7 | 57.1% | 94.3% | A |
| VOLT | Volatility | 7 | 57.1% | 77.1% | B |
| VULN | Vulnerability | 7 | 31.4% | 57.1% | C |
| WARM | Warmth | 7 | 60.7% | 89.3% | A |
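One way to use these numbers downstream is to keep only the dimensions whose ±1 accuracy clears a chosen bar. The sketch below copies a few rows from the table into a dictionary; the `reliable_dims` helper and the 70% threshold are illustrative assumptions.

```python
# ±1 accuracy for a few dimensions, copied from the table above.
WITHIN_ONE = {
    "ATCK": 0.971,
    "AROU": 0.829,
    "TEMP": 0.636,
    "RCQL": 0.486,
    "S_DRAM": 0.429,
}


def reliable_dims(within_one, threshold=0.70):
    """Return the dimension codes whose ±1 accuracy is at least threshold."""
    return sorted(dim for dim, acc in within_one.items() if acc >= threshold)


print(reliable_dims(WITHIN_ONE))  # ['AROU', 'ATCK']
```

A threshold of 0.70 corresponds to keeping Tier A and Tier B dimensions.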
All 57 Dimensions
Temporal Dynamics (6 dimensions)
TEMP - Tempo (±1: 63.6%). Mechanical speed of word/syllable production.
| Value | Short Tag | Description |
|---|---|---|
| 0 | glacially slow | Glacially slow, syllables stretched to breaking point |
| 1 | heavily deliberate | Unusually deliberate and labored word production |
| 2 | relaxed unhurried | Relaxed, slightly below conversational average |
| 3 | standard conversational | Standard everyday speech tempo |
| 4 | brisk elevated | Brisk, high-engagement forward push |
| 5 | noticeably compressed fast | Words compressed, high-density acoustic stream |
| 6 | blistering hyper-accelerated | Absolute human limit of linguistic speed |
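To make the mapping from class index to tag concrete, the sketch below decodes a vector of class probabilities into a value, a short tag, and a confidence. It assumes the predicted value is the most probable class and that the confidence is that class's probability; the model card does not spell this out, and the `decode` helper and the probabilities are hypothetical.

```python
# Short tags for TEMP, copied from the table above (index = value).
TEMP_TAGS = [
    "glacially slow",
    "heavily deliberate",
    "relaxed unhurried",
    "standard conversational",
    "brisk elevated",
    "noticeably compressed fast",
    "blistering hyper-accelerated",
]


def decode(probs, tags):
    """Pick the most probable class and attach its short tag."""
    if len(probs) != len(tags):
        raise ValueError("need one probability per tag")
    value = max(range(len(probs)), key=probs.__getitem__)
    return {"value": value, "tag_short": tags[value], "confidence": probs[value]}


pred = decode([0.02, 0.05, 0.10, 0.55, 0.20, 0.06, 0.02], TEMP_TAGS)
print(f"TEMP: {pred['value']} - {pred['tag_short']} ({pred['confidence']:.0%})")
# TEMP: 3 - standard conversational (55%)
```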
CHNK - Chunking (±1: 69.7%). Breath unit grouping and pause frequency.
| Value | Short Tag | Description |
|---|---|---|
| 0 | severely fragmented syllables | Single syllables broken by massive gaps |
| 1 | very choppy stop-go | Short choppy two-word bursts |
| 2 | consistently shorter pedantic | Cautious, highly separated word groupings |
| 3 | naturally medium balanced | Natural sentence-sized breath groups |
| 4 | noticeably extended sweeping | Extended multi-sentence breath units |
| 5 | very long dense | Very long dense streams with brief inhalations |
| 6 | massive continuous unbroken | Continuous wall of words, no pauses |
SMTH - Smoothness (±1: 88.6%). Timing regularity and transition fluidity.
| Value | Short Tag | Description |
|---|---|---|
| 0 | completely chaotic spasmodic | Chaotic, spasmodic timing with random glitches |
| 1 | sharply detached staccato | Sharp staccato machine-gun-like delivery |
| 2 | uneven bumpy clumsy | Uneven, bumpy with micro-hesitations |
| 3 | standard naturally flexible | Standard natural rhythm with organic flexibility |
| 4 | noticeably consistent practiced | Consistent, practiced, professional timing |
| 5 | flowing silky legato | Silky legato with seamless transitions |
| 6 | mathematically perfect metronomic | Mathematically perfect, metronomic timing |
VFLX - Velocity Flex (±1: 94.3%). Change in speech speed across the clip.
| Value | Short Tag | Description |
|---|---|---|
| 0 | massive deceleration grinding | Massive deceleration, grinding to a halt |
| 1 | heavy sustained slowdown | Heavy, sustained slowdown |
| 2 | subtle natural easing | Subtle natural easing of pace |
| 3 | perfectly locked steady | Perfectly steady, locked tempo |
| 4 | subtle forward lean | Subtle forward acceleration |
| 5 | clear sustained acceleration | Clear sustained acceleration |
| 6 | extreme acceleration explosion | Extreme acceleration explosion |
ARSH - Arousal Shift (±1: 88.6%). Change in autonomic energy across the clip.
| Value | Short Tag | Description |
|---|---|---|
| 0 | total catastrophic collapse | From high activation to complete collapse |
| 1 | massive rapid de-escalation | Significant drop in energy |
| 2 | subtle gentle settling | Comforting settling of tension |
| 3 | completely static unchanging | Constant arousal level |
| 4 | subtle engaging perking | Light rising alertness |
| 5 | clear aggressive escalation | Aggressive energy escalation |
| 6 | violent explosive detonation | 0-to-100 explosion into panic |
DARC - Dynamic Arc (±1: 67.6%). Loudness trajectory across the clip.
| Value | Short Tag | Description |
|---|---|---|
| 0 | total catastrophic fade | Fade from loud to inaudible |
| 1 | clear pronounced diminuendo | Clear diminuendo |
| 2 | gentle natural softening | Gentle natural softening |
| 3 | absolutely flatlined constant | Perfectly constant volume |
| 4 | controlled satisfying bell | Controlled bell-curve shape |
| 5 | clear aggressive crescendo | Aggressive crescendo |
| 6 | impossibly extreme violent | Extreme volume explosion |
Prosody & Pitch (4 dimensions)
RANG - Pitch Range (±1: 76.7%). Vertical movement of fundamental frequency.
| Value | Short Tag | Description |
|---|---|---|
| 0 | perfectly flat monotone | Zero pitch variation, pure drone |
| 1 | severely suppressed | Microscopic pitch movement |
| 2 | tightly restrained narrow | Narrow, controlled pitch window |
| 3 | naturally balanced | Standard conversational intonation |
| 4 | expressively wide | Wide, colorful pitch movement |
| 5 | highly melodic sweeping | Sweeping dramatic pitch jumps |
| 6 | wildly operatic extreme | Wild operatic pitch extremes |
EMPH - Emphasis (±1: 74.3%). Word stress and informational hierarchy.
| Value | Short Tag | Description |
|---|---|---|
| 0 | flat monotone | Every word identical weight |
| 1 | barely stressed | Nearly undetectable stress shifts |
| 2 | softly highlighted | Light, polite word highlighting |
| 3 | naturally stressed | Clear natural emphasis pattern |
| 4 | strongly marked | Strong acoustic hierarchy |
| 5 | aggressively punched | Aggressive percussive stress |
| 6 | violently explosive | Violent explosive emphasis |
REGS - Register (±1: 88.6%). Vocal register from bass to soprano.
| Value | Short Tag | Description |
|---|---|---|
| 0 | basso profondo extreme | Extreme low bass register |
| 1 | baritone grounded warm | Warm grounded baritone |
| 2 | tenor bright lifted | Bright lifted tenor |
| 3 | bridged contralto countertenor | Bridged middle register |
| 4 | mezzo-soprano balanced | Balanced mezzo-soprano |
| 5 | soprano bright brilliant | Bright brilliant soprano |
| 6 | coloratura whistle extreme | Extreme high whistle register |
VOLT - Volatility (±1: 77.1%). Stability/instability of vocal parameters.
| Value | Short Tag | Description |
|---|---|---|
| 0 | absolutely frozen static | Frozen, perfectly static voice |
| 1 | highly steady unshakeable | Rock-steady, unshakeable |
| 2 | stable minor shifts | Stable with minor shifts |
| 3 | natural organic breathing | Natural organic variation |
| 4 | minor emotional flickering | Emotional flickering |
| 5 | highly unstable jarring | Highly unstable and jarring |
| 6 | completely chaotic cycling | Completely chaotic cycling |
Articulation & Fluency (4 dimensions)
CLRT - Articulation (±1: 71.4%). Consonant/vowel clarity and precision.
| Value | Short Tag | Description |
|---|---|---|
| 0 | indecipherable blurry hum | Indecipherable blurry hum |
| 1 | severely swallowed mumbled | Severely mumbled |
| 2 | consistently soft relaxed | Soft, relaxed articulation |
| 3 | neutral standard clear | Standard clear speech |
| 4 | crisp distinct professional | Crisp professional clarity |
| 5 | incredibly precise deliberate | Incredibly precise diction |
| 6 | hyper-articulated exaggerated | Hyper-articulated, exaggerated |
DFLU - Disfluency (±1: 82.9%). Fillers, false starts, self-corrections.
| Value | Short Tag | Description |
|---|---|---|
| 0 | pristine flawless perfect | Pristine, zero hesitations |
| 1 | highly polished professional | Highly polished delivery |
| 2 | highly fluent organic | Fluent with tiny organic imperfections |
| 3 | standard natural baseline | Standard natural filler rate |
| 4 | noticeably hesitant staggered | Noticeably hesitant and staggered |
| 5 | overwhelmingly messy chaotic | Overwhelmingly messy delivery |
| 6 | entirely shattered incoherent | Shattered, incoherent speech |
ATCK - Attack (±1: 97.1%). Onset quality of phonation.
| Value | Short Tag | Description |
|---|---|---|
| 0 | ghostly imperceptible fade | Ghostly fade-in onset |
| 1 | breathy diffused onset | Breathy, diffused onset |
| 2 | soft polite gentle | Soft, gentle onset |
| 3 | neutral standard balanced | Neutral balanced onset |
| 4 | hard clear square | Hard, square onset |
| 5 | violent percussive bark | Violent percussive bark |
| 6 | explosive glottal slam | Explosive glottal slam |
COGL - Cognitive Load (±1: 60.0%). Mental processing effort audible in delivery.
| Value | Short Tag | Description |
|---|---|---|
| 0 | perfectly fluid effortless | Perfectly fluid, zero effort |
| 1 | highly articulate efficient | Highly efficient processing |
| 2 | standard healthy natural | Standard healthy delivery |
| 3 | noticeable active searching | Noticeable word-searching |
| 4 | heavily burdened struggling | Heavily burdened, struggling |
| 5 | severely overloaded fracturing | Severely overloaded |
| 6 | catastrophically overwhelmed breakdown | Complete cognitive breakdown |
Voice Quality (7 dimensions)
ROUG - Roughness (±1: 75.9%). Vocal fold irregularity and texture.
| Value | Short Tag | Description |
|---|---|---|
| 0 | impossibly smooth pure | Impossibly smooth, pure tone |
| 1 | exceptionally velvety consistent | Velvety, consistent texture |
| 2 | standard healthy texture | Standard healthy vocal texture |
| 3 | distinctly grainy modulated | Distinctly grainy |
| 4 | heavily raspy weathered | Heavily raspy and weathered |
| 5 | aggressively harsh growling | Aggressively harsh growling |
| 6 | violently shredding chaotic | Violently shredding, chaotic |
TENS - Tension (±1: 75.0%). Muscular tension in the vocal apparatus.
| Value | Short Tag | Description |
|---|---|---|
| 0 | completely relaxed floppy | Completely relaxed, floppy |
| 1 | loose comfortable warmth | Loose, comfortable warmth |
| 2 | neutral conversational firmness | Neutral conversational firmness |
| 3 | mild muscular edge | Mild muscular edge |
| 4 | pressed highly restricted | Pressed, highly restricted |
| 5 | heavily strained grinding | Heavily strained, grinding |
| 6 | rigidly locked strangled | Rigidly locked, strangled |
BRGT - Brightness (±1: 79.3%). High-frequency spectral energy.
| Value | Short Tag | Description |
|---|---|---|
| 0 | completely muffled dark | Completely muffled, dark |
| 1 | severely reduced woolly | Severely reduced, woolly |
| 2 | gently attenuated warm | Gently warm, attenuated highs |
| 3 | perfectly neutral balanced | Perfectly balanced spectrum |
| 4 | well-defined crisp modern | Crisp, well-defined presence |
| 5 | heavily emphasized brilliant | Brilliant, heavily emphasized highs |
| 6 | overwhelmingly harsh piercing | Overwhelmingly harsh, piercing |
WARM - Warmth (±1: 89.3%). Low-mid frequency richness and body.
| Value | Short Tag | Description |
|---|---|---|
| 0 | surgically sterile cold | Surgically cold, sterile |
| 1 | tinny hollow top-heavy | Tinny, hollow |
| 2 | neutral functional cool | Neutral, functional |
| 3 | perfectly balanced baseline | Perfectly balanced |
| 4 | pleasant woody cozy | Pleasant, woody warmth |
| 5 | rich velvety late-night | Rich velvety late-night tone |
| 6 | overwhelmingly melting enveloping | Overwhelmingly enveloping warmth |
FULL - Fullness/Body (±1: 61.5%). Spectral width and harmonic density.
| Value | Short Tag | Description |
|---|---|---|
| 0 | paper-thin sliver | Paper-thin sliver of sound |
| 1 | highly restricted narrow | Highly restricted, narrow |
| 2 | slender lightweight | Slender, lightweight |
| 3 | naturally healthy | Natural healthy body |
| 4 | incredibly rich wide | Incredibly rich and wide |
| 5 | massive commanding | Massive, commanding |
| 6 | imax overwhelming | IMAX-like overwhelming body |
HARM - Harmonicity (±1: 71.0%). Harmonic-to-noise ratio and tonal purity.
| Value | Short Tag | Description |
|---|---|---|
| 0 | pure white noise | Pure noise, no harmonics |
| 1 | ghostly whisper | Ghostly whisper |
| 2 | breathy diffused | Breathy, diffused harmonics |
| 3 | naturally balanced | Naturally balanced |
| 4 | highly resonant focused | Highly resonant, focused |
| 5 | bell-like pure tonal | Bell-like pure tone |
| 6 | digitally synthetic | Digitally synthetic purity |
METL - Metallicness (±1: 82.4%). Metallic/ringing spectral quality.
| Value | Short Tag | Description |
|---|---|---|
| 0 | organic impossibly soft | Organic, impossibly soft |
| 1 | earthy negligible ping | Earthy, negligible ring |
| 2 | standard organic | Standard organic texture |
| 3 | subtly firm metallic | Subtle metallic edge |
| 4 | clearly clanging ring | Clearly metallic ring |
| 5 | fiercely piercing clanging | Fiercely piercing, clanging |
| 6 | pure robotic steel | Pure robotic steel |
Resonance (7 dimensions)
R_CHST - Chest Resonance (±1: 88.2%) | 0: weightless | 1: faintest warmth | 2: mild chest | 3: balanced grounded | 4: pronounced chest | 5: massive booming | 6: floor-shaking sub-bass |
R_HEAD - Head Resonance (±1: 73.5%) | 0: body-locked | 1: microscopic lift | 2: gentle upper | 3: floating gentle | 4: airy lofty | 5: falsetto | 6: extreme whistle |
R_MASK - Mask Resonance (±1: 90.6%) | 0: dull dead | 1: microscopic focus | 2: present clarity | 3: pingy cutting | 4: metallic theatrical | 5: piercing drilled | 6: laser siren |
R_MIXD - Mixed Resonance (±1: 66.7%) | 0: unbalanced fractured | 1: struggling blend | 2: split un-integrated | 3: connected unified | 4: rich blended | 5: full-spectrum | 6: impossibly huge |
R_NASL - Nasal Resonance (±1: 75.0%) | 0: blocked denasal | 1: clean pure | 2: subtle twang | 3: obvious pinched | 4: deliberate pinched | 5: unpleasant piercing | 6: hypernasal bleating |
R_ORAL - Oral Resonance (±1: 56.2%) | 0: displaced non-oral | 1: vague unfocused | 2: mostly oral | 3: balanced neutral | 4: clear forward | 5: exaggerated open | 6: megaphone projection |
R_THRT - Throat Resonance (±1: 77.1%) | 0: open relaxed | 1: faint coloring | 2: pharyngeal pulled | 3: centered guttural | 4: pharyngeal choked | 5: strangled distorted | 6: imploded swallowed |
Speaker Attributes (5 dimensions)
AGEV - Age of Voice (±1: 68.6%) | 0: neonatal | 1: child | 2: adolescent | 3: peak adult | 4: matured | 5: middle aging | 6: extreme senescence |
GEND - Gender Presentation (±1: 71.4%) | 0: hyper-feminine soprano | 1: clearly feminine | 2: feminine alto | 3: androgynous | 4: masculine tenor | 5: standard masculine | 6: hyper-masculine bass |
AROU - Arousal (±1: 82.9%) | 0: comatose | 1: deeply sedate | 2: reserved controlled | 3: baseline alert | 4: elevated excited | 5: highly intense | 6: maximum screaming |
VALN - Valence (±1: 77.1%) | 0: pure suffering | 1: heavily saddened | 2: subtly negative | 3: neutral baseline | 4: politely warm | 5: cheerful happiness | 6: euphoric joy |
VALS - Valence Shift (±1: 62.9%) | 0: tragic plunge | 1: mood darkening | 2: subtle worry | 3: perfectly stable | 4: gentle brightening | 5: powerful uplift | 6: euphoric redemption |
Delivery & Structure (5 dimensions)
STRU - Structure (±1: 90.3%) | 0: scattered disconnected | 1: jumpy erratic | 2: loose rambling | 3: basic coherent | 4: linear organized | 5: organized signposted | 6: perfect masterclass |
STNC - Stance (±1: 74.3%) | 0: tiny submissive | 1: reserved cautious | 2: cooperative friendly | 3: neutral informative | 4: assertive confident | 5: authoritative commanding | 6: dominating aggressive |
FOCS - Focus/Engagement (±1: 80.0%) | 0: deeply dissociated | 1: emotionally detached | 2: distracted split | 3: politely engaged | 4: clearly engaged | 5: laser-focused | 6: obsessively hypnotic |
VULN - Vulnerability (±1: 57.1%) | 0: armored impenetrable | 1: guarded professional | 2: composed adult | 3: open accessible | 4: empathetic permeable | 5: completely raw | 6: profoundly naked |
RESP - Respiratory Audibility (±1: 85.7%) | 0: invisible | 1: controlled micro | 2: natural conversational | 3: noticeably elevated | 4: highly labored | 5: rapid gasping | 6: catastrophic failure |
Style Dimensions (15 dimensions)
S_AUTH - Authoritative (±1: 97.1%) | 0: trembling submissive | 1: weakly uncertain | 2: peer suggestion | 3: firm directive | 4: commanding forceful | 5: aggressive dominance | 6: terrifying assault |
S_FORM - Formality (±1: 90.9%) | 0: vulgar aggressive | 1: casual everyday | 2: service polite | 3: boardroom professional | 4: official rigid | 5: ceremonial grave | 6: royal imperial |
S_NARR - Narration (±1: 80.6%) | 0: stumbling confused | 1: child classroom | 2: dry academic | 3: warm engaging | 4: audiobook elite | 5: epic cinematic | 6: god voice authority |
S_STRY - Storytelling (±1: 84.8%) | 0: robotic data reader | 1: dry facts only | 2: casual bar story | 3: campfire storyteller | 4: professional voice actor | 5: epic fantasy bard | 6: hypnotic ancient bard |
S_DRAM - Dramatic (±1: 42.9%) | 0: deadpan bored | 1: bad flat acting | 2: cinematic subtle | 3: stage emotional | 4: shakespearean epic | 5: melodramatic soap | 6: operatic meltdown |
S_CART - Cartoon (±1: 85.7%) | 0: hyper-realistic | 1: completely natural | 2: party impression | 3: sitcom character | 4: wacky energetic | 5: zany chaotic | 6: looney tunes extreme |
S_NEWS - Newscaster (±1: 77.1%) | 0: rambling chaotic | 1: casual conversational | 2: youtube opinionated | 3: field reporter | 4: studio anchor | 5: gravitas authoritative | 6: 70s exaggerated |
S_TECH - Technical (±1: 83.9%) | 0: obscure jargon rush | 1: unclear rapid lecturing | 2: dry professor | 3: engaging tutorial host | 4: precise emergency manual | 5: condescending mansplain | 6: hyperactive preschool |
S_CONV - Conversational (±1: 84.8%) | 0: aggressive monologue | 1: detached uninteractive | 2: internal soliloquy | 3: balanced interactive | 4: warmly engaged | 5: eager checking | 6: manic extrovert |
S_CASU - Casual (±1: 58.6%) | 0: robotic rigid | 1: nervously stiff | 2: polite guarded | 3: relaxed friendly | 4: loose informal | 5: lazy mumbled | 6: drunk slacker |
S_RANT - Rant (±1: 85.7%) | 0: diplomatic soothing | 1: calm rational | 2: passive annoyed | 3: energetic venting | 4: heated aggressive | 5: political tirade | 6: manic rage |
S_PLAY - Playfulness (±1: 60.0%) | 0: grim funeral | 1: serious business | 2: deadpan ironic | 3: friendly warm | 4: joyful joking | 5: mischievous teasing | 6: whimsical ecstatic |
S_ASMR - ASMR (±1: 75.0%) | 0: stadium screaming | 1: public projection | 2: hushed intimate | 3: warm soothing | 4: pure whisper airy | 5: textural tingles | 6: hyper-intimate binaural |
S_WHIS - Whisper (±1: 61.8%) | 0: full-voiced projection | 1: polite indoor voice | 2: intimate breathy | 3: piercing stage whisper | 4: library whisper | 5: rasping batman growl | 6: ghostly ethereal |
S_MONO - Monologue (±1: 80.0%) | 0: interactive dialogue | 1: public formal address | 2: thinking aloud | 3: diary intimate | 4: theatrical tragic | 5: noir voiceover | 6: noir detective extreme |
Recording & Content (4 dimensions)
RCQL - Recording Quality (±1: 48.6%) | 0: severely corrupted | 1: amateur flawed | 2: below average | 3: standard consumer | 4: good prosumer | 5: broadcast pristine | 6: audiophile flawless |
BKGN - Background Noise (±1: 68.6%) | 0: absolute vacuum | 1: imperceptible quiet | 2: slight hiss | 3: noticeable ambience | 4: highly noisy | 5: chaotic overwhelming | 6: catastrophic masking |
ESTH - Aesthetic Quality (±1: 85.7%) | 0: viscerally repulsive | 1: deeply grating | 2: mundane boring | 3: acceptably standard | 4: genuinely pleasant | 5: star power radiant | 6: sublimely transcendent |
EXPL - Explicitness (±1: 77.1%) | 0: completely innocent | 1: mildly casual | 2: moderately mature | 3: distinctly restricted | 4: highly explicit | 5: extremely graphic | 6: deeply illegal |
Training Details
- Data source: Audio samples from voice taxonomy bucket reports, validated by Qwen3-Omni audio model
- Encoder features: BUD-E-Whisper V1.0 + V1.1 mean-pooled temporal halves → 3072-dim
- Dimensionality reduction: Per-dimension PCA from 3072 → 96 components
- Classifier: 1-hidden-layer MLP with ReLU (96 → 96 → n_classes)
- Training: Adam optimizer, lr=1e-3, max 200 epochs with early stopping (patience=20)
- Loss: Cross-entropy with inverse-frequency class weighting
- Validation: Stratified holdout (5 per class)
- Total parameters: ~570K trainable parameters across all 57 classifiers, plus 57 PCA projection matrices (3072 × 96 each)
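Inverse-frequency class weighting can be sketched as below. The list above does not state which normalization was used, so this example assumes the common "balanced" form, `n_samples / (n_present_classes * count)`; the `inverse_frequency_weights` helper and the toy labels are illustrative only.

```python
from collections import Counter


def inverse_frequency_weights(labels, n_classes=7):
    """Per-class loss weights proportional to 1 / class frequency.

    Classes that never occur in `labels` get weight 0.0.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_present = sum(1 for c in range(n_classes) if counts[c] > 0)
    return [
        n_samples / (n_present * counts[c]) if counts[c] else 0.0
        for c in range(n_classes)
    ]


# Toy training labels: class 3 is twice as common as classes 2 and 4.
labels = [3] * 6 + [2] * 3 + [4] * 3
weights = inverse_frequency_weights(labels)
print([round(w, 3) for w in weights])
# [0.0, 0.0, 1.333, 0.667, 1.333, 0.0, 0.0]
```

Weights of this kind are typically passed to the cross-entropy loss so that rare buckets contribute as much to the gradient as common ones.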
File Structure
Voice-Taxonomy-57/
├── README.md # This file
├── inference.py # Standalone inference pipeline
├── taxonomy_classifiers.pkl # 57 MLP classifiers + PCA weights (~68 MB)
├── taxonomy_tags_short.json # Short 2-3 word tags per bucket
├── taxonomy_tags_sentences.json # 5-10 word sentence tags per bucket
├── taxonomy_dimensions.json # Full dimension descriptions
├── config.json # Model configuration
└── requirements.txt # Dependencies
Related Models
- laion/BUD-E-Whisper: V1.0 encoder
- laion/BUD-E-Whisper_V1.1: V1.1 encoder
- laion/Empathic-Insight-Voice-Plus: 55 emotion + 4 quality dimensions (shares V1.0 encoder)
Citation
@misc{voice-taxonomy-57,
title={Voice-Taxonomy-57: 57-Dimension Voice Taxonomy Classifier},
author={LAION},
year={2025},
url={https://huggingface.co/laion/Voice-Taxonomy-57}
}
License
Apache 2.0