ECAPA-TDNN-VHE

Health-Centric Speech Representation Model for Vocal Fatigue Analysis


Model Details

  • Model Name: ECAPA-TDNN-VHE
  • Version: v1.0
  • Author: Muhammad Khubaib Ahmad
  • Affiliation: Independent Researcher
  • License: Apache License 2.0
  • Framework: PyTorch
  • Model Format: .pth
  • Embedding Dimension: 192
  • Projection Head: 128 (training only)

Overview

ECAPA-TDNN-VHE is a health-centric speech representation model trained using supervised contrastive learning to generate embeddings sensitive to vocal fatigue and strain while remaining robust to speaker identity, language, microphone type, and recording conditions.

The model does not perform medical diagnosis. It produces embeddings that enable relative and continuous vocal fatigue analysis by comparing test embeddings against reference centroids derived from healthy and fatigued speech.


Intended Use

Primary Intended Uses

  • Vocal fatigue research
  • Health-centric speech feature extraction
  • Continuous fatigue scoring
  • Longitudinal voice monitoring
  • Downstream modeling for vocal health analysis

Target Users

  • Speech and audio researchers
  • Machine learning engineers
  • Applied AI practitioners

Out-of-Scope Uses

  • Clinical diagnosis
  • Medical decision-making
  • Disease detection or treatment

⚠️ This model is not intended for medical or clinical use.


Input Specifications

Audio Requirements

  • Format: WAV (uncompressed)
  • Sample Rate: 16 kHz
  • Channels: Mono
  • Minimum Duration: 5 seconds
  • Maximum Duration: 10 seconds

Feature Extraction Parameters

  • n_mels: 80
  • n_fft: 400
  • hop_length: 256
  • Representation: Log-Mel Spectrogram

All inputs must follow these specifications for reliable results.
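As a minimal sketch, a preprocessing check for these requirements might look like the following (the `validate_input` helper and its error messages are illustrative, not part of the released code):

```python
import numpy as np

SAMPLE_RATE = 16000            # required sample rate (Hz)
MIN_DUR, MAX_DUR = 5.0, 10.0   # allowed duration range (seconds)

def validate_input(waveform: np.ndarray, sample_rate: int) -> None:
    """Raise ValueError if audio violates the model's input spec."""
    if sample_rate != SAMPLE_RATE:
        raise ValueError(f"expected {SAMPLE_RATE} Hz, got {sample_rate}")
    if waveform.ndim != 1:
        raise ValueError("expected mono audio (1-D array)")
    duration = waveform.shape[0] / sample_rate
    if not (MIN_DUR <= duration <= MAX_DUR):
        raise ValueError(f"duration {duration:.2f}s outside [{MIN_DUR}, {MAX_DUR}]s")

# A 6-second mono clip at 16 kHz passes the check
validate_input(np.zeros(6 * SAMPLE_RATE, dtype=np.float32), SAMPLE_RATE)
```

Feature extraction itself (80-bin log-mel with `n_fft=400`, `hop_length=256`) can then be applied to any waveform that passes this check.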


Output Specifications

  • Output Type: Fixed-length embedding
  • Dimension: 192
  • Data Type: float64
  • Normalization: L2-normalized
  • Similarity Metric: Cosine similarity

The embeddings are speaker-independent and optimized for health-centric analysis.
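Because the embeddings are L2-normalized, cosine similarity reduces to a dot product. A small numpy sketch (variable names are illustrative):

```python
import numpy as np

def l2_normalize(v: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale a vector to unit length, guarding against zero norm."""
    return v / (np.linalg.norm(v) + eps)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two embeddings (normalized defensively)."""
    return float(np.dot(l2_normalize(a), l2_normalize(b)))

rng = np.random.default_rng(0)
e1 = rng.standard_normal(192)    # 192-dim embedding, as produced by the model
e2 = rng.standard_normal(192)
sim = cosine_similarity(e1, e2)  # value in [-1, 1]
```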


Reference Centroids

The model uses reference centroids for relative fatigue estimation:

  • Healthy centroid (C_h): Provided
  • Fatigued centroid (C_f): Provided

Computation

import numpy as np

# Mean embedding per condition; the fatigued centroid pools strained and stressed
C_h = E_h.mean(axis=0)
C_f = np.vstack([E_s, E_t]).mean(axis=0)

# L2-normalize each centroid (epsilon guards against a zero norm)
C_h /= np.linalg.norm(C_h) + 1e-8
C_f /= np.linalg.norm(C_f) + 1e-8

Where:

  • E_h: healthy embeddings
  • E_s: strained embeddings
  • E_t: stressed embeddings

Fatigue is estimated using cosine similarity and relative distance along the health–fatigue axis.
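The card does not give an exact scoring formula. One plausible sketch projects a test embedding onto the axis from C_h to C_f and rescales to [0, 1]; the `fatigue_score` helper below is an assumption for illustration, not the released method:

```python
import numpy as np

def fatigue_score(e: np.ndarray, c_h: np.ndarray, c_f: np.ndarray) -> float:
    """Relative position of embedding e along the health–fatigue axis.

    0.0 ≈ at the healthy centroid, 1.0 ≈ at the fatigued centroid;
    the projection is clipped so outliers stay within [0, 1].
    """
    axis = c_f - c_h
    t = np.dot(e - c_h, axis) / (np.dot(axis, axis) + 1e-8)
    return float(np.clip(t, 0.0, 1.0))

# Toy 192-dim unit-norm centroids for illustration
rng = np.random.default_rng(1)
c_h = rng.standard_normal(192); c_h /= np.linalg.norm(c_h)
c_f = rng.standard_normal(192); c_f /= np.linalg.norm(c_f)

score_h = fatigue_score(c_h, c_h, c_f)  # 0.0 at the healthy centroid
score_f = fatigue_score(c_f, c_h, c_f)  # ~1.0 at the fatigued centroid
```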

Training Data

  • Total Duration: over 1 hour

  • Speakers: ~70–100

  • Gender Distribution: ~60% male, ~40% female

  • Languages: Multiple; the model is trained to be language-independent

  • Devices: Multiple phones and microphones

  • Environments: Diverse background conditions

  • Labels

    • Healthy

    • Strained

    • Stressed

Labels describe vocal condition, not medical diagnosis.

Training Procedure

  • Base Architecture: ECAPA-TDNN (SpeechBrain)

  • Embedding Dimension: 192

  • Projection Head: 128

  • Loss Function: Supervised Contrastive Loss

  • Optimizer: AdamW

  • Learning Rate: 1e-4

  • Weight Decay: 1e-2

  • Epochs: 60

  • Hardware: 2× NVIDIA T4 GPUs

  • Supervised contrastive learning encourages compact clusters for similar vocal conditions and separation across fatigue levels.
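As a rough numpy sketch of the supervised contrastive objective (Khosla et al.-style SupCon, applied to the 128-dim projection-head outputs; the batch, labels, and temperature below are illustrative, not the training configuration):

```python
import numpy as np

def supcon_loss(z: np.ndarray, labels: np.ndarray, tau: float = 0.07) -> float:
    """Supervised contrastive loss over embeddings z of shape (N, D).

    Each anchor is pulled toward same-label samples (positives) and
    pushed away from all other samples in the batch.
    """
    n = z.shape[0]
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    sim = z @ z.T / tau                                  # pairwise similarities
    mask_self = np.eye(n, dtype=bool)
    sim_masked = np.where(mask_self, -np.inf, sim)       # exclude self-pairs
    # log-softmax over all other samples for each anchor
    log_prob = sim_masked - np.log(np.exp(sim_masked).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~mask_self
    # negated average log-probability of positives per anchor
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return float(per_anchor.mean())

rng = np.random.default_rng(2)
z = rng.standard_normal((8, 128))             # projection-head outputs (dim 128)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # healthy / strained / stressed
loss = supcon_loss(z, labels)
```

Well-separated, same-label-identical embeddings yield a lower loss than random ones, which is exactly the clustering behavior described above.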

Evaluation and Analysis

  • Embedding space visualized using UMAP

  • Clear separation observed between healthy and fatigued clusters

Robustness observed across:

  • Speakers

  • Languages

  • Recording devices

  • Background conditions

No clinical benchmarks are reported.

Limitations

  • Not clinically validated

  • Not suitable for medical diagnosis

  • Fatigue is modeled as a relative acoustic phenomenon

  • Short utterances (<5s) may reduce reliability

  • Cultural or stylistic vocal variations may influence embeddings

Ethical Considerations

  • No identity inference

  • No medical claims

  • Requires informed consent for voice data collection

  • Outputs should be interpreted as indicators, not conclusions

Deployment Notes

  • PyTorch: 2.1.1

  • Python: 3.10

  • CPU and GPU inference supported

  • Suitable for batch and real-time pipelines

  • ONNX export possible with minor adjustments

Applications of the Embeddings

Embeddings produced by ECAPA-TDNN-VHE can be used for:

  • Continuous vocal fatigue scoring

  • Early fatigue trend detection

  • Longitudinal vocal condition monitoring

  • Multimodal health research pipelines

  • Robust speech representation learning

  • Contrastive and transfer learning studies

Citation

@misc{ahmad2026ecapa_vhe,
  title={ECAPA-TDNN-VHE: Health-Centric Speech Representations for Vocal Fatigue Analysis},
  author={Ahmad, Muhammad Khubaib},
  year={2026},
  note={Independent Researcher}
}