ECAPA-TDNN-VHE
Health-Centric Speech Representation Model for Vocal Fatigue Analysis
Model Details
- Model Name: ECAPA-TDNN-VHE
- Version: v1.0
- Author: Muhammad Khubaib Ahmad
- Affiliation: Independent Researcher
- License: Apache License 2.0
- Framework: PyTorch
- Model Format: .pth
- Embedding Dimension: 192
- Projection Head: 128 (training only)
Overview
ECAPA-TDNN-VHE is a health-centric speech representation model trained using supervised contrastive learning to generate embeddings sensitive to vocal fatigue and strain while remaining robust to speaker identity, language, microphone type, and recording conditions.
The model does not perform medical diagnosis. It produces embeddings that enable relative and continuous vocal fatigue analysis by comparing test embeddings against reference centroids derived from healthy and fatigued speech.
Intended Use
Primary Intended Uses
- Vocal fatigue research
- Health-centric speech feature extraction
- Continuous fatigue scoring
- Longitudinal voice monitoring
- Downstream modeling for vocal health analysis
Target Users
- Speech and audio researchers
- Machine learning engineers
- Applied AI practitioners
Out-of-Scope Uses
- Clinical diagnosis
- Medical decision-making
- Disease detection or treatment
⚠️ This model is not intended for medical or clinical use.
Input Specifications
Audio Requirements
- Format: WAV (uncompressed)
- Sample Rate: 16 kHz
- Channels: Mono
- Minimum Duration: 5 seconds
- Maximum Duration: 10 seconds
Feature Extraction Parameters
- n_mels: 80
- n_fft: 400
- hop_length: 256
- Representation: Log-Mel Spectrogram
All inputs must follow these specifications for reliable results.
Output Specifications
- Output Type: Fixed-length embedding
- Dimension: 192
- Data Type: float64
- Normalization: L2-normalized
- Similarity Metric: Cosine similarity
The embeddings are speaker-independent and optimized for health-centric analysis.
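Because the outputs are L2-normalized, cosine similarity reduces to a plain dot product. A minimal NumPy check (the vectors here are random stand-ins for real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length, guarding against a zero norm."""
    return v / (np.linalg.norm(v) + 1e-8)

# Random 192-dimensional vectors standing in for model embeddings.
a = l2_normalize(rng.standard_normal(192))
b = l2_normalize(rng.standard_normal(192))

# For unit vectors, cosine similarity is just the inner product.
cos_sim = float(a @ b)
```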
Reference Centroids
The model uses reference centroids for relative fatigue estimation:
- Healthy centroid (C_h): Provided
- Fatigued centroid (C_f): Provided
Computation

With NumPy, where E_h, E_s, and E_t are arrays of per-clip embeddings of shape (num_clips, 192):

import numpy as np

# Mean embedding per condition; the fatigued centroid pools strained and stressed clips.
C_h = E_h.mean(axis=0)
C_f = np.vstack([E_s, E_t]).mean(axis=0)

# L2-normalize each centroid (the epsilon guards against a zero norm).
C_h /= np.linalg.norm(C_h) + 1e-8
C_f /= np.linalg.norm(C_f) + 1e-8

Where:
- E_h: healthy embeddings
- E_s: strained embeddings
- E_t: stressed embeddings
Fatigue is estimated using cosine similarity and relative distance along the health–fatigue axis.
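One way to turn the two centroids into a continuous score is to compare a test embedding's similarity to each centroid (a sketch; the exact scoring rule used by the author is not published, and the [0, 1] mapping below is an illustrative choice):

```python
import numpy as np

def fatigue_score(e: np.ndarray, c_h: np.ndarray, c_f: np.ndarray) -> float:
    """Relative position of embedding e between the healthy and fatigued
    centroids. All inputs are assumed L2-normalized 1-D vectors.
    Returns a value in [0, 1]; higher means closer to the fatigued centroid."""
    sim_h = float(e @ c_h)  # cosine similarity to the healthy centroid
    sim_f = float(e @ c_f)  # cosine similarity to the fatigued centroid
    # (sim_h - sim_f) lies in [-2, 2]; map it linearly into [0, 1].
    return float(np.clip(0.5 - (sim_h - sim_f) / 4.0, 0.0, 1.0))
```

For orthogonal centroids, an embedding equal to the healthy centroid scores 0.25 and one equal to the fatigued centroid scores 0.75, so the score is best read as a relative trend, not an absolute label.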
Training Data
- Total Duration: approximately 1 hour or more
- Speakers: ~70–100
- Gender Distribution: ~60% male, ~40% female
- Languages: multiple; training was designed to be language-independent
- Devices: multiple phones and microphones
- Environments: diverse background conditions
Labels
- Healthy
- Strained
- Stressed
Labels describe vocal condition, not medical diagnosis.
Training Procedure
- Base Architecture: ECAPA-TDNN (SpeechBrain)
- Embedding Dimension: 192
- Projection Head: 128
- Loss Function: Supervised Contrastive Loss
- Optimizer: AdamW
- Learning Rate: 1e-4
- Weight Decay: 1e-2
- Epochs: 60
- Hardware: 2× NVIDIA T4 GPUs
Supervised contrastive learning encourages compact clusters for similar vocal conditions and separation across fatigue levels.
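A minimal supervised contrastive loss in the style of Khosla et al. can be sketched as follows (a sketch over projection-head outputs, not the author's training code; the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def supcon_loss(z: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """z: (batch, dim) projection-head outputs; labels: (batch,) condition ids."""
    z = F.normalize(z, dim=1)
    n = z.size(0)
    sim = (z @ z.T) / temperature  # pairwise cosine similarities
    # Exclude self-similarity with a large negative value (finite, so the
    # masked products below stay finite).
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: same vocal-condition label, excluding the anchor itself.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    # Average over anchors that have at least one positive in the batch.
    return per_anchor[pos_mask.any(dim=1)].mean()
```

Minimizing this pulls same-condition embeddings together and pushes different conditions apart, which is what produces the compact clusters described above.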
Evaluation and Analysis
- Embedding space visualized using UMAP
- Clear separation observed between healthy and fatigued clusters
- Robustness observed across:
  - Speakers
  - Languages
  - Recording devices
  - Background conditions
No clinical benchmarks are reported.
Limitations
- Not clinically validated
- Not suitable for medical diagnosis
- Fatigue is modeled as a relative acoustic phenomenon
- Short utterances (<5 s) may reduce reliability
- Cultural or stylistic vocal variations may influence embeddings
Ethical Considerations
- No identity inference
- No medical claims
- Requires informed consent for voice data collection
- Outputs should be interpreted as indicators, not conclusions
Deployment Notes
- PyTorch: 2.1.1
- Python: 3.10
- CPU and GPU inference supported
- Suitable for batch and real-time pipelines
- ONNX export possible with minor adjustments
Applications of the Embeddings
Embeddings produced by ECAPA-TDNN-VHE can be used for:
- Continuous vocal fatigue scoring
- Early fatigue trend detection
- Longitudinal vocal condition monitoring
- Multimodal health research pipelines
- Robust speech representation learning
- Contrastive and transfer learning studies
Citation
@misc{ahmad2026ecapa_vhe,
title={ECAPA-TDNN-VHE: Health-Centric Speech Representations for Vocal Fatigue Analysis},
author={Ahmad, Muhammad Khubaib},
year={2026},
note={Independent Researcher}
}
Model Repository: Khubaib01/ECAPA-TDNN-VHE
Base Model: speechbrain/spkrec-ecapa-voxceleb