ECAPA-TDNN-VHE

Health-Centric Speech Representation Model for Vocal Fatigue Analysis


Model Details

  • Model Name: ECAPA-TDNN-VHE
  • Version: v1.0
  • Author: Muhammad Khubaib Ahmad
  • Affiliation: Independent Researcher
  • License: Apache License 2.0
  • Framework: PyTorch
  • Model Format: .pth
  • Embedding Dimension: 192
  • Projection Head: 128 (training only)

Overview

ECAPA-TDNN-VHE is a health-centric speech representation model trained using supervised contrastive learning to generate embeddings sensitive to vocal fatigue and strain while remaining robust to speaker identity, language, microphone type, and recording conditions.

The model does not perform medical diagnosis. It produces embeddings that enable relative and continuous vocal fatigue analysis by comparing test embeddings against reference centroids derived from healthy and fatigued speech.


Intended Use

Primary Intended Uses

  • Vocal fatigue research
  • Health-centric speech feature extraction
  • Continuous fatigue scoring
  • Longitudinal voice monitoring
  • Downstream modeling for vocal health analysis

Target Users

  • Speech and audio researchers
  • Machine learning engineers
  • Applied AI practitioners

Out-of-Scope Uses

  • Clinical diagnosis
  • Medical decision-making
  • Disease detection or treatment

⚠️ This model is not intended for medical or clinical use.


Input Specifications

Audio Requirements

  • Format: WAV (uncompressed)
  • Sample Rate: 16 kHz
  • Channels: Mono
  • Minimum Duration: 5 seconds
  • Maximum Duration: 10 seconds

Feature Extraction Parameters

  • n_mels: 80
  • n_fft: 400
  • hop_length: 256
  • Representation: Log-Mel Spectrogram

All inputs must follow these specifications for reliable results.
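As a minimal sketch, a preprocessing check for these requirements might look like the following (the `validate_input` helper and its error messages are illustrative, not part of the released code):

```python
import numpy as np

SAMPLE_RATE = 16000            # required sample rate (Hz)
MIN_DUR, MAX_DUR = 5.0, 10.0   # allowed duration range (seconds)

def validate_input(waveform: np.ndarray, sample_rate: int) -> None:
    """Raise ValueError if audio violates the model's input spec."""
    if sample_rate != SAMPLE_RATE:
        raise ValueError(f"expected {SAMPLE_RATE} Hz, got {sample_rate}")
    if waveform.ndim != 1:
        raise ValueError("expected mono audio (1-D array)")
    duration = waveform.shape[0] / sample_rate
    if not (MIN_DUR <= duration <= MAX_DUR):
        raise ValueError(f"duration {duration:.2f}s outside [{MIN_DUR}, {MAX_DUR}]s")

# A 6-second mono clip at 16 kHz passes the check
validate_input(np.zeros(6 * SAMPLE_RATE, dtype=np.float32), SAMPLE_RATE)
```

Feature extraction itself (80-bin log-mel with `n_fft=400`, `hop_length=256`) can then be applied to any waveform that passes this check.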


Output Specifications

  • Output Type: Fixed-length embedding
  • Dimension: 192
  • Data Type: float64
  • Normalization: L2-normalized
  • Similarity Metric: Cosine similarity

The embeddings are speaker-independent and optimized for health-centric analysis.
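Because the embeddings are L2-normalized, cosine similarity reduces to a dot product. A small numpy sketch (variable names are illustrative):

```python
import numpy as np

def l2_normalize(v: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale a vector to unit length, guarding against zero norm."""
    return v / (np.linalg.norm(v) + eps)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two embeddings (normalized defensively)."""
    return float(np.dot(l2_normalize(a), l2_normalize(b)))

rng = np.random.default_rng(0)
e1 = rng.standard_normal(192)    # 192-dim embedding, as produced by the model
e2 = rng.standard_normal(192)
sim = cosine_similarity(e1, e2)  # value in [-1, 1]
```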


Reference Centroids

The model uses reference centroids for relative fatigue estimation:

  • Healthy centroid (C_h): Provided
  • Fatigued centroid (C_f): Provided

Computation

import numpy as np

# Mean embedding per condition; the fatigued centroid pools strained and stressed
C_h = E_h.mean(axis=0)
C_f = np.vstack([E_s, E_t]).mean(axis=0)

# L2-normalize each centroid (epsilon guards against a zero norm)
C_h /= np.linalg.norm(C_h) + 1e-8
C_f /= np.linalg.norm(C_f) + 1e-8

Where:

  • E_h: healthy embeddings
  • E_s: strained embeddings
  • E_t: stressed embeddings

Fatigue is estimated using cosine similarity and relative distance along the health–fatigue axis.
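The card does not give an exact scoring formula. One plausible sketch projects a test embedding onto the axis from C_h to C_f and rescales to [0, 1]; the `fatigue_score` helper below is an assumption for illustration, not the released method:

```python
import numpy as np

def fatigue_score(e: np.ndarray, c_h: np.ndarray, c_f: np.ndarray) -> float:
    """Relative position of embedding e along the health–fatigue axis.

    0.0 ≈ at the healthy centroid, 1.0 ≈ at the fatigued centroid;
    the projection is clipped so outliers stay within [0, 1].
    """
    axis = c_f - c_h
    t = np.dot(e - c_h, axis) / (np.dot(axis, axis) + 1e-8)
    return float(np.clip(t, 0.0, 1.0))

# Toy 192-dim unit-norm centroids for illustration
rng = np.random.default_rng(1)
c_h = rng.standard_normal(192); c_h /= np.linalg.norm(c_h)
c_f = rng.standard_normal(192); c_f /= np.linalg.norm(c_f)

score_h = fatigue_score(c_h, c_h, c_f)  # 0.0 at the healthy centroid
score_f = fatigue_score(c_f, c_h, c_f)  # ~1.0 at the fatigued centroid
```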

Training Data

  • Total Duration: over 1 hour

  • Speakers: ~70–100

  • Gender Distribution: ~60% male, ~40% female

  • Languages: Multiple; the model is trained to be language-independent

  • Devices: Multiple phones and microphones

  • Environments: Diverse background conditions

  • Labels

    • Healthy

    • Strained

    • Stressed

Labels describe vocal condition, not medical diagnosis.

Training Procedure

  • Base Architecture: ECAPA-TDNN (SpeechBrain)

  • Embedding Dimension: 192

  • Projection Head: 128

  • Loss Function: Supervised Contrastive Loss

  • Optimizer: AdamW

  • Learning Rate: 1e-4

  • Weight Decay: 1e-2

  • Epochs: 60

  • Hardware: 2× NVIDIA T4 GPUs

  • Supervised contrastive learning encourages compact clusters for similar vocal conditions and separation across fatigue levels.
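As a rough numpy sketch of the supervised contrastive objective (Khosla et al.-style SupCon, applied to the 128-dim projection-head outputs; the batch, labels, and temperature below are illustrative, not the training configuration):

```python
import numpy as np

def supcon_loss(z: np.ndarray, labels: np.ndarray, tau: float = 0.07) -> float:
    """Supervised contrastive loss over embeddings z of shape (N, D).

    Each anchor is pulled toward same-label samples (positives) and
    pushed away from all other samples in the batch.
    """
    n = z.shape[0]
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    sim = z @ z.T / tau                                  # pairwise similarities
    mask_self = np.eye(n, dtype=bool)
    sim_masked = np.where(mask_self, -np.inf, sim)       # exclude self-pairs
    # log-softmax over all other samples for each anchor
    log_prob = sim_masked - np.log(np.exp(sim_masked).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~mask_self
    # negated average log-probability of positives per anchor
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return float(per_anchor.mean())

rng = np.random.default_rng(2)
z = rng.standard_normal((8, 128))             # projection-head outputs (dim 128)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # healthy / strained / stressed
loss = supcon_loss(z, labels)
```

Well-separated, same-label-identical embeddings yield a lower loss than random ones, which is exactly the clustering behavior described above.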

Evaluation and Analysis

  • Embedding space visualized using UMAP

  • Clear separation observed between healthy and fatigued clusters

Robustness observed across:

  • Speakers

  • Languages

  • Recording devices

  • Background conditions

No clinical benchmarks are reported.

Limitations

  • Not clinically validated

  • Not suitable for medical diagnosis

  • Fatigue is modeled as a relative acoustic phenomenon

  • Short utterances (<5s) may reduce reliability

  • Cultural or stylistic vocal variations may influence embeddings

Ethical Considerations

  • No identity inference

  • No medical claims

  • Requires informed consent for voice data collection

  • Outputs should be interpreted as indicators, not conclusions

Deployment Notes

  • PyTorch: 2.1.1

  • Python: 3.10

  • CPU and GPU inference supported

  • Suitable for batch and real-time pipelines

  • ONNX export possible with minor adjustments

Applications of the Embeddings

Embeddings produced by ECAPA-TDNN-VHE can be used for:

  • Continuous vocal fatigue scoring

  • Early fatigue trend detection

  • Longitudinal vocal condition monitoring

  • Multimodal health research pipelines

  • Robust speech representation learning

  • Contrastive and transfer learning studies

Citation

@misc{ahmad2026ecapa_vhe,
  title={ECAPA-TDNN-VHE: Health-Centric Speech Representations for Vocal Fatigue Analysis},
  author={Ahmad, Muhammad Khubaib},
  year={2026},
  note={Independent Researcher}
}