Mirror from Khubaib01/ECAPA-TDNN-VHE

Files changed (4)

.gitattributes +1 -0
ECAPA_TDNN_VHE.pth +3 -0
README.md +178 -0
radar_chart.png +3 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+radar_chart.png filter=lfs diff=lfs merge=lfs -text

ECAPA_TDNN_VHE.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:145325909e3e53c13bbb351537117727f4caf34828aea9c2e55b1d0f7262bfc6
+size 9208363

README.md ADDED

	@@ -0,0 +1,178 @@

+---
+license: apache-2.0
+language:
+- en
+base_model:
+- speechbrain/spkrec-ecapa-voxceleb
+tags:
+  - speaker-embedding
+  - vocal-fatigue
+  - voice-health
+  - ecapa-tdnn
+  - vhe
+  - pytorch
+  - auralis-vfs
+  - audio-processing
+  - voice-analysis
+  - research
+---
+# ECAPA-TDNN-VHE: Vocal Health Encoder
+## Model Details
+- **Model name:** ECAPA-TDNN-VHE
+- **Author:** Muhammad Khubaib Ahmad et al.
+- **License:** Apache 2.0
+- **Framework:** PyTorch, SpeechBrain
+- **Embedding dimensionality:** 192
+- **Sampling rate:** 16 kHz (mono)
+- **Task:** Health-centric vocal fatigue representation learning
+- **Paper / Citation:**
+Ahmad, M. K. (2026). *Modeling Vocal Fatigue as Embedding-Space Deviation Using Contrastively Trained ECAPA-TDNNs*. Zenodo. https://doi.org/10.5281/zenodo.18366305
+---
+## Model Description
+ECAPA-TDNN-VHE (Vocal Health Encoder) is a research-grade speech encoder developed by Muhammad Khubaib Ahmad to produce health-centric, speaker-invariant vocal embeddings. Unlike conventional speaker embedding models, which are optimized for identity discrimination, ECAPA-TDNN-VHE is trained from scratch with supervised contrastive learning that explicitly separates vocal health states while minimizing speaker-specific information.
+Empirical evaluation shows that ECAPA-TDNN-VHE **outperforms** the baseline ECAPA-TDNN by roughly **2.2×** in classification accuracy (0.78 vs. 0.36) and **2.5×** in macro F1 (0.77 vs. 0.31) on the vocal health benchmark, which positions it as a state-of-the-art model for health-oriented speech representation learning among ECAPA-TDNN-based architectures.
+The encoder forms the core of the **Auralis** MLOps framework and is accessible via the open-source Python library **auralis_vfs**, enabling reproducible and real-time vocal fatigue scoring for research and applied scenarios.
+Key capabilities include:
+- **192-dimensional embeddings** capturing health-relevant characteristics (strain, stress, fatigue).
+- Continuous **vocal fatigue scoring** relative to a centroid of healthy embeddings (*fatigue axis*).
+- Integration into **Auralis**, a robust MLOps system for real-time vocal fatigue monitoring.
+- Accessible via the Python library [`auralis_vfs`](https://pypi.org/project/auralis-vfs/), enabling researchers to compute fatigue scores from audio files (`.wav`, `.mp3`, `.m4a`).
+This model represents a **state-of-the-art (SOTA) approach for ECAPA-based health embeddings**, outperforming conventional ECAPA-TDNN trained for speaker recognition.
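The fatigue score described above, a deviation from a centroid of healthy embeddings, can be illustrated with a minimal sketch. The distance metric and any scaling used by `auralis_vfs` are not stated in this card, so cosine distance, the helper names `centroid` and `cosine_distance`, and the 3-dimensional toy vectors (standing in for 192-dimensional embeddings) are all assumptions made for illustration.

```python
import math

def centroid(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy stand-ins for embeddings of recordings labeled Healthy (assumed data).
healthy_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, -0.1, 0.0]]
healthy_centroid = centroid(healthy_embeddings)

# A recording close to the healthy cluster versus one far from it.
near = cosine_distance([0.95, 0.05, 0.0], healthy_centroid)
far = cosine_distance([0.0, 1.0, 0.0], healthy_centroid)
print(f"near healthy centroid: {near:.4f}, far from it: {far:.4f}")
```

In this toy setup a larger distance plays the role of a higher fatigue score; how the real library maps distance to its reported score is not covered here.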
+---
+## Intended Use
+### Primary Use Cases
+- Vocal fatigue monitoring for occupational voice users
+- Health-centric speech embedding extraction
+- Longitudinal voice health tracking
+- Feature extraction for downstream clinical models
+- Computational paralinguistics research
+### Out-of-Scope
+- Speaker identification or verification
+- Emotion recognition without retraining
+- Medical diagnosis without professional oversight
+---
+## Training Data
+- Real-world dataset: **~1.5 hours of speech from 70+ speakers**
+- Labels: Healthy, Strained, Stressed
+- Diverse microphones, devices, acoustic environments
+- Gender-balanced, language-independent
+- Preprocessed audio: **16 kHz, mono**, duration 5–10 seconds
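Because the training clips are 16 kHz mono and 5 to 10 seconds long, it can help to confirm that new recordings match that format before scoring them. The sketch below uses only the standard-library `wave` module; the helper names `check_clip` and `write_silence` are hypothetical, and whether `auralis_vfs` resamples or trims audio internally is not stated in this card.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000
MIN_SECONDS, MAX_SECONDS = 5.0, 10.0

def check_clip(path):
    """Return a list of mismatches against the training format (empty if none)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    problems = []
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        problems.append(f"duration {seconds:.1f} s, expected 5-10 s")
    return problems

def write_silence(path, rate, channels, seconds):
    """Write a silent 16-bit PCM WAV file, used here only as demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 16_000, 1, 6)   # matches the training format
    write_silence(bad, 44_100, 2, 2)    # wrong rate, stereo, too short
    good_problems = check_clip(good)
    bad_problems = check_clip(bad)

print(good_problems)
print(bad_problems)
```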
+---
+## Training Procedure
+- **Architecture:** ECAPA-TDNN
+- **Training objective:** Supervised contrastive loss for health-state separability while minimizing speaker identity leakage
+- **Embedding dimension:** 192
+- **Optimizer:** Adam
+- **Initialization:** Trained from scratch
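As a rough illustration of the training objective, the sketch below implements the standard supervised contrastive (SupCon) loss in plain Python. It is not the author's training code: the temperature, batch construction, and any additional term used to suppress speaker identity are not given in this card, so the value `temperature=0.1`, the function names, and the 2-dimensional toy embeddings are assumptions.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Mean SupCon loss over anchors that have at least one same-label partner."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # no same-label sample in the batch for this anchor
        # Temperature-scaled cosine similarity to every other sample.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

labels = ["healthy", "healthy", "strained", "strained"]
# Same-label samples close together versus same-label samples far apart.
separated = supcon_loss([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], labels)
mixed = supcon_loss([[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]], labels)
print(f"separated: {separated:.4f}, mixed: {mixed:.4f}")
```

The loss is small when recordings with the same health label cluster together and large when they are scattered among other classes, which is the separability property the objective is meant to encourage.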
+---
+## Evaluation
+### Benchmarking Against Baseline ECAPA-TDNN
+The model was evaluated on vocal health classification tasks. Results highlight **ECAPA-TDNN-VHE's superiority over baseline ECAPA-TDNN**:
+| Model | Accuracy | Macro F1 | Healthy F1 | Strained F1 | Stressed F1 |
+|------|----------|----------|------------|-------------|-------------|
+| ECAPA-TDNN (SpeechBrain baseline) | 0.36 | 0.31 | 0.50 | 0.22 | 0.22 |
+| **ECAPA-TDNN-VHE (Muhammad Khubaib Ahmad et al., 2026)** | **0.78** | **0.77** | **0.85** | **0.78** | **0.70** |
+This demonstrates **state-of-the-art health-centric embedding performance** within ECAPA-based architectures.
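The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below computes the improvement ratios and recomputes macro F1 as the unweighted mean of the per-class F1 scores; the dictionary layout and function name are illustrative only, and the numbers are copied from the table.

```python
baseline = {"accuracy": 0.36, "macro_f1": 0.31, "per_class_f1": [0.50, 0.22, 0.22]}
vhe = {"accuracy": 0.78, "macro_f1": 0.77, "per_class_f1": [0.85, 0.78, 0.70]}

def macro_f1(per_class):
    """Macro F1 is the unweighted mean of the per-class F1 scores."""
    return sum(per_class) / len(per_class)

accuracy_ratio = vhe["accuracy"] / baseline["accuracy"]
macro_f1_ratio = vhe["macro_f1"] / baseline["macro_f1"]

print(f"accuracy ratio: {accuracy_ratio:.2f}x")
print(f"macro F1 ratio: {macro_f1_ratio:.2f}x")
print(f"recomputed macro F1 (VHE): {macro_f1(vhe['per_class_f1']):.3f}")
print(f"recomputed macro F1 (baseline): {macro_f1(baseline['per_class_f1']):.3f}")
```

The ratios work out to about 2.17× for accuracy and 2.48× for macro F1. The recomputed macro F1 values (about 0.777 and 0.313) agree with the reported 0.77 and 0.31 to within what rounding of the per-class scores would plausibly explain.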
+---
+## 📊 Radar Chart: Embedding Quality Comparison
+![Radar_chart](radar_chart.png)
+Metrics shown in the chart:
+- Precision
+- Recall
+- F1-score
+- Inter-class separation
+- Intra-class compactness
+> **Figure 1:** Radar chart comparing baseline ECAPA-TDNN and ECAPA-TDNN-VHE across classification and embedding quality metrics.
+---
+## 🏆 Leaderboard (Evaluated Models)
+| Rank | Model | Accuracy | Macro F1 |
+|------|-------|----------|----------|
+| **1** | **ECAPA-TDNN-VHE (Muhammad Khubaib Ahmad et al., 2026)** | **0.78** | **0.77** |
+| 2 | ECAPA-TDNN (SpeechBrain baseline) | 0.36 | 0.31 |
+> Leaderboard reflects performance on the vocal health dataset and serves as a **research benchmark**, not a universal ranking.
+---
+## Inference
+The model can be used via the Python library `auralis_vfs`:
+```bash
+pip install auralis_vfs
+```
+Example usage:
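+```python
+from auralis.scorer import score_audio, score_waveform
+
+# Score directly from an audio file (.wav, .mp3, or .m4a)
+file_score = score_audio("sample.wav")
+print(f"Vocal fatigue score (file): {file_score:.2f}")
+
+# Score from a waveform you have already loaded.
+# `audio_array` is a placeholder for your own 16 kHz mono samples; it is not defined here.
+waveform_score = score_waveform(audio_array)
+print(f"Vocal fatigue score (waveform): {waveform_score:.2f}")
+```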
+The model is also deployed in the Auralis MLOps system, providing real-time fatigue monitoring and embedding-based analyses.
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{muhammad_khubaib_ahmad_2026,
+	author       = { Muhammad Khubaib Ahmad },
+	title        = { ECAPA-TDNN-VHE (Revision 871292d) },
+	year         = 2026,
+	url          = { https://huggingface.co/Khubaib01/ECAPA-TDNN-VHE },
+	doi          = { 10.57967/hf/7648 },
+	publisher    = { Hugging Face }
+}
+```
+## Future Work
+- Integration of prosody features to enhance fatigue detection
+- Automatic generation of clinical-style reports
+- Expansion to larger, multi-lingual datasets
+- Longitudinal tracking of speaker fatigue trends
+## Acknowledgments
+The author gratefully acknowledges the participants who allowed their voices to be used in this research, and thanks the Data Manager (Faiez Ahmad) and the Data Collector (Muhammad Anas Tariq) for their invaluable services and cooperation.

radar_chart.png ADDED

Git LFS Details

SHA256: 491faf9976045808a69719a801670e6e3daada6a22e7bf83d455f331a81ef538
Pointer size: 131 Bytes
Size of remote file: 165 kB