adelelsayed1991/AdaptiveRespiratoryModel

adelelsayed1991 committed on Feb 22

Commit 4ef4eef (verified), 1 parent: 1065131

Update README.md

README.md +181 -3

@@ -1,3 +1,181 @@
----
-license: apache-2.0
----

+---
+language: en
+tags:
+  - audio
+  - audio-classification
+  - respiratory-sounds
+  - healthcare
+  - medical
+  - hear
+  - vit
+  - lora
+  - pytorch
+license: apache-2.0
+datasets:
+  - SPRSound
+metrics:
+  - accuracy
+  - f1
+  - roc_auc
+base_model: google/hear-pytorch
+pipeline_tag: audio-classification
+---
+# HeAR-SPRSound: Respiratory Sound Abnormality Classifier
+## Model Summary
+A fine-tuned respiratory sound classifier built on top of **Google's HeAR** (Health Acoustic Representations) foundation model. The model performs **binary classification** — distinguishing **normal** from **abnormal** respiratory sounds — and is trained on the **SPRSound** dataset spanning BioCAS challenge years 2022–2025.
+The architecture combines the HeAR ViT backbone (fine-tuned with LoRA) with a **Gated Attention Pooling** layer that aggregates the per-chunk embeddings of a variable-length recording into a single vector using learned, length-masked attention weights, followed by a two-layer MLP classifier.
+---
+## Architecture
+```
+Audio Input (16 kHz WAV)
+       ↓
+HeAR Preprocessing (2-second chunks, log-mel spectrograms [1 × 192 × 128])
+       ↓
+HeAR ViT Encoder (google/hear-pytorch)
+  └─ LoRA adapters on Q & V projections in last 6 transformer blocks
+       ↓
+Per-chunk CLS Embeddings [B × T × 512]
+       ↓
+Gated Attention Pooling (length-masked softmax attention over chunks)
+       ↓
+Pooled Representation [B × 512]
+       ↓
+MLP Classifier (512 → 256 → 2, GELU, Dropout 0.4)
+       ↓
+Normal / Abnormal
+```
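The pooling step in the diagram can be sketched as follows. This is a minimal illustration, assuming the common gated-attention formulation (a tanh branch multiplied by a sigmoid branch, then one linear score per chunk). The class name `GatedAttentionPool` and its argument names are hypothetical and are not taken from the repository.

```python
import torch
import torch.nn as nn


class GatedAttentionPool(nn.Module):
    """Hypothetical sketch of length-masked gated attention pooling."""

    def __init__(self, dim: int = 512, hidden: int = 512):
        super().__init__()
        self.tanh_path = nn.Linear(dim, hidden)   # content branch
        self.gate_path = nn.Linear(dim, hidden)   # sigmoid gate branch
        self.score = nn.Linear(hidden, 1)         # one scalar score per chunk

    def forward(self, x: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # x: [B, T, dim] per-chunk embeddings
        # lengths: [B] number of valid chunks per item (each must be >= 1)
        gated = torch.tanh(self.tanh_path(x)) * torch.sigmoid(self.gate_path(x))
        scores = self.score(gated).squeeze(-1)                   # [B, T]
        positions = torch.arange(x.size(1), device=x.device)     # [T]
        valid = positions.unsqueeze(0) < lengths.unsqueeze(1)    # [B, T]
        scores = scores.masked_fill(~valid, float("-inf"))
        weights = torch.softmax(scores, dim=1)   # padded chunks get weight 0
        return (weights.unsqueeze(-1) * x).sum(dim=1)            # [B, dim]


torch.manual_seed(0)
pool = GatedAttentionPool()
batch = torch.randn(2, 5, 512)        # 2 recordings, up to 5 chunks each
lengths = torch.tensor([5, 2])        # second recording has only 2 real chunks
pooled = pool(batch, lengths)         # shape [2, 512]
```

Because masked positions receive a score of negative infinity before the softmax, the content of padded chunks has no effect on the pooled vector.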
+**Key components:**
+- **Backbone**: `google/hear-pytorch` (frozen except LoRA layers + LayerNorms)
+- **LoRA**: rank=16, alpha=16, dropout=0.3, applied to Q+V projections in last 6 blocks
+- **Pooling**: Gated Attention Pool (dual-path tanh × sigmoid gating, hidden dim 512)
+- **Loss**: Focal Loss (γ=2.0) with class-balanced sample weighting
+- **Inference**: Per-class threshold optimization (one-vs-rest F1 on validation set)
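For reference, focal loss scales each sample's cross-entropy by (1 − p_t)^γ, so confidently correct samples contribute little to the total. A minimal pure-Python sketch with γ = 2.0 follows. The function name `focal_loss` is illustrative, and how the repository combines the loss with class-balanced weights is not shown in this card.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)**gamma * log(p_t).

    probs: predicted class probabilities; target: index of the true class.
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss([0.95, 0.05], target=0)   # confident and correct
hard = focal_loss([0.30, 0.70], target=0)   # leaning toward the wrong class
plain_ce = focal_loss([0.95, 0.05], target=0, gamma=0.0)  # gamma=0 is cross-entropy
```

With γ = 0 the modulating factor is 1 and the expression reduces to ordinary cross-entropy, which is a quick sanity check for any implementation.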
+---
+## Training Details
+| Hyperparameter | Value |
+|---|---|
+| Base model | `google/hear-pytorch` |
+| Input sample rate | 16,000 Hz |
+| Chunk size | 2 seconds (32,000 samples) |
+| Max audio duration | 10 seconds (up to 5 chunks) |
+| Optimizer | AdamW |
+| Learning rate | 5e-5 |
+| Weight decay | 0.2 |
+| Warmup epochs | 10 |
+| Max epochs | 100 |
+| Batch size | 96 |
+| Early stopping patience | 20 epochs |
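The chunking figures in the table follow from simple arithmetic: 2 s × 16,000 Hz = 32,000 samples per chunk, and a 10 s cap gives at most 5 chunks. The sketch below shows one way to split a waveform accordingly. Zero-padding the final partial chunk is an assumption, and the helper name `split_into_chunks` is hypothetical (the actual pipeline relies on HeAR's preprocessing).

```python
SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 2 * SAMPLE_RATE    # 32,000 samples per 2-second chunk
MAX_SAMPLES = 10 * SAMPLE_RATE     # 10-second cap, i.e. at most 5 chunks


def split_into_chunks(waveform):
    """Truncate to 10 s, then split into 2 s chunks, zero-padding the last one."""
    samples = list(waveform[:MAX_SAMPLES])
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        chunk += [0.0] * (CHUNK_SAMPLES - len(chunk))  # pad a short final chunk
        chunks.append(chunk)
    return chunks


three_seconds = [0.1] * (3 * SAMPLE_RATE)
chunks = split_into_chunks(three_seconds)   # one full chunk + one half-padded chunk
```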
+---
+## Dataset
+**SPRSound** — multi-year BioCAS challenge respiratory auscultation dataset.
+| Challenge year | Original split(s) |
+|---|---|
+| BioCAS 2022 | Train + Inter/Intra test |
+| BioCAS 2023 | Test |
+| BioCAS 2024 | Test |
+| BioCAS 2025 | Test |
+All data was **re-split at the patient level** (70% train / 15% val / 15% test) to prevent data leakage. No patient appears in more than one split. Labels were consolidated to a binary scheme:
+- **normal**: all event annotations are "Normal"
+- **abnormal**: any non-normal respiratory event present (wheeze, crackle, rhonchus, etc.)
+Class imbalance was addressed through `WeightedRandomSampler` and Focal Loss.
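A patient-level split of this kind can be sketched as below: patients, not recordings, are shuffled and partitioned, so every recording from one patient lands in the same split. The function name, the `(patient_id, path)` record layout, and the seed are hypothetical; the repository's actual split script is not shown in this card.

```python
import random


def split_by_patient(records, train=0.70, val=0.15, seed=0):
    """Assign whole patients to train/val/test; records are (patient_id, path)."""
    patients = sorted({pid for pid, _ in records})
    random.Random(seed).shuffle(patients)
    n_train = round(len(patients) * train)
    n_val = round(len(patients) * val)
    assignment = {}
    for i, pid in enumerate(patients):
        if i < n_train:
            assignment[pid] = "train"
        elif i < n_train + n_val:
            assignment[pid] = "val"
        else:
            assignment[pid] = "test"
    splits = {"train": [], "val": [], "test": []}
    for pid, path in records:
        splits[assignment[pid]].append((pid, path))
    return splits


# 20 made-up patients with 3 recordings each
records = [(f"p{p:02d}", f"p{p:02d}_rec{r}.wav") for p in range(20) for r in range(3)]
splits = split_by_patient(records)
```

Splitting on recordings instead would let the same patient appear in both train and test, which tends to inflate test metrics.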
+---
+## Data Augmentation
+A custom `PhoneLikeAugment` pipeline was applied during training (p=0.5) to simulate real-world acoustic variability:
+- Random gain (−18 to +8 dB)
+- Phone band-limiting (HP: 120–200 Hz, LP: 4–8 kHz)
+- Fast echo / room simulation (10–80 ms delay taps)
+- Colored noise addition (SNR 3–25 dB)
+- Soft AGC / tanh compression
+- Random time shift (±80 ms)
+- Rare clipping (p=0.15)
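Two of the listed steps are easy to illustrate in isolation. The sketch below applies a random gain drawn from the stated −18 to +8 dB range (converted to a linear factor with 10^(dB/20)) and then hard-clips with probability 0.15. It is not the repository's `PhoneLikeAugment`: the function names are hypothetical, and clipping at full scale after the gain step is an assumption.

```python
import random


def db_to_linear(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def gain_and_clip(samples, rng, clip_prob=0.15, clip_level=1.0):
    """Apply a random gain in [-18, +8] dB, then occasionally hard-clip."""
    gain = db_to_linear(rng.uniform(-18.0, 8.0))
    out = [s * gain for s in samples]
    if rng.random() < clip_prob:
        # Hard clipping: limit every sample to [-clip_level, +clip_level]
        out = [max(-clip_level, min(clip_level, s)) for s in out]
    return out


rng = random.Random(0)
augmented = gain_and_clip([0.5, -0.25, 0.0], rng)
```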
+---
+## Usage
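+```python
+import torch
+import torchaudio  # not used in this snippet
+from transformers import AutoModel  # not used in this snippet
+
+# NOTE: AdaptiveRespiratoryModel is this repository's custom model class.
+# It is not provided by torch or transformers, so it must be defined or
+# imported before this snippet will run.
+
+# Build the model with the hyperparameters listed in this card
+model = AdaptiveRespiratoryModel(
+    num_classes=2,
+    dropout=0.4,
+    use_lora=True,
+    lora_r=16,
+    lora_alpha=16,
+    lora_dropout=0.3,
+    lora_last_n_blocks=6
+)
+
+# weights_only=False unpickles arbitrary objects: only load checkpoints you trust
+checkpoint = torch.load("best_model.pth", map_location="cpu", weights_only=False)
+# strict=False silently skips missing or unexpected keys
+model.load_state_dict(checkpoint["model"], strict=False)
+model.eval()
+
+# Audio must be 16 kHz, processed through HeAR's preprocess_audio
+# into chunks of shape [T, 1, 192, 128]
+```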
+> ⚠️ Requires the `google/hear-pytorch` weights and the [HeAR](https://github.com/Google-Health/hear) library for audio preprocessing.
+---
+## Limitations & Intended Use
+- **Intended use**: Research and prototyping in respiratory sound analysis. **Not validated for clinical use.**
+- The model was trained on auscultation recordings from SPRSound; performance may degrade on recordings from different stethoscope types, microphones, or patient populations.
+- Binary classification only — does not distinguish between specific pathology types (e.g., wheeze vs. crackle).
+- Threshold calibration was performed on the validation set; recalibration is recommended when deploying to new domains.
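Recalibrating on a new domain can be as simple as re-running a threshold sweep on labeled data from that domain. The following sketch picks the abnormal-class probability threshold that maximizes F1. It illustrates the general idea only: the function names and the threshold grid are assumptions, and the scores and labels are made up.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for the positive (abnormal = 1) class at a given probability threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(scores, labels, grid=None):
    """Return (threshold, f1) maximizing F1 over a grid of candidate thresholds."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]
    return max(((t, f1_at_threshold(scores, labels, t)) for t in grid),
               key=lambda pair: pair[1])


# Made-up validation scores (probability of "abnormal") and true labels
val_scores = [0.10, 0.35, 0.40, 0.62, 0.80, 0.90]
val_labels = [0, 0, 1, 0, 1, 1]
threshold, f1 = best_threshold(val_scores, val_labels)
```

The chosen threshold should then be checked on data that was not used for the sweep, since a threshold tuned and evaluated on the same recordings will look better than it is.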
+---
+## Citation
+If you use this model, please cite the SPRSound dataset and the HeAR foundation model:
+```bibtex
+@misc{sprsound,
+  title   = {SPRSound: Open-Source SJTU Paediatric Respiratory Sound Database},
+  year    = {2022},
+  note    = {BioCAS 2022–2025 challenge dataset}
+}
+@misc{hear2024,
+  title   = {HeAR: Health Acoustic Representations},
+  author  = {Google Health},
+  year    = {2024},
+  url     = {https://github.com/Google-Health/hear}
+}
+```
+---
+## License
+This model is released under the **Apache 2.0** license. The HeAR backbone model is subject to Google's original license terms. SPRSound data is subject to its own terms — please refer to the dataset authors.