---
language: en
tags:
- audio
- audio-classification
- respiratory-sounds
- healthcare
- medical
- hear
- vit
- lora
- pytorch
license: apache-2.0
datasets:
- SPRSound
metrics:
- accuracy
- f1
- roc_auc
base_model: google/hear-pytorch
pipeline_tag: audio-classification
---
# HeAR-SPRSound: Respiratory Sound Abnormality Classifier
## Model Summary
A fine-tuned respiratory sound classifier built on top of **Google's HeAR** (Health Acoustic Representations) foundation model. The model performs **binary classification**, distinguishing **normal** from **abnormal** respiratory sounds, and is trained on the **SPRSound** dataset spanning BioCAS challenge years 2022–2025.
The architecture combines the HeAR ViT backbone (fine-tuned with LoRA) with a **Gated Attention Pooling** layer that intelligently aggregates variable-length audio sequences chunk by chunk, followed by a two-layer MLP classifier.
---
## Architecture
```
Audio Input (16 kHz WAV)
        ↓
HeAR Preprocessing (2-second chunks, log-mel spectrograms [1 × 192 × 128])
        ↓
HeAR ViT Encoder (google/hear-pytorch)
  └─ LoRA adapters on Q & V projections in last 6 transformer blocks
        ↓
Per-chunk CLS Embeddings [B × T × 512]
        ↓
Gated Attention Pooling (length-masked softmax attention over chunks)
        ↓
Pooled Representation [B × 512]
        ↓
MLP Classifier (512 → 256 → 2, GELU, Dropout 0.4)
        ↓
Normal / Abnormal
```
**Key components:**
- **Backbone**: `google/hear-pytorch` (frozen except LoRA layers + LayerNorms)
- **LoRA**: rank=16, alpha=16, dropout=0.3, applied to Q+V projections in last 6 blocks
- **Pooling**: Gated Attention Pool (dual-path tanh × sigmoid gating, hidden dim 512)
- **Loss**: Focal Loss (γ=2.0) with class-balanced sample weighting
- **Inference**: Per-class threshold optimization (one-vs-rest F1 on validation set)
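The gated attention pooling described above can be sketched as follows. This is an illustrative PyTorch module, not the repository's actual code; the class name, layer names, and the exact masking details are assumptions based on the description (dual-path tanh × sigmoid gating, hidden dim 512, length-masked softmax over chunks).

```python
import torch
import torch.nn as nn

class GatedAttentionPool(nn.Module):
    """Length-masked gated attention over chunk embeddings (illustrative sketch)."""
    def __init__(self, dim=512, hidden=512):
        super().__init__()
        self.v = nn.Linear(dim, hidden)   # tanh path
        self.u = nn.Linear(dim, hidden)   # sigmoid gate path
        self.w = nn.Linear(hidden, 1)     # per-chunk attention score

    def forward(self, x, mask):
        # x: [B, T, dim] chunk embeddings, mask: [B, T] with True for valid chunks
        scores = self.w(torch.tanh(self.v(x)) * torch.sigmoid(self.u(x))).squeeze(-1)
        scores = scores.masked_fill(~mask, float("-inf"))  # padded chunks get zero weight
        attn = torch.softmax(scores, dim=1)                # [B, T]
        return torch.einsum("bt,btd->bd", attn, x)         # [B, dim] pooled representation

pool = GatedAttentionPool()
x = torch.randn(2, 5, 512)
mask = torch.tensor([[True] * 5, [True, True, True, False, False]])
out = pool(x, mask)
```

Masking before the softmax is what lets the model handle variable-length recordings: a 4-second clip contributes 2 valid chunks and the remaining slots are ignored entirely.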
---
## Training Details
| Hyperparameter | Value |
|---|---|
| Base model | `google/hear-pytorch` |
| Input sample rate | 16,000 Hz |
| Chunk size | 2 seconds (32,000 samples) |
| Max audio duration | 10 seconds (up to 5 chunks) |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| Weight decay | 0.2 |
| Warmup epochs | 10 |
| Max epochs | 100 |
| Batch size | 96 |
| Early stopping patience | 20 epochs |
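The optimizer setup from the table above might look like the following sketch. The card states only AdamW, lr 5e-5, weight decay 0.2, 10 warmup epochs, and 100 max epochs; the linear-warmup plus cosine-decay shape of the schedule is an assumption.

```python
import math
import torch

# Stand-in module; the real model is the LoRA-adapted HeAR backbone + head.
model = torch.nn.Linear(512, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.2)

warmup_epochs, max_epochs = 10, 100

def lr_lambda(epoch):
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs  # linear warmup
    # cosine decay after warmup (assumed; the card does not specify the decay shape)
    progress = (epoch - warmup_epochs) / (max_epochs - warmup_epochs)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```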
---
## Dataset
**SPRSound**: a multi-year BioCAS challenge respiratory auscultation dataset.
| Year | Split |
|---|---|
| BioCAS 2022 | Train + Inter/Intra test |
| BioCAS 2023 | Test |
| BioCAS 2024 | Test |
| BioCAS 2025 | Test |
All data was **re-split at the patient level** (70% train / 15% val / 15% test) to prevent data leakage. No patient appears in more than one split. Labels were consolidated to a binary scheme:
- **normal**: all event annotations are "Normal"
- **abnormal**: any non-normal respiratory event present (wheeze, crackle, rhonchus, etc.)
Class imbalance was addressed through `WeightedRandomSampler` and Focal Loss.
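The focal loss objective (γ=2.0) used against this imbalance can be sketched as below; this is a generic focal loss implementation consistent with the card's description, not the repository's exact code, and the class-balanced weighting is passed in as an optional `weight` tensor.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, weight=None):
    """Focal loss with optional per-class weights (illustrative sketch).

    Down-weights easy, well-classified examples by (1 - p_t)^gamma so the
    minority 'abnormal' class contributes more to the gradient.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, weight=weight, reduction="none")
    pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()  # prob of true class
    return ((1 - pt) ** gamma * ce).mean()

logits = torch.tensor([[2.0, -1.0], [0.1, 0.3]])
targets = torch.tensor([0, 1])
loss = focal_loss(logits, targets)
```

With γ=0 this reduces to ordinary cross-entropy; γ=2.0 sharpens the focus on hard examples.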
---
## Data Augmentation
A custom `PhoneLikeAugment` pipeline was applied during training (p=0.5) to simulate real-world acoustic variability:
- Random gain (−18 to +8 dB)
- Phone band-limiting (HP: 120–200 Hz, LP: 4–8 kHz)
- Fast echo / room simulation (10–80 ms delay taps)
- Colored noise addition (SNR 3–25 dB)
- Soft AGC / tanh compression
- Random time shift (±80 ms)
- Rare clipping (p=0.15)
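Two of the steps above (random gain and random time shift) might look like the following NumPy sketch. The dB and millisecond ranges come from the list above; the uniform sampling and zero-padding behaviour are assumptions, and the function names are illustrative rather than the actual `PhoneLikeAugment` internals.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_gain(wav, low_db=-18.0, high_db=8.0):
    """Apply a random gain sampled from the -18..+8 dB range above."""
    gain_db = rng.uniform(low_db, high_db)
    return wav * 10.0 ** (gain_db / 20.0)

def random_time_shift(wav, sr=16000, max_ms=80):
    """Shift the waveform by up to +/-80 ms, zero-padding the vacated end
    (padding behaviour is an assumption; the card states only the range)."""
    shift = int(rng.integers(-max_ms, max_ms + 1)) * sr // 1000
    out = np.zeros_like(wav)
    if shift >= 0:
        out[shift:] = wav[:len(wav) - shift]
    else:
        out[:shift] = wav[-shift:]
    return out

wav = rng.standard_normal(32000).astype(np.float32)  # one 2-second chunk at 16 kHz
aug = random_time_shift(random_gain(wav))
```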
---
## Usage
```python
import torch

# AdaptiveRespiratoryModel is defined in this repository's model code;
# import it from wherever you place that module.
model = AdaptiveRespiratoryModel(
    num_classes=2,
    dropout=0.4,
    use_lora=True,
    lora_r=16,
    lora_alpha=16,
    lora_dropout=0.3,
    lora_last_n_blocks=6,
)
checkpoint = torch.load("best_model.pth", map_location="cpu", weights_only=False)
model.load_state_dict(checkpoint["model"], strict=False)
model.eval()

# Audio must be 16 kHz mono, processed through HeAR's preprocess_audio
# into chunks of shape [T, 1, 192, 128] before being passed to the model.
```
> ⚠️ Requires `google/hear-pytorch` and the [HEAR](https://github.com/Google-Health/hear) library for audio preprocessing.
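HeAR's own preprocessing handles the spectrogram conversion; purely as a rough illustration of the 2-second / up-to-5-chunk windowing described in this card (and not HeAR's actual API), the waveform-level chunking could look like:

```python
import numpy as np

def chunk_waveform(wav, sr=16000, chunk_s=2, max_chunks=5):
    """Split a mono waveform into fixed 2-second chunks, zero-padding the
    last partial chunk, and return a validity mask (illustrative sketch)."""
    chunk_len = sr * chunk_s
    n = min(max_chunks, int(np.ceil(len(wav) / chunk_len)))
    chunks = np.zeros((n, chunk_len), dtype=np.float32)
    for i in range(n):
        seg = wav[i * chunk_len:(i + 1) * chunk_len]
        chunks[i, :len(seg)] = seg
    mask = np.ones(n, dtype=bool)
    return chunks, mask

wav = np.random.randn(7 * 16000).astype(np.float32)  # 7 s of audio at 16 kHz
chunks, mask = chunk_waveform(wav)
print(chunks.shape)  # (4, 32000): three full chunks + one zero-padded chunk
```

Each chunk then goes through HeAR preprocessing into a `[1, 192, 128]` log-mel spectrogram, giving the `[T, 1, 192, 128]` tensor noted above.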
---
## Limitations & Intended Use
- **Intended use**: Research and prototyping in respiratory sound analysis. **Not validated for clinical use.**
- The model was trained on auscultation recordings from SPRSound; performance may degrade on recordings from different stethoscope types, microphones, or patient populations.
- Binary classification only: the model does not distinguish between specific pathology types (e.g., wheeze vs. crackle).
- Threshold calibration was performed on the validation set; recalibration is recommended when deploying to new domains.
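The threshold calibration mentioned above can be sketched as a simple F1 sweep over validation-set probabilities. The grid, tie-breaking, and function name here are illustrative assumptions; the card states only that per-class thresholds were optimized for one-vs-rest F1.

```python
import numpy as np

def best_f1_threshold(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Sweep a threshold grid and keep the value maximising F1 for the
    positive ('abnormal') class -- illustrative calibration sketch."""
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = probs >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy validation probabilities for the 'abnormal' class
probs = np.array([0.1, 0.4, 0.6, 0.9])
labels = np.array([0, 1, 1, 1])
t, f1 = best_f1_threshold(probs, labels)
```

Rerunning this sweep on in-domain data from the target deployment is the recalibration step recommended above.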
---
## Citation
If you use this model, please cite the SPRSound dataset and the HeAR foundation model:
```bibtex
@misc{sprsound,
title = {SPRSound: Open-Source SJTU Paediatric Respiratory Sound Database},
year = {2022},
note = {BioCAS 2022–2025 challenge dataset}
}
@misc{hear2024,
title = {HeAR: Health Acoustic Representations},
author = {Google Health},
year = {2024},
url = {https://github.com/Google-Health/hear}
}
```
---
## License
This model is released under the **Apache 2.0** license. The HeAR backbone model is subject to Google's original license terms. SPRSound data is subject to its own terms; please refer to the dataset authors.