---
language: en
tags:
  - audio
  - audio-classification
  - respiratory-sounds
  - healthcare
  - medical
  - hear
  - vit
  - lora
  - pytorch
license: apache-2.0
datasets:
  - SPRSound
metrics:
  - accuracy
  - f1
  - roc_auc
base_model: google/hear-pytorch
pipeline_tag: audio-classification
---

# HeAR-SPRSound: Respiratory Sound Abnormality Classifier

## Model Summary

A fine-tuned respiratory sound classifier built on top of **Google's HeAR** (Health Acoustic Representations) foundation model. The model performs **binary classification**, distinguishing **normal** from **abnormal** respiratory sounds, and is trained on the **SPRSound** dataset spanning BioCAS challenge years 2022–2025.

The architecture combines the HeAR ViT backbone (fine-tuned with LoRA) with a **Gated Attention Pooling** layer that aggregates a variable number of per-chunk embeddings using learned attention weights, followed by a two-layer MLP classifier.

---

## Architecture

```
Audio Input (16 kHz WAV)
       ↓
HeAR Preprocessing (2-second chunks, log-mel spectrograms [1 × 192 × 128])
       ↓
HeAR ViT Encoder (google/hear-pytorch)
  └─ LoRA adapters on Q & V projections in last 6 transformer blocks
       ↓
Per-chunk CLS Embeddings [B × T × 512]
       ↓
Gated Attention Pooling (length-masked softmax attention over chunks)
       ↓
Pooled Representation [B × 512]
       ↓
MLP Classifier (512 → 256 → 2, GELU, Dropout 0.4)
       ↓
Normal / Abnormal
```
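The 2-second chunking step above can be sketched as follows. Note that `chunk_audio` is a hypothetical helper shown for illustration; the HeAR library ships its own `preprocess_audio`, which additionally computes the log-mel spectrograms:

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 2 * SAMPLE_RATE      # 2-second chunks (32,000 samples)
MAX_CHUNKS = 5                       # 10-second cap from the training setup

def chunk_audio(waveform: np.ndarray) -> np.ndarray:
    """Split a mono 16 kHz waveform into zero-padded 2 s chunks of shape [T, 32000]."""
    waveform = waveform[: MAX_CHUNKS * CHUNK_SAMPLES]           # cap at 10 seconds
    n_chunks = int(np.ceil(len(waveform) / CHUNK_SAMPLES))
    padded = np.zeros(n_chunks * CHUNK_SAMPLES, dtype=np.float32)
    padded[: len(waveform)] = waveform
    return padded.reshape(n_chunks, CHUNK_SAMPLES)

# 5 s of audio yields 3 chunks; the last one is zero-padded to 2 s
chunks = chunk_audio(np.random.randn(5 * SAMPLE_RATE).astype(np.float32))
```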

**Key components:**
- **Backbone**: `google/hear-pytorch` (frozen except LoRA layers + LayerNorms)
- **LoRA**: rank=16, alpha=16, dropout=0.3, applied to Q+V projections in last 6 blocks
- **Pooling**: Gated Attention Pool (dual-path tanh × sigmoid gating, hidden dim 512)
- **Loss**: Focal Loss (γ=2.0) with class-balanced sample weighting
- **Inference**: Per-class threshold optimization (one-vs-rest F1 on validation set)
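The gated attention pooling can be sketched as below. This is a reconstruction from the description above (dual-path tanh × sigmoid gating with a length mask), not the repository's exact code:

```python
import torch
import torch.nn as nn

class GatedAttentionPool(nn.Module):
    """Length-masked attention pooling over per-chunk embeddings [B, T, D]."""
    def __init__(self, dim: int = 512, hidden: int = 512):
        super().__init__()
        self.v = nn.Linear(dim, hidden)   # content path (tanh)
        self.u = nn.Linear(dim, hidden)   # gate path (sigmoid)
        self.w = nn.Linear(hidden, 1)     # scalar attention score per chunk

    def forward(self, x: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        scores = self.w(torch.tanh(self.v(x)) * torch.sigmoid(self.u(x)))  # [B, T, 1]
        # Mask positions beyond each sequence's true length with -inf
        mask = torch.arange(x.size(1), device=x.device)[None, :] >= lengths[:, None]
        scores = scores.masked_fill(mask.unsqueeze(-1), float("-inf"))
        attn = torch.softmax(scores, dim=1)       # padded chunks get weight 0
        return (attn * x).sum(dim=1)              # [B, D]

pool = GatedAttentionPool(dim=512)
pooled = pool(torch.randn(2, 5, 512), torch.tensor([5, 3]))
```

Because the softmax is length-masked, the content of padded chunks has no effect on the pooled representation.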

---

## Training Details

| Hyperparameter | Value |
|---|---|
| Base model | `google/hear-pytorch` |
| Input sample rate | 16,000 Hz |
| Chunk size | 2 seconds (32,000 samples) |
| Max audio duration | 10 seconds (up to 5 chunks) |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| Weight decay | 0.2 |
| Warmup epochs | 10 |
| Max epochs | 100 |
| Batch size | 96 |
| Early stopping patience | 20 epochs |
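The Focal Loss (γ = 2.0) listed under Key components can be sketched as below; how the class-balanced sample weights are combined with it is an assumption, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, weights=None):
    """Focal loss: cross-entropy down-weighted by (1 - p_t)^gamma for easy examples."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if weights is not None:            # optional class-balanced per-sample weights
        loss = loss * weights
    return loss.mean()

logits = torch.tensor([[2.0, -1.0], [0.1, 0.3]])
targets = torch.tensor([0, 1])
loss = focal_loss(logits, targets)
```

With γ = 0 the focusing term vanishes and the expression reduces to standard cross-entropy, which is a useful sanity check.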

---

## Dataset

**SPRSound** is a multi-year BioCAS challenge respiratory auscultation dataset.

| Challenge year | Role in this work |
|---|---|
| BioCAS 2022 | Train + Inter/Intra test |
| BioCAS 2023 | Test |
| BioCAS 2024 | Test |
| BioCAS 2025 | Test |

All data was **re-split at the patient level** (70% train / 15% val / 15% test) to prevent data leakage. No patient appears in more than one split. Labels were consolidated to a binary scheme:

- **normal**: all event annotations are "Normal"
- **abnormal**: any non-normal respiratory event present (wheeze, crackle, rhonchus, etc.)

Class imbalance was addressed through `WeightedRandomSampler` and Focal Loss.
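The patient-level re-split can be sketched as below, assuming a list of `(patient_id, recording)` pairs; the actual pipeline may differ in details:

```python
import random

def patient_level_split(records, seed=42, train=0.70, val=0.15):
    """Split records by patient so no patient appears in more than one split.

    records: list of (patient_id, recording_path) tuples.
    Returns a dict with "train" / "val" / "test" record lists.
    """
    patients = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_train = int(len(patients) * train)
    n_val = int(len(patients) * val)
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    return {name: [r for r in records if r[0] in pids]
            for name, pids in groups.items()}

# Hypothetical example: 20 patients with 5 recordings each
records = [(f"p{i % 20}", f"rec_{i}.wav") for i in range(100)]
splits = patient_level_split(records)
```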

---

## Data Augmentation

A custom `PhoneLikeAugment` pipeline was applied during training (p=0.5) to simulate real-world acoustic variability:

- Random gain (−18 to +8 dB)
- Phone band-limiting (HP: 120–200 Hz, LP: 4–8 kHz)
- Fast echo / room simulation (10–80 ms delay taps)
- Colored noise addition (SNR 3–25 dB)
- Soft AGC / tanh compression
- Random time shift (±80 ms)
- Rare clipping (p=0.15)
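Two of the simpler steps above (random gain and random time shift) can be sketched with NumPy. The parameter ranges follow the list, but the actual `PhoneLikeAugment` implementation is not reproduced here:

```python
import numpy as np

SR = 16_000

def random_gain(x, rng, lo_db=-18.0, hi_db=8.0):
    """Scale the waveform by a random gain drawn uniformly in dB."""
    gain_db = rng.uniform(lo_db, hi_db)
    return x * (10.0 ** (gain_db / 20.0))

def random_time_shift(x, rng, max_shift_s=0.08):
    """Shift the waveform by up to +/-80 ms, zero-padding the vacated end."""
    max_shift = int(max_shift_s * SR)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.zeros_like(x)
    if shift >= 0:
        out[shift:] = x[:len(x) - shift]
    else:
        out[:shift] = x[-shift:]
    return out

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(SR) / SR).astype(np.float32)
y = random_gain(random_time_shift(x, rng), rng)
```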

---

## Usage

```python
import torch

# AdaptiveRespiratoryModel is the model class shipped with this repository's code;
# instantiate it with the same hyperparameters used for training
model = AdaptiveRespiratoryModel(
    num_classes=2,
    dropout=0.4,
    use_lora=True,
    lora_r=16,
    lora_alpha=16,
    lora_dropout=0.3,
    lora_last_n_blocks=6
)
checkpoint = torch.load("best_model.pth", map_location="cpu", weights_only=False)
model.load_state_dict(checkpoint["model"], strict=False)
model.eval()

# Audio must be 16 kHz, processed through HeAR's preprocess_audio
# into chunks of shape [T, 1, 192, 128]
```

> ⚠️ Requires `google/hear-pytorch` and the [HEAR](https://github.com/Google-Health/hear) library for audio preprocessing.
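The per-class threshold optimization mentioned under Key components can be sketched as a simple grid search maximizing F1 on validation scores. This is an illustrative sketch, not the repository's exact procedure:

```python
import numpy as np

def best_threshold(y_true, probs, grid=None):
    """Pick the decision threshold maximizing F1 for one class (one-vs-rest)."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = (probs >= t).astype(int)
        tp = int(((pred == 1) & (y_true == 1)).sum())
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy validation scores for the "abnormal" class
y = np.array([0, 0, 1, 1, 1])
p = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
t, f1 = best_threshold(y, p)
```

As the Limitations section notes, the same search should be re-run on in-domain validation data when deploying to a new recording setup.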

---

## Limitations & Intended Use

- **Intended use**: Research and prototyping in respiratory sound analysis. **Not validated for clinical use.**
- The model was trained on auscultation recordings from SPRSound; performance may degrade on recordings from different stethoscope types, microphones, or patient populations.
- Binary classification only; the model does not distinguish between specific pathology types (e.g., wheeze vs. crackle).
- Threshold calibration was performed on the validation set; recalibration is recommended when deploying to new domains.

---

## Citation

If you use this model, please cite the SPRSound dataset and the HeAR foundation model:

```bibtex
@misc{sprsound,
  title   = {SPRSound: Open-Source SJTU Paediatric Respiratory Sound Database},
  year    = {2022},
  note    = {BioCAS 2022–2025 challenge dataset}
}

@misc{hear2024,
  title   = {HeAR: Health Acoustic Representations},
  author  = {Google Health},
  year    = {2024},
  url     = {https://github.com/Google-Health/hear}
}
```

---

## License

This model is released under the **Apache 2.0** license. The HeAR backbone model is subject to Google's original license terms. SPRSound data is subject to its own terms; please refer to the dataset authors.