| --- |
| language: en |
| tags: |
| - audio |
| - audio-classification |
| - respiratory-sounds |
| - healthcare |
| - medical |
| - hear |
| - vit |
| - lora |
| - pytorch |
| license: apache-2.0 |
| datasets: |
| - SPRSound |
| metrics: |
| - accuracy |
| - f1 |
| - roc_auc |
| base_model: google/hear-pytorch |
| pipeline_tag: audio-classification |
| --- |
| |
| # HeAR-SPRSound: Respiratory Sound Abnormality Classifier |
| |
| ## Model Summary |
| |
A fine-tuned respiratory sound classifier built on **Google's HeAR** (Health Acoustic Representations) foundation model. The model performs **binary classification**, distinguishing **normal** from **abnormal** respiratory sounds, and is trained on the **SPRSound** dataset spanning BioCAS challenge years 2022–2025.
| |
The architecture combines the HeAR ViT backbone (fine-tuned with LoRA) with a **Gated Attention Pooling** layer that aggregates a variable number of per-chunk embeddings into a single recording-level representation, followed by a two-layer MLP classifier.
| |
| --- |
| |
| ## Architecture |
| |
| ``` |
| Audio Input (16 kHz WAV) |
| β |
| HeAR Preprocessing (2-second chunks, log-mel spectrograms [1 Γ 192 Γ 128]) |
| β |
| HeAR ViT Encoder (google/hear-pytorch) |
| ββ LoRA adapters on Q & V projections in last 6 transformer blocks |
| β |
| Per-chunk CLS Embeddings [B Γ T Γ 512] |
| β |
| Gated Attention Pooling (length-masked softmax attention over chunks) |
| β |
| Pooled Representation [B Γ 512] |
| β |
| MLP Classifier (512 β 256 β 2, GELU, Dropout 0.4) |
| β |
| Normal / Abnormal |
| ``` |
| |
| **Key components:** |
| - **Backbone**: `google/hear-pytorch` (frozen except LoRA layers + LayerNorms) |
| - **LoRA**: rank=16, alpha=16, dropout=0.3, applied to Q+V projections in last 6 blocks |
- **Pooling**: Gated Attention Pool (dual-path tanh × sigmoid gating, hidden dim 512)
- **Loss**: Focal Loss (γ=2.0) with class-balanced sample weighting
| - **Inference**: Per-class threshold optimization (one-vs-rest F1 on validation set) |
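
The focal loss with γ=2.0 can be sketched as below; the optional per-class `weight` argument stands in for the class-balanced weighting, whose exact form is an assumption here:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0, weight=None):
    """Focal loss: down-weights easy examples by (1 - p_t)^gamma.

    logits: [B, C], targets: [B] class indices,
    weight: optional per-class weights [C].
    """
    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets, weight=weight, reduction="none")  # [B]
    p_t = log_p.gather(1, targets[:, None]).squeeze(1).exp()          # prob of true class
    return ((1.0 - p_t) ** gamma * ce).mean()

logits = torch.tensor([[2.0, -1.0], [0.1, 0.3]])
targets = torch.tensor([0, 1])
loss = focal_loss(logits, targets)
```

With `gamma=0` and no weights this reduces exactly to standard cross-entropy; raising γ shifts the gradient budget toward hard, misclassified examples.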
| |
| --- |
| |
| ## Training Details |
| |
| | Hyperparameter | Value | |
| |---|---| |
| | Base model | `google/hear-pytorch` | |
| | Input sample rate | 16,000 Hz | |
| | Chunk size | 2 seconds (32,000 samples) | |
| | Max audio duration | 10 seconds (up to 5 chunks) | |
| | Optimizer | AdamW | |
| | Learning rate | 5e-5 | |
| | Weight decay | 0.2 | |
| | Warmup epochs | 10 | |
| | Max epochs | 100 | |
| | Batch size | 96 | |
| | Early stopping patience | 20 epochs | |
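
The optimizer settings in the table can be reproduced roughly as follows. The linear warmup shape and the constant post-warmup schedule are assumptions; the card only states a 10-epoch warmup:

```python
import torch

model = torch.nn.Linear(512, 2)  # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.2)

WARMUP_EPOCHS = 10

def lr_lambda(epoch: int) -> float:
    # Linear warmup over the first 10 epochs, then hold the base LR.
    return min(1.0, (epoch + 1) / WARMUP_EPOCHS)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```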
| |
| --- |
| |
| ## Dataset |
| |
**SPRSound**: multi-year BioCAS challenge respiratory auscultation dataset.
| |
| | Year | Split | |
| |---|---| |
| | BioCAS 2022 | Train + Inter/Intra test | |
| | BioCAS 2023 | Test | |
| | BioCAS 2024 | Test | |
| | BioCAS 2025 | Test | |
| |
| All data was **re-split at the patient level** (70% train / 15% val / 15% test) to prevent data leakage. No patient appears in more than one split. Labels were consolidated to a binary scheme: |
| |
| - **normal**: all event annotations are "Normal" |
| - **abnormal**: any non-normal respiratory event present (wheeze, crackle, rhonchus, etc.) |
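
Under this scheme, consolidation reduces to a one-liner; the annotation strings shown are illustrative, not the dataset's exact serialization:

```python
def consolidate_label(event_annotations: list[str]) -> str:
    """Map a recording's event annotations to a binary label.

    A recording is 'normal' only if every event is annotated 'Normal';
    any other event (wheeze, crackle, rhonchus, ...) makes it 'abnormal'.
    """
    return "normal" if all(e == "Normal" for e in event_annotations) else "abnormal"

print(consolidate_label(["Normal", "Normal"]))  # normal
print(consolidate_label(["Normal", "Wheeze"]))  # abnormal
```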
| |
| Class imbalance was addressed through `WeightedRandomSampler` and Focal Loss. |
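
A minimal sketch of the sampling side, using PyTorch's `WeightedRandomSampler` with inverse-frequency per-sample weights (the exact weighting used in training is an assumption):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])    # imbalanced: 6 normal, 2 abnormal
class_counts = torch.bincount(labels).float()
sample_weights = (1.0 / class_counts)[labels]       # inverse class frequency per sample

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(labels), batch_size=4, sampler=sampler)
```

Sampling with replacement under these weights yields roughly class-balanced batches despite the 3:1 imbalance.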
| |
| --- |
| |
| ## Data Augmentation |
| |
| A custom `PhoneLikeAugment` pipeline was applied during training (p=0.5) to simulate real-world acoustic variability: |
| |
- Random gain (−18 to +8 dB)
- Phone band-limiting (HP: 120–200 Hz, LP: 4–8 kHz)
- Fast echo / room simulation (10–80 ms delay taps)
- Colored noise addition (SNR 3–25 dB)
- Soft AGC / tanh compression
- Random time shift (±80 ms)
| - Rare clipping (p=0.15) |
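
The actual `PhoneLikeAugment` pipeline is custom and not reproduced here; as an illustration, two of its stages (random gain and noise at a target SNR) can be sketched in plain PyTorch, with white noise standing in for the colored-noise stage:

```python
import torch

def random_gain(wav: torch.Tensor, low_db: float = -18.0, high_db: float = 8.0) -> torch.Tensor:
    """Scale the waveform by a gain drawn uniformly in dB."""
    gain_db = torch.empty(1).uniform_(low_db, high_db)
    return wav * (10.0 ** (gain_db / 20.0))

def add_noise_at_snr(wav: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Add noise scaled so the resulting signal-to-noise ratio is snr_db."""
    noise = torch.randn_like(wav)
    sig_pow = wav.pow(2).mean()
    noise_pow = noise.pow(2).mean()
    scale = torch.sqrt(sig_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return wav + scale * noise

wav = torch.randn(32000)  # one 2-second chunk at 16 kHz
out = add_noise_at_snr(random_gain(wav), snr_db=10.0)
```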
| |
| --- |
| |
| ## Usage |
| |
```python
import torch

# AdaptiveRespiratoryModel (HeAR ViT backbone + Gated Attention Pooling + MLP head)
# is defined in this repository's model code; import it from there.
model = AdaptiveRespiratoryModel(
    num_classes=2,
    dropout=0.4,
    use_lora=True,
    lora_r=16,
    lora_alpha=16,
    lora_dropout=0.3,
    lora_last_n_blocks=6,
)
checkpoint = torch.load("best_model.pth", map_location="cpu", weights_only=False)
model.load_state_dict(checkpoint["model"], strict=False)
model.eval()

# Audio must be 16 kHz mono, processed through HeAR's preprocess_audio
# into chunks of shape [T, 1, 192, 128] before the forward pass.
```
| |
> ⚠️ Requires `google/hear-pytorch` and the [HEAR](https://github.com/Google-Health/hear) library for audio preprocessing.
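
Splitting a raw waveform into the model's 2-second windows (with the 10-second / 5-chunk cap from the training table) can be sketched with a hypothetical helper; HeAR's own `preprocess_audio` must still be applied to each chunk afterwards:

```python
import torch

def chunk_audio(wav: torch.Tensor, sr: int = 16000,
                chunk_s: float = 2.0, max_chunks: int = 5) -> torch.Tensor:
    """Split a mono waveform [N] into fixed 2 s chunks [T, chunk_len],
    zero-padding the last partial chunk."""
    chunk_len = int(sr * chunk_s)
    wav = wav[: chunk_len * max_chunks]          # cap at 10 s
    n = -(-wav.numel() // chunk_len)             # ceil division
    padded = torch.zeros(n * chunk_len)
    padded[: wav.numel()] = wav
    return padded.view(n, chunk_len)

chunks = chunk_audio(torch.randn(7 * 16000))     # 7 s of audio -> 4 chunks
print(chunks.shape)  # torch.Size([4, 32000])
```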
| |
| --- |
| |
| ## Limitations & Intended Use |
| |
| - **Intended use**: Research and prototyping in respiratory sound analysis. **Not validated for clinical use.** |
| - The model was trained on auscultation recordings from SPRSound; performance may degrade on recordings from different stethoscope types, microphones, or patient populations. |
- Binary classification only: the model does not distinguish between specific pathology types (e.g., wheeze vs. crackle).
| - Threshold calibration was performed on the validation set; recalibration is recommended when deploying to new domains. |
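
Recalibration can be as simple as sweeping candidate thresholds for the best F1 on held-out scores. A minimal sketch, with hypothetical validation scores and "abnormal" as the positive class:

```python
import numpy as np

def best_f1_threshold(scores: np.ndarray, labels: np.ndarray):
    """Sweep candidate thresholds and return the one maximizing F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7])  # hypothetical abnormal-class scores
labels = np.array([0, 0, 1, 1, 1])
t, f1 = best_f1_threshold(scores, labels)
```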
| |
| --- |
| |
| ## Citation |
| |
| If you use this model, please cite the SPRSound dataset and the HeAR foundation model: |
| |
| ```bibtex |
| @misc{sprsound, |
| title = {SPRSound: Open-Source SJTU Paediatric Respiratory Sound Database}, |
| year = {2022}, |
  note = {BioCAS 2022–2025 challenge dataset}
| } |
| |
| @misc{hear2024, |
| title = {HeAR: Health Acoustic Representations}, |
| author = {Google Health}, |
| year = {2024}, |
| url = {https://github.com/Google-Health/hear} |
| } |
| ``` |
| |
| --- |
| |
| ## License |
| |
This model is released under the **Apache 2.0** license. The HeAR backbone model is subject to Google's original license terms. SPRSound data is subject to its own terms; please refer to the dataset authors.