---
language: en
tags:
- audio
- audio-classification
- respiratory-sounds
- healthcare
- medical
- hear
- vit
- lora
- pytorch
license: apache-2.0
datasets:
- SPRSound
metrics:
- accuracy
- f1
- roc_auc
base_model: google/hear-pytorch
pipeline_tag: audio-classification
---

HeAR-SPRSound: Respiratory Sound Abnormality Classifier

Model Summary

A fine-tuned respiratory sound classifier built on top of Google's HeAR (Health Acoustic Representations) foundation model. The model performs binary classification β€” distinguishing normal from abnormal respiratory sounds β€” and is trained on the SPRSound dataset spanning BioCAS challenge years 2022–2025.

The architecture combines the HeAR ViT backbone (fine-tuned with LoRA) with a Gated Attention Pooling layer that aggregates a variable-length sequence of per-chunk embeddings into a single recording-level representation, followed by a two-layer MLP classifier.


Architecture

Audio Input (16 kHz WAV)
       ↓
HeAR Preprocessing (2-second chunks, log-mel spectrograms [1 Γ— 192 Γ— 128])
       ↓
HeAR ViT Encoder (google/hear-pytorch)
  └─ LoRA adapters on Q & V projections in last 6 transformer blocks
       ↓
Per-chunk CLS Embeddings [B Γ— T Γ— 512]
       ↓
Gated Attention Pooling (length-masked softmax attention over chunks)
       ↓
Pooled Representation [B Γ— 512]
       ↓
MLP Classifier (512 β†’ 256 β†’ 2, GELU, Dropout 0.4)
       ↓
Normal / Abnormal

Key components:

  • Backbone: google/hear-pytorch (frozen except LoRA layers + LayerNorms)
  • LoRA: rank=16, alpha=16, dropout=0.3, applied to Q+V projections in last 6 blocks
  • Pooling: Gated Attention Pool (dual-path tanh Γ— sigmoid gating, hidden dim 512)
  • Loss: Focal Loss (Ξ³=2.0) with class-balanced sample weighting
  • Inference: Per-class threshold optimization (one-vs-rest F1 on validation set)
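The gated attention pooling described above can be sketched as follows. This is a minimal illustration assuming the dual-path tanh × sigmoid gating and length-masked softmax named in the list; the dimensions (512) follow the card, but the trained model's exact layer layout may differ.

```python
import torch
import torch.nn as nn

class GatedAttentionPool(nn.Module):
    """Length-masked gated attention pooling over per-chunk embeddings (sketch)."""

    def __init__(self, embed_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.v = nn.Linear(embed_dim, hidden_dim)  # tanh path
        self.u = nn.Linear(embed_dim, hidden_dim)  # sigmoid gate path
        self.w = nn.Linear(hidden_dim, 1)          # scalar attention score per chunk

    def forward(self, x: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # x: [B, T, D] chunk embeddings; lengths: [B] number of valid chunks
        scores = self.w(torch.tanh(self.v(x)) * torch.sigmoid(self.u(x)))  # [B, T, 1]
        # Mask padded chunks so they receive zero attention weight
        mask = torch.arange(x.size(1), device=x.device)[None, :] >= lengths[:, None]
        scores = scores.masked_fill(mask.unsqueeze(-1), float("-inf"))
        attn = torch.softmax(scores, dim=1)        # [B, T, 1], sums to 1 over chunks
        return (attn * x).sum(dim=1)               # [B, D] pooled representation

pool = GatedAttentionPool()
x = torch.randn(2, 5, 512)            # batch of 2, up to 5 chunks each
lengths = torch.tensor([5, 3])        # second recording has only 3 valid chunks
out = pool(x, lengths)
print(out.shape)  # torch.Size([2, 512])
```

The length mask is what lets the model handle recordings shorter than the 10-second maximum without the zero-padded chunks diluting the pooled representation.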

Training Details

| Hyperparameter | Value |
|---|---|
| Base model | google/hear-pytorch |
| Input sample rate | 16,000 Hz |
| Chunk size | 2 seconds (32,000 samples) |
| Max audio duration | 10 seconds (up to 5 chunks) |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| Weight decay | 0.2 |
| Warmup epochs | 10 |
| Max epochs | 100 |
| Batch size | 96 |
| Early stopping patience | 20 epochs |
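The chunking implied by the table (2-second chunks at 16 kHz, capped at 5 chunks) can be sketched as below. The helper name is illustrative, not the repository's actual function:

```python
import torch

def chunk_waveform(wav: torch.Tensor,
                   chunk_samples: int = 32_000,  # 2 s at 16 kHz
                   max_chunks: int = 5):
    """Split a mono waveform [N] into zero-padded chunks [T, chunk_samples].

    Illustrative sketch; returns the chunk tensor and the valid-chunk count
    used by the length-masked pooling.
    """
    wav = wav[: chunk_samples * max_chunks]              # cap at 10 seconds
    n_chunks = max(1, -(-wav.numel() // chunk_samples))  # ceil division
    padded = torch.zeros(n_chunks * chunk_samples)
    padded[: wav.numel()] = wav                          # zero-pad the last chunk
    return padded.view(n_chunks, chunk_samples), n_chunks

chunks, n = chunk_waveform(torch.randn(48_000))  # 3 s of audio -> 2 chunks
print(chunks.shape, n)  # torch.Size([2, 32000]) 2
```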

Dataset

SPRSound β€” multi-year BioCAS challenge respiratory auscultation dataset.

| Year | Split |
|---|---|
| BioCAS 2022 | Train + Inter/Intra test |
| BioCAS 2023 | Test |
| BioCAS 2024 | Test |
| BioCAS 2025 | Test |

All data was re-split at the patient level (70% train / 15% val / 15% test) to prevent data leakage. No patient appears in more than one split. Labels were consolidated to a binary scheme:

  • normal: all event annotations are "Normal"
  • abnormal: any non-normal respiratory event present (wheeze, crackle, rhonchus, etc.)
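A patient-level 70/15/15 split like the one described above can be sketched with scikit-learn's GroupShuffleSplit; the patient IDs below are synthetic placeholders:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 50 patients, 4 recordings each (names illustrative)
patient_ids = np.array([f"p{i // 4}" for i in range(200)])
indices = np.arange(len(patient_ids))

# Carve off 70% of patients for training...
gss = GroupShuffleSplit(n_splits=1, train_size=0.70, random_state=0)
train_idx, rest_idx = next(gss.split(indices, groups=patient_ids))

# ...then split the remaining patients 50/50 into validation and test.
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.50, random_state=0)
val_rel, test_rel = next(gss2.split(rest_idx, groups=patient_ids[rest_idx]))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]

# Grouping by patient guarantees no patient spans two splits
assert not set(patient_ids[train_idx]) & set(patient_ids[val_idx])
assert not set(patient_ids[val_idx]) & set(patient_ids[test_idx])
```

Splitting on recordings instead of patients would leak acoustically near-identical samples from the same patient across splits and inflate test scores.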

Class imbalance was addressed through WeightedRandomSampler and Focal Loss.


Data Augmentation

A custom PhoneLikeAugment pipeline was applied during training (p=0.5) to simulate real-world acoustic variability:

  • Random gain (βˆ’18 to +8 dB)
  • Phone band-limiting (HP: 120–200 Hz, LP: 4–8 kHz)
  • Fast echo / room simulation (10–80 ms delay taps)
  • Colored noise addition (SNR 3–25 dB)
  • Soft AGC / tanh compression
  • Random time shift (Β±80 ms)
  • Rare clipping (p=0.15)

Usage

import torch

# AdaptiveRespiratoryModel is the custom wrapper defined in this repository's
# training code (HeAR ViT backbone + LoRA adapters + Gated Attention Pooling
# + MLP head); it is not part of the transformers library.
model = AdaptiveRespiratoryModel(
    num_classes=2,
    dropout=0.4,
    use_lora=True,
    lora_r=16,
    lora_alpha=16,
    lora_dropout=0.3,
    lora_last_n_blocks=6,
)
checkpoint = torch.load("best_model.pth", map_location="cpu", weights_only=False)
model.load_state_dict(checkpoint["model"], strict=False)
model.eval()

# Audio must be 16 kHz, processed through HeAR's preprocess_audio
# into chunks of shape [T, 1, 192, 128]

⚠️ Requires google/hear-pytorch and the HeAR library for audio preprocessing.


Limitations & Intended Use

  • Intended use: Research and prototyping in respiratory sound analysis. Not validated for clinical use.
  • The model was trained on auscultation recordings from SPRSound; performance may degrade on recordings from different stethoscope types, microphones, or patient populations.
  • Binary classification only β€” does not distinguish between specific pathology types (e.g., wheeze vs. crackle).
  • Threshold calibration was performed on the validation set; recalibration is recommended when deploying to new domains.
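The threshold recalibration recommended above can be sketched as a simple F1 sweep over candidate thresholds on held-out validation scores; the data below is synthetic:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(probs: np.ndarray, labels: np.ndarray) -> float:
    """Pick the probability threshold maximising F1 for the positive class (sketch)."""
    candidates = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(labels, probs >= t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])

# Toy validation set: abnormal-class probabilities separated around 0.5
rng = np.random.default_rng(0)
probs = np.concatenate([rng.uniform(0.0, 0.5, 80), rng.uniform(0.55, 1.0, 20)])
labels = np.concatenate([np.zeros(80), np.ones(20)])
t = best_threshold(probs, labels)
```

Re-running this sweep on in-domain validation data is cheap and guards against the decision threshold drifting when recording conditions change.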

Citation

If you use this model, please cite the SPRSound dataset and the HeAR foundation model:

@misc{sprsound,
  title   = {SPRSound: Open-Source SJTU Paediatric Respiratory Sound Database},
  year    = {2022},
  note    = {BioCAS 2022–2025 challenge dataset}
}

@misc{hear2024,
  title   = {HeAR: Health Acoustic Representations},
  author  = {Google Health},
  year    = {2024},
  url     = {https://github.com/Google-Health/hear}
}

License

This model is released under the Apache 2.0 license. The HeAR backbone model is subject to Google's original license terms. SPRSound data is subject to its own terms β€” please refer to the dataset authors.
