---
language: en
tags:
- audio
- audio-classification
- respiratory-sounds
- healthcare
- medical
- hear
- vit
- lora
- pytorch
license: apache-2.0
datasets:
- SPRSound
metrics:
- accuracy
- f1
- roc_auc
base_model: google/hear-pytorch
pipeline_tag: audio-classification
---

HeAR-SPRSound: Respiratory Sound Abnormality Classifier

Model Summary

A fine-tuned respiratory sound classifier built on top of Google's HeAR (Health Acoustic Representations) foundation model. The model performs binary classification — distinguishing normal from abnormal respiratory sounds — and is trained on the SPRSound dataset spanning BioCAS challenge years 2022–2025.

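The architecture combines the HeAR ViT backbone (fine-tuned with LoRA) with a Gated Attention Pooling layer, followed by a two-layer MLP classifier. Each recording is split into 2-second chunks, the backbone embeds every chunk, and the pooling layer combines the per-chunk embeddings with learned, length-masked attention weights, so recordings of different lengths map to a single fixed-size vector.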

Architecture

Audio Input (16 kHz WAV)
       ↓
HeAR Preprocessing (2-second chunks, log-mel spectrograms [1 × 192 × 128])
       ↓
HeAR ViT Encoder (google/hear-pytorch)
  └─ LoRA adapters on Q & V projections in last 6 transformer blocks
       ↓
Per-chunk CLS Embeddings [B × T × 512]
       ↓
Gated Attention Pooling (length-masked softmax attention over chunks)
       ↓
Pooled Representation [B × 512]
       ↓
MLP Classifier (512 → 256 → 2, GELU, Dropout 0.4)
       ↓
Normal / Abnormal

Key components:

Backbone: google/hear-pytorch (frozen except LoRA layers + LayerNorms)
LoRA: rank=16, alpha=16, dropout=0.3, applied to Q+V projections in last 6 blocks
Pooling: Gated Attention Pool (dual-path tanh × sigmoid gating, hidden dim 512)
Loss: Focal Loss (γ=2.0) with class-balanced sample weighting
Inference: Per-class threshold optimization (one-vs-rest F1 on validation set)
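
The pooling step can be written out as below. This is a sketch that assumes the standard gated-attention formulation; the symbols V, U and w and their exact shapes are assumptions and are not taken from the training code. Here h_1, ..., h_T are the 512-dimensional chunk embeddings of one recording, of which only the first L are real chunks and the rest are padding.

```latex
\begin{aligned}
e_t &= w^\top \big( \tanh(V h_t) \odot \sigma(U h_t) \big), \\
a_t &= \frac{\exp(e_t)}{\sum_{s=1}^{L} \exp(e_s)} \quad (t \le L), \qquad a_t = 0 \quad (t > L), \\
z   &= \sum_{t=1}^{T} a_t \, h_t \in \mathbb{R}^{512}.
\end{aligned}
```

Masking the padded positions before the softmax means a short recording is pooled only over its real chunks, and z is the pooled representation passed to the MLP classifier.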

Training Details

Hyperparameter	Value
Base model	`google/hear-pytorch`
Input sample rate	16,000 Hz
Chunk size	2 seconds (32,000 samples)
Max audio duration	10 seconds (up to 5 chunks)
Optimizer	AdamW
Learning rate	5e-5
Weight decay	0.2
Warmup epochs	10
Max epochs	100
Batch size	96
Early stopping patience	20 epochs
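
The chunking arithmetic in the table can be checked with a short sketch. The constants come from the table; truncating audio beyond 10 seconds and zero-padding the final partial chunk are assumptions about how the limits are enforced, and `split_into_chunks` is a hypothetical helper, not part of the HeAR library.

```python
SAMPLE_RATE = 16_000                            # Hz
CHUNK_SECONDS = 2
MAX_SECONDS = 10
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS     # 32,000 samples per chunk
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS         # 160,000 samples, i.e. 5 chunks


def split_into_chunks(waveform):
    """Split a mono waveform (sequence of floats) into fixed 2-second chunks.

    Assumption: audio beyond 10 s is dropped and the last partial chunk
    is right-padded with zeros.
    """
    samples = list(waveform[:MAX_SAMPLES])
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        chunk += [0.0] * (CHUNK_SAMPLES - len(chunk))   # pad the tail chunk
        chunks.append(chunk)
    return chunks
```

A 3-second clip yields two chunks (the second half-padded), and anything 10 seconds or longer yields exactly five.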

Dataset

SPRSound — multi-year BioCAS challenge respiratory auscultation dataset.

Year	Split
BioCAS 2022	Train + Inter/Intra test
BioCAS 2023	Test
BioCAS 2024	Test
BioCAS 2025	Test

All data was re-split at the patient level (70% train / 15% val / 15% test) to prevent data leakage. No patient appears in more than one split. Labels were consolidated to a binary scheme:

normal: all event annotations are "Normal"
abnormal: any non-normal respiratory event present (wheeze, crackle, rhonchus, etc.)
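
The split and labelling rules above can be sketched as follows. This is an illustration, not the original preprocessing code: the `patient_id` key, the function names, the seed, and the treatment of recordings with no event annotations are all assumptions.

```python
import random


def binary_label(event_labels):
    """Return "normal" only if every annotated event is "Normal"."""
    # Assumption: a recording with no event annotations counts as normal.
    return "normal" if all(e == "Normal" for e in event_labels) else "abnormal"


def split_by_patient(records, seed=0, train_frac=0.70, val_frac=0.15):
    """Assign whole patients to train/val/test so no patient spans two splits.

    `records` is a list of dicts with a hypothetical "patient_id" key.
    The fractions are applied to the number of patients, not recordings.
    """
    patients = sorted({r["patient_id"] for r in records})
    random.Random(seed).shuffle(patients)
    n_train = int(round(train_frac * len(patients)))
    n_val = int(round(val_frac * len(patients)))

    split_of = {}
    for i, pid in enumerate(patients):
        if i < n_train:
            split_of[pid] = "train"
        elif i < n_train + n_val:
            split_of[pid] = "val"
        else:
            split_of[pid] = "test"

    splits = {"train": [], "val": [], "test": []}
    for r in records:
        splits[split_of[r["patient_id"]]].append(r)
    return splits
```

Splitting on patient IDs rather than on recordings is what prevents leakage: all recordings from one patient land in the same split.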

Class imbalance was addressed through WeightedRandomSampler and Focal Loss.
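
As a rough illustration of the two mechanisms, the sketch below computes inverse-frequency sample weights and the focal loss for a single example. The function names are hypothetical, inverse class frequency is an assumed choice of weighting, and the code is plain Python rather than the training implementation.

```python
import math
from collections import Counter


def balanced_sample_weights(labels):
    """Per-sample weight = 1 / (number of samples in that sample's class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    `probs` are softmax probabilities and `target` is the true class index.
    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

Weights like these can be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)` so that each class is drawn about equally often, while the focal term down-weights examples the model already classifies confidently.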

Data Augmentation

A custom PhoneLikeAugment pipeline was applied during training (p=0.5) to simulate real-world acoustic variability:

Random gain (−18 to +8 dB)
Phone band-limiting (HP: 120–200 Hz, LP: 4–8 kHz)
Fast echo / room simulation (10–80 ms delay taps)
Colored noise addition (SNR 3–25 dB)
Soft AGC / tanh compression
Random time shift (±80 ms)
Rare clipping (p=0.15)
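
Two of these steps, random gain and noise at a target SNR, can be sketched in plain Python. This is not the PhoneLikeAugment implementation: the function names are hypothetical, white Gaussian noise stands in for colored noise, and applying p=0.5 to the pipeline as a whole (rather than to each step) is an assumption. Inputs are assumed to be non-empty lists of floats.

```python
import math
import random


def apply_gain_db(samples, gain_db):
    """Scale amplitude by a gain in decibels (20 * log10 convention)."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]


def add_noise_at_snr(samples, snr_db, rng):
    """Add white Gaussian noise scaled to reach the requested SNR in dB."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Pick scale so that signal_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = math.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(samples, noise)]


def augment(samples, rng, p=0.5):
    """With probability p apply random gain then noise; otherwise pass through."""
    if rng.random() >= p:
        return list(samples)
    out = apply_gain_db(samples, rng.uniform(-18.0, 8.0))
    return add_noise_at_snr(out, rng.uniform(3.0, 25.0), rng)
```

Passing an explicit `random.Random(seed)` as `rng` keeps the augmentation reproducible across runs.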

Usage

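import torch
import torchaudio
from transformers import AutoModel

# AdaptiveRespiratoryModel is this model's custom architecture class. It is not part of
# transformers and must be defined or imported before running this snippet.

# Load model
model = AdaptiveRespiratoryModel(
    num_classes=2,
    dropout=0.4,
    use_lora=True,
    lora_r=16,
    lora_alpha=16,
    lora_dropout=0.3,
    lora_last_n_blocks=6
)
# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
checkpoint = torch.load("best_model.pth", map_location="cpu", weights_only=False)
# strict=False skips missing or unexpected keys instead of raising an error.
model.load_state_dict(checkpoint["model"], strict=False)
model.eval()

# Audio must be 16 kHz, processed through HeAR's preprocess_audio
# into chunks of shape [T, 1, 192, 128]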

⚠️ Requires google/hear-pytorch and the HeAR library for audio preprocessing.

Limitations & Intended Use

Intended use: Research and prototyping in respiratory sound analysis. Not validated for clinical use.
The model was trained on auscultation recordings from SPRSound; performance may degrade on recordings from different stethoscope types, microphones, or patient populations.
Binary classification only — does not distinguish between specific pathology types (e.g., wheeze vs. crackle).
Threshold calibration was performed on the validation set; recalibration is recommended when deploying to new domains.
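
Recalibrating on a new domain could look like the sketch below, which grid-searches the cutoff on the abnormal-class score that maximizes F1 on held-out labels. It is a minimal illustration under assumptions: the function names, the 0.01 grid, and the use of a single threshold on the abnormal score are not taken from the original code.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, steps=99):
    """Return (threshold, f1) for the cutoff with the highest F1.

    `scores` are predicted probabilities of the abnormal class (label 1).
    Ties are resolved in favor of the lowest threshold.
    """
    best_t, best_f1 = 0.5, -1.0
    for i in range(1, steps + 1):
        t = i / (steps + 1)                 # 0.01, 0.02, ..., 0.99
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

The threshold should be chosen on a validation split from the new domain and then frozen before evaluating on that domain's test data.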

Citation

If you use this model, please cite the SPRSound dataset and the HeAR foundation model:

@misc{sprsound,
  title   = {SPRSound: Open-Source SJTU Paediatric Respiratory Sound Database},
  year    = {2022},
  note    = {BioCAS 2022–2025 challenge dataset}
}

@misc{hear2024,
  title   = {HeAR: Health Acoustic Representations},
  author  = {Google Health},
  year    = {2024},
  url     = {https://github.com/Google-Health/hear}
}

License

This model is released under the Apache 2.0 license. The HeAR backbone model is subject to Google's original license terms. SPRSound data is subject to its own terms — please refer to the dataset authors.
