---
license: apache-2.0
language:
- en
tags:
- audio
- audio-classification
- respiratory-sounds
- healthcare
- medical
- hear
- vit
- lora
- pytorch
datasets:
- SPRSound
metrics:
- accuracy
- f1
- roc_auc
base_model: google/hear-pytorch
pipeline_tag: audio-classification
---

# HeAR-SPRSound: Respiratory Sound Abnormality Classifier

## Model Summary

A fine-tuned respiratory sound classifier built on top of **Google's HeAR** (Health Acoustic Representations) foundation model. The model performs **binary classification** — distinguishing **normal** from **abnormal** respiratory sounds — and is trained on the **SPRSound** dataset spanning BioCAS challenge years 2022–2025.

The architecture combines the HeAR ViT backbone (fine-tuned with LoRA) with a **Gated Attention Pooling** layer that aggregates the variable-length sequence of per-chunk embeddings, followed by a two-layer MLP classifier.

---

## Architecture

```
Audio Input (16 kHz WAV)
        ↓
HeAR Preprocessing (2-second chunks, log-mel spectrograms [1 × 192 × 128])
        ↓
HeAR ViT Encoder (google/hear-pytorch)
  └─ LoRA adapters on Q & V projections in last 6 transformer blocks
        ↓
Per-chunk CLS Embeddings [B × T × 512]
        ↓
Gated Attention Pooling (length-masked softmax attention over chunks)
        ↓
Pooled Representation [B × 512]
        ↓
MLP Classifier (512 → 256 → 2, GELU, Dropout 0.4)
        ↓
Normal / Abnormal
```

**Key components:**

- **Backbone**: `google/hear-pytorch` (frozen except LoRA layers + LayerNorms)
- **LoRA**: rank=16, alpha=16, dropout=0.3, applied to Q+V projections in last 6 blocks
- **Pooling**: Gated Attention Pool (dual-path tanh × sigmoid gating, hidden dim 512)
- **Loss**: Focal Loss (γ=2.0) with class-balanced sample weighting
- **Inference**: Per-class threshold optimization (one-vs-rest F1 on validation set)

---

## Training Details

| Hyperparameter | Value |
|---|---|
| Base model | `google/hear-pytorch` |
| Input sample rate | 16,000 Hz |
| Chunk size | 2 seconds (32,000 samples) |
| Max audio duration | 10 seconds (up to 5 chunks) |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| Weight decay | 0.2 |
| Warmup epochs | 10 |
| Max epochs | 100 |
| Batch size | 96 |
| Early stopping patience | 20 epochs |

---

## Dataset

**SPRSound** — multi-year BioCAS challenge respiratory auscultation dataset.

| Year | Split |
|---|---|
| BioCAS 2022 | Train + Inter/Intra test |
| BioCAS 2023 | Test |
| BioCAS 2024 | Test |
| BioCAS 2025 | Test |

All data was **re-split at the patient level** (70% train / 15% val / 15% test) to prevent data leakage. No patient appears in more than one split.

Labels were consolidated to a binary scheme:

- **normal**: all event annotations are "Normal"
- **abnormal**: any non-normal respiratory event present (wheeze, crackle, rhonchus, etc.)

Class imbalance was addressed through `WeightedRandomSampler` and Focal Loss.

---

## Data Augmentation

A custom `PhoneLikeAugment` pipeline was applied during training (p=0.5) to simulate real-world acoustic variability:

- Random gain (−18 to +8 dB)
- Phone band-limiting (HP: 120–200 Hz, LP: 4–8 kHz)
- Fast echo / room simulation (10–80 ms delay taps)
- Colored noise addition (SNR 3–25 dB)
- Soft AGC / tanh compression
- Random time shift (±80 ms)
- Rare clipping (p=0.15)

---

## Usage

```python
import torch
import torchaudio  # used to load 16 kHz WAV input

# AdaptiveRespiratoryModel is defined in this repository's training code
model = AdaptiveRespiratoryModel(
    num_classes=2,
    dropout=0.4,
    use_lora=True,
    lora_r=16,
    lora_alpha=16,
    lora_dropout=0.3,
    lora_last_n_blocks=6,
)
checkpoint = torch.load("best_model.pth", map_location="cpu", weights_only=False)
model.load_state_dict(checkpoint["model"], strict=False)
model.eval()

# Audio must be 16 kHz, processed through HeAR's preprocess_audio
# into chunks of shape [T, 1, 192, 128]
```

> ⚠️ Requires `google/hear-pytorch` and the [HEAR](https://github.com/Google-Health/hear) library for audio preprocessing.
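The gated attention pooling step described in the Architecture section can be sketched as below. This is an illustrative reconstruction from the description (dual-path tanh × sigmoid gating with length-masked softmax over chunks); the layer names, hidden dimension layout, and interface are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class GatedAttentionPool(nn.Module):
    """Length-masked gated attention pooling over per-chunk embeddings.

    Sketch only: aggregates [B, T, dim] chunk embeddings into [B, dim],
    ignoring padded chunks beyond each clip's true length.
    """

    def __init__(self, dim: int = 512, hidden: int = 512):
        super().__init__()
        self.v = nn.Linear(dim, hidden)  # tanh path
        self.u = nn.Linear(dim, hidden)  # sigmoid gate path
        self.w = nn.Linear(hidden, 1)    # attention logits

    def forward(self, x: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # x: [B, T, dim]; lengths: [B] number of valid chunks per clip
        gated = torch.tanh(self.v(x)) * torch.sigmoid(self.u(x))   # [B, T, hidden]
        scores = self.w(gated).squeeze(-1)                         # [B, T]
        # Mask out padded chunks so they get zero attention weight
        mask = torch.arange(x.size(1), device=x.device)[None, :] < lengths[:, None]
        scores = scores.masked_fill(~mask, float("-inf"))
        attn = torch.softmax(scores, dim=1)                        # [B, T]
        return (attn.unsqueeze(-1) * x).sum(dim=1)                 # [B, dim]
```

Because attention is computed per chunk and then masked, a clip padded from 3 to 5 chunks produces the same pooled vector regardless of what the padding contains — which is the property that lets variable-length recordings share a batch.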
---

## Limitations & Intended Use

- **Intended use**: Research and prototyping in respiratory sound analysis. **Not validated for clinical use.**
- The model was trained on auscultation recordings from SPRSound; performance may degrade on recordings from different stethoscope types, microphones, or patient populations.
- Binary classification only — does not distinguish between specific pathology types (e.g., wheeze vs. crackle).
- Threshold calibration was performed on the validation set; recalibration is recommended when deploying to new domains.

---

## Citation

If you use this model, please cite the SPRSound dataset and the HeAR foundation model:

```bibtex
@misc{sprsound,
  title = {SPRSound: Open-Source SJTU Paediatric Respiratory Sound Database},
  year  = {2022},
  note  = {BioCAS 2022–2025 challenge dataset}
}

@misc{hear2024,
  title  = {HeAR: Health Acoustic Representations},
  author = {Google Health},
  year   = {2024},
  url    = {https://github.com/Google-Health/hear}
}
```

---

## License

This model is released under the **Apache 2.0** license. The HeAR backbone model is subject to Google's original license terms. SPRSound data is subject to its own terms — please refer to the dataset authors.