| --- |
| language: en |
| tags: |
| - audio |
| - audio-classification |
| - respiratory-sounds |
| - healthcare |
| - medical |
| - hear |
| - vit |
| - lora |
| - pytorch |
| license: apache-2.0 |
| datasets: |
| - SPRSound |
| metrics: |
| - accuracy |
| - f1 |
| - roc_auc |
| base_model: google/hear-pytorch |
| pipeline_tag: audio-classification |
| --- |
| |
| # HeAR-SPRSound: Respiratory Sound Abnormality Classifier |
| |
| ## Model Summary |
| |
A fine-tuned respiratory sound classifier built on **Google's HeAR** (Health Acoustic Representations) foundation model. The model performs **binary classification**, distinguishing **normal** from **abnormal** respiratory sounds, and is trained on the **SPRSound** dataset spanning BioCAS challenge years 2022–2025.
| |
The architecture combines the HeAR ViT backbone (fine-tuned with LoRA) with a **Gated Attention Pooling** layer that aggregates a variable number of per-chunk embeddings into a single recording-level representation, followed by a two-layer MLP classifier.
| |
| --- |
| |
| ## Architecture |
| |
| ``` |
| Audio Input (16 kHz WAV) |
| β |
| HeAR Preprocessing (2-second chunks, log-mel spectrograms [1 Γ 192 Γ 128]) |
| β |
| HeAR ViT Encoder (google/hear-pytorch) |
| ββ LoRA adapters on Q & V projections in last 6 transformer blocks |
| β |
| Per-chunk CLS Embeddings [B Γ T Γ 512] |
| β |
| Gated Attention Pooling (length-masked softmax attention over chunks) |
| β |
| Pooled Representation [B Γ 512] |
| β |
| MLP Classifier (512 β 256 β 2, GELU, Dropout 0.4) |
| β |
| Normal / Abnormal |
| ``` |
| |
| **Key components:** |
| - **Backbone**: `google/hear-pytorch` (frozen except LoRA layers + LayerNorms) |
| - **LoRA**: rank=16, alpha=16, dropout=0.3, applied to Q+V projections in last 6 blocks |
- **Pooling**: Gated Attention Pool (dual-path tanh × sigmoid gating, hidden dim 512)
- **Loss**: Focal Loss (γ=2.0) with class-balanced sample weighting
| - **Inference**: Per-class threshold optimization (one-vs-rest F1 on validation set) |
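
The focal loss with γ=2.0 can be sketched as below; the optional per-class `weight` argument stands in for the class-balanced weighting, whose exact form is an assumption here:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0, weight=None):
    """Focal loss: down-weights easy examples by (1 - p_t)^gamma.

    logits: [B, C], targets: [B] class indices,
    weight: optional per-class weights [C].
    """
    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets, weight=weight, reduction="none")  # [B]
    p_t = log_p.gather(1, targets[:, None]).squeeze(1).exp()          # prob of true class
    return ((1.0 - p_t) ** gamma * ce).mean()

logits = torch.tensor([[2.0, -1.0], [0.1, 0.3]])
targets = torch.tensor([0, 1])
loss = focal_loss(logits, targets)
```

With `gamma=0` and no weights this reduces exactly to standard cross-entropy; raising γ shifts the gradient budget toward hard, misclassified examples.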
| |
| --- |
| |
| ## Training Details |
| |
| | Hyperparameter | Value | |
| |---|---| |
| | Base model | `google/hear-pytorch` | |
| | Input sample rate | 16,000 Hz | |
| | Chunk size | 2 seconds (32,000 samples) | |
| | Max audio duration | 10 seconds (up to 5 chunks) | |
| | Optimizer | AdamW | |
| | Learning rate | 5e-5 | |
| | Weight decay | 0.2 | |
| | Warmup epochs | 10 | |
| | Max epochs | 100 | |
| | Batch size | 96 | |
| | Early stopping patience | 20 epochs | |
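
The optimizer settings in the table can be reproduced roughly as follows. The linear warmup shape and the constant post-warmup schedule are assumptions; the card only states a 10-epoch warmup:

```python
import torch

model = torch.nn.Linear(512, 2)  # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.2)

WARMUP_EPOCHS = 10

def lr_lambda(epoch: int) -> float:
    # Linear warmup over the first 10 epochs, then hold the base LR.
    return min(1.0, (epoch + 1) / WARMUP_EPOCHS)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```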
| |
| --- |
| |
| ## Dataset |
| |
**SPRSound**: multi-year BioCAS challenge respiratory auscultation dataset.
| |
| | Year | Split | |
| |---|---| |
| | BioCAS 2022 | Train + Inter/Intra test | |
| | BioCAS 2023 | Test | |
| | BioCAS 2024 | Test | |
| | BioCAS 2025 | Test | |
| |
| All data was **re-split at the patient level** (70% train / 15% val / 15% test) to prevent data leakage. No patient appears in more than one split. Labels were consolidated to a binary scheme: |
| |
| - **normal**: all event annotations are "Normal" |
| - **abnormal**: any non-normal respiratory event present (wheeze, crackle, rhonchus, etc.) |
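
Under this scheme, consolidation reduces to a one-liner; the annotation strings shown are illustrative, not the dataset's exact serialization:

```python
def consolidate_label(event_annotations: list[str]) -> str:
    """Map a recording's event annotations to a binary label.

    A recording is 'normal' only if every event is annotated 'Normal';
    any other event (wheeze, crackle, rhonchus, ...) makes it 'abnormal'.
    """
    return "normal" if all(e == "Normal" for e in event_annotations) else "abnormal"

print(consolidate_label(["Normal", "Normal"]))  # normal
print(consolidate_label(["Normal", "Wheeze"]))  # abnormal
```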
| |
| Class imbalance was addressed through `WeightedRandomSampler` and Focal Loss. |
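
A minimal sketch of the sampling side, using PyTorch's `WeightedRandomSampler` with inverse-frequency per-sample weights (the exact weighting used in training is an assumption):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])    # imbalanced: 6 normal, 2 abnormal
class_counts = torch.bincount(labels).float()
sample_weights = (1.0 / class_counts)[labels]       # inverse class frequency per sample

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(labels), batch_size=4, sampler=sampler)
```

Sampling with replacement under these weights yields roughly class-balanced batches despite the 3:1 imbalance.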
| |
| --- |
| |
| ## Data Augmentation |
| |
| A custom `PhoneLikeAugment` pipeline was applied during training (p=0.5) to simulate real-world acoustic variability: |
| |
- Random gain (−18 to +8 dB)
- Phone band-limiting (HP: 120–200 Hz, LP: 4–8 kHz)
- Fast echo / room simulation (10–80 ms delay taps)
- Colored noise addition (SNR 3–25 dB)
- Soft AGC / tanh compression
- Random time shift (±80 ms)
| - Rare clipping (p=0.15) |
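
The actual `PhoneLikeAugment` pipeline is custom and not reproduced here; as an illustration, two of its stages (random gain and noise at a target SNR) can be sketched in plain PyTorch, with white noise standing in for the colored-noise stage:

```python
import torch

def random_gain(wav: torch.Tensor, low_db: float = -18.0, high_db: float = 8.0) -> torch.Tensor:
    """Scale the waveform by a gain drawn uniformly in dB."""
    gain_db = torch.empty(1).uniform_(low_db, high_db)
    return wav * (10.0 ** (gain_db / 20.0))

def add_noise_at_snr(wav: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Add noise scaled so the resulting signal-to-noise ratio is snr_db."""
    noise = torch.randn_like(wav)
    sig_pow = wav.pow(2).mean()
    noise_pow = noise.pow(2).mean()
    scale = torch.sqrt(sig_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return wav + scale * noise

wav = torch.randn(32000)  # one 2-second chunk at 16 kHz
out = add_noise_at_snr(random_gain(wav), snr_db=10.0)
```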
| |
| --- |
| |
| ## Usage |
| |
```python
import torch

# AdaptiveRespiratoryModel (HeAR ViT backbone + Gated Attention Pooling + MLP head)
# is defined in this repository's model code; import it from there.
model = AdaptiveRespiratoryModel(
    num_classes=2,
    dropout=0.4,
    use_lora=True,
    lora_r=16,
    lora_alpha=16,
    lora_dropout=0.3,
    lora_last_n_blocks=6,
)
checkpoint = torch.load("best_model.pth", map_location="cpu", weights_only=False)
model.load_state_dict(checkpoint["model"], strict=False)
model.eval()

# Audio must be 16 kHz mono, processed through HeAR's preprocess_audio
# into chunks of shape [T, 1, 192, 128] before the forward pass.
```
| |
> ⚠️ Requires `google/hear-pytorch` and the [HEAR](https://github.com/Google-Health/hear) library for audio preprocessing.
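
Splitting a raw waveform into the model's 2-second windows (with the 10-second / 5-chunk cap from the training table) can be sketched with a hypothetical helper; HeAR's own `preprocess_audio` must still be applied to each chunk afterwards:

```python
import torch

def chunk_audio(wav: torch.Tensor, sr: int = 16000,
                chunk_s: float = 2.0, max_chunks: int = 5) -> torch.Tensor:
    """Split a mono waveform [N] into fixed 2 s chunks [T, chunk_len],
    zero-padding the last partial chunk."""
    chunk_len = int(sr * chunk_s)
    wav = wav[: chunk_len * max_chunks]          # cap at 10 s
    n = -(-wav.numel() // chunk_len)             # ceil division
    padded = torch.zeros(n * chunk_len)
    padded[: wav.numel()] = wav
    return padded.view(n, chunk_len)

chunks = chunk_audio(torch.randn(7 * 16000))     # 7 s of audio -> 4 chunks
print(chunks.shape)  # torch.Size([4, 32000])
```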
| |
| --- |
| |
| ## Limitations & Intended Use |
| |
| - **Intended use**: Research and prototyping in respiratory sound analysis. **Not validated for clinical use.** |
| - The model was trained on auscultation recordings from SPRSound; performance may degrade on recordings from different stethoscope types, microphones, or patient populations. |
- Binary classification only: the model does not distinguish between specific pathology types (e.g., wheeze vs. crackle).
| - Threshold calibration was performed on the validation set; recalibration is recommended when deploying to new domains. |
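
Recalibration can be as simple as sweeping candidate thresholds for the best F1 on held-out scores. A minimal sketch, with hypothetical validation scores and "abnormal" as the positive class:

```python
import numpy as np

def best_f1_threshold(scores: np.ndarray, labels: np.ndarray):
    """Sweep candidate thresholds and return the one maximizing F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7])  # hypothetical abnormal-class scores
labels = np.array([0, 0, 1, 1, 1])
t, f1 = best_f1_threshold(scores, labels)
```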
| |
| --- |
| |
| ## Citation |
| |
| If you use this model, please cite the SPRSound dataset and the HeAR foundation model: |
| |
| ```bibtex |
| @misc{sprsound, |
| title = {SPRSound: Open-Source SJTU Paediatric Respiratory Sound Database}, |
| year = {2022}, |
  note = {BioCAS 2022–2025 challenge dataset}
| } |
| |
| @misc{hear2024, |
| title = {HeAR: Health Acoustic Representations}, |
| author = {Google Health}, |
| year = {2024}, |
| url = {https://github.com/Google-Health/hear} |
| } |
| ``` |
| |
| --- |
| |
| ## License |
| |
This model is released under the **Apache 2.0** license. The HeAR backbone model is subject to Google's original license terms. SPRSound data is subject to its own terms; please refer to the dataset authors.