---
language:
- en
license: apache-2.0
tags:
- audio
- medical
- cardiopulmonary
- auscultation
- instruction-tuning
- lora
- medgemma
base_model: google/medgemma-4b-it
datasets:
- askyishan/StethoBench
---
# StethoLM
**StethoLM** is the first audio–language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. It integrates a cardiopulmonary audio encoder with a medical language model backbone, trained on [StethoBench](https://huggingface.co/datasets/askyishan/StethoBench) — a comprehensive benchmark of 77,027 instruction–response pairs from 16,125 labeled recordings.
This work is published in Transactions on Machine Learning Research (TMLR).
---
## Model Description
StethoLM connects a **COLA audio encoder** (EfficientNet-based, pre-trained on cardiopulmonary sounds via [CaReAQA](https://arxiv.org/abs/2505.01199)) to **MedGemma-4B-IT** via a learned MLP prefix projector. The audio is encoded into a short sequence of prefix tokens that are prepended to the text input of the language model. All components — audio encoder, prefix projector, and language model (via LoRA) — are jointly fine-tuned end-to-end.
**Architecture:**
- **Audio encoder:** COLA (EfficientNet backbone), pre-trained on cardiopulmonary audio, outputs 1280-dim embeddings; **fine-tuned** during StethoLM training
- **Prefix projector:** 3-layer MLP mapping audio features to 4 LM prefix tokens
- **Language model backbone:** [google/medgemma-4b-it](https://huggingface.co/google/medgemma-4b-it) fine-tuned with LoRA (r=8, α=32)
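The prefix-projection step above can be sketched as a small PyTorch module. This is a minimal illustration, not the released implementation: the MLP hidden width and the LM hidden size (2560, the Gemma-3-4B default) are assumptions not specified in this card.

```python
import torch
import torch.nn as nn

class PrefixProjector(nn.Module):
    """Map a pooled 1280-dim COLA embedding to a few LM prefix tokens.

    Sketch only: the MLP width (2048) and LM hidden size (2560) are
    assumed values, not documented StethoLM hyperparameters.
    """

    def __init__(self, audio_dim=1280, lm_hidden=2560, n_prefix=4, mlp_width=2048):
        super().__init__()
        self.n_prefix = n_prefix
        self.lm_hidden = lm_hidden
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim, mlp_width),
            nn.GELU(),
            nn.Linear(mlp_width, mlp_width),
            nn.GELU(),
            nn.Linear(mlp_width, n_prefix * lm_hidden),
        )

    def forward(self, audio_emb):                         # (batch, 1280)
        x = self.mlp(audio_emb)                           # (batch, n_prefix * lm_hidden)
        return x.view(-1, self.n_prefix, self.lm_hidden)  # (batch, 4, 2560)

proj = PrefixProjector()
prefix = proj(torch.randn(2, 1280))
# The prefix tokens are prepended to the embedded text prompt before the LM:
text_emb = torch.randn(2, 10, 2560)  # stand-in for embedded prompt tokens
inputs_embeds = torch.cat([prefix, text_emb], dim=1)
print(inputs_embeds.shape)
```

With 4 prefix tokens and a 10-token prompt, the concatenated input has 14 token positions, so the audio conditioning adds only a constant, small overhead to the LM's context.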
**Training:**
- **Stage 1:** Supervised fine-tuning (SFT) on StethoBench training split
- **Stage 2:** Multimodal Direct Preference Optimization (mDPO) with audio degradation-based conditional preference
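The LoRA configuration noted above (r=8, α=32) corresponds to a low-rank weight update of the form W′ = W + (α/r)·B·A. A minimal numeric sketch, where the 64×64 weight shape is illustrative rather than MedGemma's actual layer dimensions:

```python
import torch

# LoRA update: W' = W + (alpha / r) * B @ A
# r=8 and alpha=32 come from the card; the 64x64 shape is illustrative only.
d_out, d_in, r, alpha = 64, 64, 8, 32
W = torch.randn(d_out, d_in)        # frozen base weight
A = torch.randn(r, d_in) * 0.01     # trainable down-projection
B = torch.zeros(d_out, r)           # trainable up-projection, zero-initialized

delta = (alpha / r) * (B @ A)       # rank-8 update, scaled by alpha/r = 4.0
W_adapted = W + delta

# With B initialized to zero, the adapted weight equals the base weight,
# so fine-tuning starts from the unmodified backbone:
print(torch.allclose(W_adapted, W))
```

The zero-initialized up-projection is the standard LoRA trick that makes the adapter a no-op at the start of training.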
---
## Intended Use
StethoLM is designed for **research** on AI-assisted cardiopulmonary auscultation. It supports seven clinical task categories:
| Task | Description |
|------|-------------|
| **Classification** | Binary normal/abnormal classification |
| **Identification** | Identifying specific sound types (e.g., wheezing, crackles) |
| **Report** | Generating a structured auscultation report |
| **Reasoning** | Explaining clinical findings |
| **Differential Diagnosis (DDx)** | Listing possible diagnoses |
| **Comparison** | Comparing findings across recordings |
| **Location** | Identifying anatomical auscultation site |
> ⚠️ **Not for clinical use.** This model is intended for research purposes only and has not been validated for clinical decision-making.
---
## How to Use
This repository contains the **adapter weights** (fine-tuned audio encoder + LoRA adapters + prefix projector, ~713 MB). The base MedGemma-4B model is downloaded automatically from the Hugging Face Hub on first run.
### 1. Clone the code repository
```bash
git clone https://github.com/askyishan/StethoLM
cd StethoLM
pip install -r requirements.txt
```
### 2. Download the adapter checkpoint
```bash
huggingface-cli download askyishan/StethoLM stetholm_adapter.pt --local-dir checkpoints/
```
### 3. Run inference
```bash
python predict.py \
    --input_jsonl data/stethobench.jsonl \
    --output_jsonl predictions.jsonl \
    --audio_dir /path/to/audio_files \
    --checkpoint checkpoints/stetholm_adapter.pt \
    --model_name google/medgemma-4b-it \
    --audio_encoder cola \
    --split test
```
---
## Training Data
StethoLM was trained on [StethoBench](https://huggingface.co/datasets/askyishan/StethoBench). The training split comprises recordings from 7 in-domain datasets; 4 additional datasets are held out as out-of-distribution (OOD) test sets.
**In-domain training datasets:**
| Dataset | Domain |
|---------|--------|
| CirCor DigiScope (heart-circor) | Heart |
| SPRSound (spr) | Lung |
| COVID-UK (coviduk) | Cough |
| CoughVid (coughvid) | Cough |
| ICBHI (icbhi) | Lung |
| ZCHSound (heart-zch) | Heart |
| KAUH (kauh) | Cardiopulmonary |
**Out-of-distribution (OOD) test datasets:**
| Dataset | Domain |
|---------|--------|
| BMD-HS | Heart |
| CINC | Cardiopulmonary |
| TR | Lung |
| FluSense | Cough |
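Given a manifest keyed by the dataset identifiers above, the in-domain/OOD partition can be expressed as a simple filter. The record layout here is hypothetical; only the dataset keys come from the tables in this card.

```python
# In-domain keys are taken from the card's tables; OOD names are lowercased
# here for illustration. The record structure is a hypothetical example.
IN_DOMAIN = {"heart-circor", "spr", "coviduk", "coughvid", "icbhi", "heart-zch", "kauh"}
OOD = {"bmd-hs", "cinc", "tr", "flusense"}

def partition(records):
    """Split manifest records into in-domain and OOD lists by dataset key."""
    in_dom = [r for r in records if r["dataset"] in IN_DOMAIN]
    ood = [r for r in records if r["dataset"] in OOD]
    return in_dom, ood

records = [
    {"dataset": "icbhi", "audio": "icbhi_0001.wav"},
    {"dataset": "flusense", "audio": "flu_0042.wav"},
]
in_dom, ood = partition(records)
print(len(in_dom), len(ood))  # 1 1
```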
---
## Citation
If you use StethoLM or StethoBench in your research, please cite:
```bibtex
@article{stetholm2025,
  title   = {StethoLM: An Audio--Language Model for Cardiopulmonary Auscultation},
  author  = {},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://huggingface.co/askyishan/StethoLM}
}
```