---
language:
- en
license: apache-2.0
tags:
- audio
- medical
- cardiopulmonary
- auscultation
- instruction-tuning
- lora
- medgemma
base_model: google/medgemma-4b-it
datasets:
- askyishan/StethoBench
---

# StethoLM

**StethoLM** is the first audio–language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. It integrates a cardiopulmonary audio encoder with a medical language model backbone, trained on [StethoBench](https://huggingface.co/datasets/askyishan/StethoBench) — a comprehensive benchmark of 77,027 instruction–response pairs from 16,125 labeled recordings.

This work is published in Transactions on Machine Learning Research (TMLR).

---

## Model Description

StethoLM connects a **COLA audio encoder** (EfficientNet-based, pre-trained on cardiopulmonary sounds via [CaReAQA](https://arxiv.org/abs/2505.01199)) to **MedGemma-4B-IT** via a learned MLP prefix projector. The audio is encoded into a short sequence of prefix tokens that are prepended to the text input of the language model. All components — audio encoder, prefix projector, and language model (via LoRA) — are jointly fine-tuned end-to-end.

**Architecture:**
- **Audio encoder:** COLA (EfficientNet backbone), pre-trained on cardiopulmonary audio, outputs 1280-dim embeddings; **fine-tuned** during StethoLM training
- **Prefix projector:** 3-layer MLP mapping audio features to 4 LM prefix tokens
- **Language model backbone:** [google/medgemma-4b-it](https://huggingface.co/google/medgemma-4b-it) fine-tuned with LoRA (r=8, α=32)
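
The prefix-conditioning scheme above can be sketched as follows. This is an illustrative NumPy mock, not the actual implementation: only the 1280-dim audio embedding and the 4 prefix tokens come from this card; the hidden width, the LM embedding dimension, the ReLU nonlinearity, and the weight initialization are assumptions for demonstration.

```python
import numpy as np

# Illustrative sketch of StethoLM's audio-prefix conditioning.
# AUDIO_DIM (1280) and N_PREFIX (4) are from the model card;
# HIDDEN and LM_DIM are hypothetical stand-ins.
AUDIO_DIM, N_PREFIX, HIDDEN, LM_DIM = 1280, 4, 2048, 2560

rng = np.random.default_rng(0)

def mlp_prefix_projector(audio_emb, weights):
    """3-layer MLP mapping one pooled audio embedding to N_PREFIX LM prefix tokens."""
    h = audio_emb
    for i, (W, b) in enumerate(weights):
        h = h @ W + b
        if i < len(weights) - 1:        # nonlinearity on hidden layers only
            h = np.maximum(h, 0.0)      # ReLU (assumed)
    return h.reshape(N_PREFIX, LM_DIM)  # split flat output into prefix tokens

# Randomly initialized weights for illustration (trained end-to-end in practice).
shapes = [(AUDIO_DIM, HIDDEN), (HIDDEN, HIDDEN), (HIDDEN, N_PREFIX * LM_DIM)]
weights = [(rng.standard_normal(s) * 0.02, np.zeros(s[1])) for s in shapes]

audio_emb = rng.standard_normal(AUDIO_DIM)       # stand-in for COLA encoder output
prefix = mlp_prefix_projector(audio_emb, weights)
text_embeds = rng.standard_normal((10, LM_DIM))  # stand-in for embedded text prompt
lm_input = np.concatenate([prefix, text_embeds], axis=0)
print(lm_input.shape)  # (14, 2560): 4 audio prefix tokens + 10 text tokens
```

The key design point is that the language model sees the recording only through these 4 prepended embedding vectors; the text prompt is otherwise processed unchanged.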

**Training:**
- **Stage 1:** Supervised fine-tuning (SFT) on StethoBench training split
- **Stage 2:** Multimodal Direct Preference Optimization (mDPO) with audio degradation-based conditional preference
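
For intuition on Stage 2: the standard DPO objective scores a (preferred, dispreferred) response pair against a frozen reference model; in mDPO the dispreferred side is conditioned on degraded audio. The sketch below shows only the vanilla DPO loss, with toy log-probabilities and an assumed `beta=0.1` (the paper's exact mDPO formulation and hyperparameters are not reproduced here).

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on one preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred response;
    ref_*: the same quantities under the frozen reference model. In mDPO the
    dispreferred response is tied to degraded audio (simplified away here).
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))  # -log sigmoid(beta * margin)

# Toy numbers: the policy has raised the preferred response's log-prob relative
# to the reference, so the loss drops below log(2) (the chance-level value).
loss = dpo_loss(logp_w=-10.0, logp_l=-8.0, ref_logp_w=-12.0, ref_logp_l=-8.0)
print(round(float(loss), 3))  # 0.598
```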

---

## Intended Use

StethoLM is designed for **research** on AI-assisted cardiopulmonary auscultation. It supports seven clinical task categories:

| Task | Description |
|------|-------------|
| **Classification** | Binary normal/abnormal classification |
| **Identification** | Identifying specific sound types (e.g., wheezing, crackles) |
| **Report** | Generating a structured auscultation report |
| **Reasoning** | Explaining clinical findings |
| **Differential Diagnosis (DDx)** | Listing possible diagnoses |
| **Comparison** | Comparing findings across recordings |
| **Location** | Identifying anatomical auscultation site |

> ⚠️ **Not for clinical use.** This model is intended for research purposes only and has not been validated for clinical decision-making.

---

## How to Use

This repository contains the **adapter weights** (fine-tuned audio encoder + LoRA adapters + prefix projector, ~713 MB). The base MedGemma-4B-IT model is downloaded automatically from the Hugging Face Hub on first run.

### 1. Clone the code repository

```bash
git clone https://github.com/askyishan/StethoLM
cd StethoLM
pip install -r requirements.txt
```

### 2. Download the adapter checkpoint

```bash
huggingface-cli download askyishan/StethoLM stetholm_adapter.pt --local-dir checkpoints/
```

### 3. Run inference

```bash
python predict.py \
    --input_jsonl data/stethobench.jsonl \
    --output_jsonl predictions.jsonl \
    --audio_dir /path/to/audio_files \
    --checkpoint checkpoints/stetholm_adapter.pt \
    --model_name google/medgemma-4b-it \
    --audio_encoder cola \
    --split test
```

---

## Training Data

StethoLM was trained on [StethoBench](https://huggingface.co/datasets/askyishan/StethoBench). The training split comprises recordings from 7 in-domain datasets; 4 additional datasets are held out as out-of-distribution (OOD) test sets.

**In-domain training datasets:**

| Dataset | Domain |
|---------|--------|
| CirCor DigiScope (heart-circor) | Heart |
| SPRSound (spr) | Lung |
| COVID-UK (coviduk) | Cough |
| CoughVid (coughvid) | Cough |
| ICBHI (icbhi) | Lung |
| ZCHSound (heart-zch) | Heart |
| KAUH (kauh) | Cardiopulmonary |

**Out-of-distribution (OOD) test datasets:**

| Dataset | Domain |
|---------|--------|
| BMD-HS | Heart |
| CINC | Cardiopulmonary |
| TR | Lung |
| FluSense | Cough |

---

## Citation

If you use StethoLM or StethoBench in your research, please cite:

```bibtex
@article{stetholm2025,
  title     = {StethoLM: An Audio–Language Model for Cardiopulmonary Auscultation},
  author    = {},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://huggingface.co/askyishan/StethoLM}
}
```