askyishan committed on
Commit cae9c10 · verified · 1 Parent(s): 967b87d

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +131 -0
README.md ADDED
@@ -0,0 +1,131 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - audio
+ - medical
+ - cardiopulmonary
+ - auscultation
+ - instruction-tuning
+ - lora
+ - medgemma
+ base_model: google/medgemma-4b-it
+ datasets:
+ - askyishan/StethoBench
+ ---
+
+ # StethoLM
+
+ **StethoLM** is the first audio–language model specialized for cardiopulmonary auscultation. It follows natural-language instructions across seven clinical task categories, from binary normal/abnormal classification to differential diagnosis. It integrates a cardiopulmonary audio encoder with a medical language model backbone and is trained on [StethoBench](https://huggingface.co/datasets/askyishan/StethoBench), a benchmark of 77,027 instruction–response pairs built from 16,125 labeled recordings.
+
+ > Published in **TMLR** (Transactions on Machine Learning Research), 2025.
+
+ ---
+
+ ## Model Description
+
+ StethoLM connects a **COLA audio encoder** (EfficientNet-based, pre-trained on cardiopulmonary sounds via [CaReAQA](https://arxiv.org/abs/2501.02225)) to **MedGemma-4B-IT** through a learned MLP prefix projector. The projector maps each audio embedding to a short sequence of prefix tokens, which are prepended to the text input of the language model. All three components (audio encoder, prefix projector, and language model via LoRA) are fine-tuned jointly, end to end.
+
+ **Architecture:**
+ - **Audio encoder:** COLA (EfficientNet backbone), pre-trained on cardiopulmonary audio; outputs 1280-dim embeddings and is **fine-tuned** during StethoLM training
+ - **Prefix projector:** 3-layer MLP mapping the 1280-dim audio embedding to 4 LM prefix tokens
+ - **Language model backbone:** [google/medgemma-4b-it](https://huggingface.co/google/medgemma-4b-it) fine-tuned with LoRA (r=8, α=32)
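The prefix path described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the repository's code: the hidden width, the ReLU activation, the random weights, and the tiny dimensions are placeholders chosen so the example runs instantly. In the real model the input is a 1280-dim COLA embedding and the output is 4 prefix tokens at the language model's hidden size.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Random (weights, bias) for a dense layer; weights has out_dim rows of in_dim values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def linear(x, layer):
    """Apply y = W x + b to a plain list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def project_audio_to_prefix(audio_embedding, layers, num_prefix_tokens, lm_dim):
    """Run the MLP, then split the flat output into num_prefix_tokens vectors of size lm_dim."""
    hidden = audio_embedding
    for index, layer in enumerate(layers):
        hidden = linear(hidden, layer)
        if index < len(layers) - 1:  # nonlinearity between layers (ReLU is an assumption)
            hidden = [max(0.0, value) for value in hidden]
    return [hidden[i * lm_dim:(i + 1) * lm_dim] for i in range(num_prefix_tokens)]

# Toy sizes for speed; the README's real values are a 1280-dim input and 4 prefix tokens.
AUDIO_DIM, HIDDEN_DIM, LM_DIM, NUM_PREFIX_TOKENS = 16, 32, 8, 4

rng = random.Random(0)
projector = [  # three dense layers, matching the "3-layer MLP" description
    make_linear(AUDIO_DIM, HIDDEN_DIM, rng),
    make_linear(HIDDEN_DIM, HIDDEN_DIM, rng),
    make_linear(HIDDEN_DIM, NUM_PREFIX_TOKENS * LM_DIM, rng),
]

audio_embedding = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]  # stand-in for the encoder output
text_embeddings = [[0.0] * LM_DIM for _ in range(5)]               # stand-in for 5 embedded text tokens

prefix_tokens = project_audio_to_prefix(audio_embedding, projector, NUM_PREFIX_TOKENS, LM_DIM)
lm_input = prefix_tokens + text_embeddings                         # audio prefix goes first

print(len(prefix_tokens), len(lm_input), len(lm_input[0]))  # 4 9 8
```

Because the audio is compressed into a fixed number of prefix vectors, the language model sees the same sequence overhead for every recording regardless of its duration.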
+
+ **Training:**
+ - **Stage 1:** Supervised fine-tuning (SFT) on the StethoBench training split
+ - **Stage 2:** Multimodal Direct Preference Optimization (mDPO) with an audio-degradation-based conditional preference
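Stage 2 can be read as standard DPO plus a conditional term that prefers the chosen response under clean audio over the same response under degraded audio. The sketch below follows the general mDPO formulation from the literature (which also includes an anchoring term, omitted here for brevity). The exact objective, the value of `beta`, the term weighting, and the specific audio degradations used for StethoLM are not stated in this README, so everything here is an assumption for illustration. Inputs are summed response log-probabilities under the policy and under a frozen reference model.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_term(policy_good, ref_good, policy_bad, ref_bad, beta):
    """-log sigmoid(beta * (good log-ratio - bad log-ratio))."""
    margin = (policy_good - ref_good) - (policy_bad - ref_bad)
    return -log_sigmoid(beta * margin)

def mdpo_loss(logp, ref_logp, beta=0.1):
    """
    logp / ref_logp: dicts of summed log-probs for three (audio, response) pairs:
      "chosen_clean"    : preferred response, original audio
      "rejected_clean"  : dispreferred response, original audio
      "chosen_degraded" : preferred response, degraded audio
    """
    # Response preference: chosen vs. rejected answer, same clean audio.
    response_pref = dpo_term(logp["chosen_clean"], ref_logp["chosen_clean"],
                             logp["rejected_clean"], ref_logp["rejected_clean"], beta)
    # Conditional preference: same answer, clean vs. degraded audio.
    audio_pref = dpo_term(logp["chosen_clean"], ref_logp["chosen_clean"],
                          logp["chosen_degraded"], ref_logp["chosen_degraded"], beta)
    return response_pref + audio_pref

# Made-up log-probabilities, only to exercise the function.
reference = {"chosen_clean": -12.0, "rejected_clean": -11.0, "chosen_degraded": -12.5}
policy = {"chosen_clean": -10.0, "rejected_clean": -13.0, "chosen_degraded": -14.0}
print(mdpo_loss(policy, reference))
```

The conditional term is what forces the model to use the audio: a policy that ignores the recording assigns the same probability to the chosen response with clean and degraded input, so that term stays at log 2 and cannot be reduced.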
+
+ ---
+
+ ## Intended Use
+
+ StethoLM is designed for **research** on AI-assisted cardiopulmonary auscultation. It supports seven clinical task categories:
+
+ | Task | Description |
+ |------|-------------|
+ | **Classification** | Binary normal/abnormal classification |
+ | **Identification** | Identifying specific sound types (e.g., wheezing, crackles) |
+ | **Report** | Generating a structured auscultation report |
+ | **Reasoning** | Explaining clinical findings |
+ | **Differential Diagnosis (DDx)** | Listing possible diagnoses |
+ | **Comparison** | Comparing findings across recordings |
+ | **Location** | Identifying the anatomical auscultation site |
+
+ > ⚠️ **Not for clinical use.** This model is intended for research purposes only and has not been validated for clinical decision-making.
+
+ ---
+
+ ## How to Use
+
+ This repository contains the **adapter weights** (fine-tuned audio encoder + LoRA adapters + prefix projector, ~713 MB). The base MedGemma-4B model is downloaded automatically from Hugging Face on first run.
+
+ ### 1. Clone the code repository
+
+ ```bash
+ git clone https://github.com/askyishan/StethoLM
+ cd StethoLM
+ pip install -r requirements.txt
+ ```
+
+ ### 2. Download the adapter checkpoint
+
+ ```bash
+ huggingface-cli download askyishan/StethoLM stetholm_adapter.pt --local-dir checkpoints/
+ ```
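The same download can be scripted from Python with `huggingface_hub`, the library used to upload this README. A minimal sketch, assuming `huggingface_hub` is installed and that `checkpoints/` is the desired target directory; `fetch_adapter` is a hypothetical helper name, not something the StethoLM repository provides.

```python
REPO_ID = "askyishan/StethoLM"
FILENAME = "stetholm_adapter.pt"

def fetch_adapter(local_dir="checkpoints"):
    """Download the adapter checkpoint and return its local path."""
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)

# Usage (requires network access):
#   path = fetch_adapter()
#   print(path)
```

The returned path can then be passed to `predict.py` through `--checkpoint`.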
+
+ ### 3. Run inference
+
+ ```bash
+ python predict.py \
+   --input_jsonl data/stethobench.jsonl \
+   --output_jsonl predictions.jsonl \
+   --audio_dir /path/to/audio_files \
+   --checkpoint checkpoints/stetholm_adapter.pt \
+   --model_name google/medgemma-4b-it \
+   --audio_encoder cola \
+   --split test
+ ```
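After the run finishes, it can help to sanity-check `predictions.jsonl` before computing metrics. The output schema is not documented in this README, so the sketch below makes no assumption about field names: it only counts records and reports which top-level keys appear. `summarize_jsonl` is a hypothetical helper, not part of the StethoLM repository.

```python
import json
from collections import Counter

def summarize_jsonl(path):
    """Return (record_count, Counter of top-level keys) for a JSON Lines file."""
    count = 0
    keys = Counter()
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}:{line_number}: invalid JSON ({error})") from error
            count += 1
            if isinstance(record, dict):
                keys.update(record.keys())
    return count, keys

# Usage:
#   count, keys = summarize_jsonl("predictions.jsonl")
#   print(count, dict(keys))
```

A key whose count is lower than the record count points to rows where that field is missing, which is worth checking before scoring.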
+
+ ---
+
+ ## Training Data
+
+ StethoLM was trained on [StethoBench](https://huggingface.co/datasets/askyishan/StethoBench). The training split comprises recordings from 7 in-domain datasets; 4 additional datasets are held out as out-of-distribution (OOD) test sets.
+
+ **In-domain training datasets:**
+
+ | Dataset | Domain |
+ |---------|--------|
+ | CirCor DigiScope (heart-circor) | Heart |
+ | SPRSound (spr) | Lung |
+ | COVID-UK (coviduk) | Cough |
+ | CoughVid (coughvid) | Cough |
+ | ICBHI (icbhi) | Lung |
+ | ZCHSound (heart-zch) | Heart |
+ | KAUH (kauh) | Cardiopulmonary |
+
+ **Out-of-distribution (OOD) test datasets:**
+
+ | Dataset | Domain |
+ |---------|--------|
+ | BMD-HS | Heart |
+ | CINC | Cardiopulmonary |
+ | TR | Lung |
+ | FluSense | Cough |
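When reporting results, in-domain and OOD scores are usually kept separate. The two tables above can be encoded as a small lookup so that per-dataset outputs can be grouped by split and domain. This is a sketch only: the in-domain keys reuse the identifiers shown in parentheses, while the OOD keys are the display names from the table because their identifiers are not given here. The helper names are hypothetical.

```python
IN_DOMAIN = {
    "heart-circor": "Heart",
    "spr": "Lung",
    "coviduk": "Cough",
    "coughvid": "Cough",
    "icbhi": "Lung",
    "heart-zch": "Heart",
    "kauh": "Cardiopulmonary",
}
OOD = {"BMD-HS": "Heart", "CINC": "Cardiopulmonary", "TR": "Lung", "FluSense": "Cough"}

def split_of(dataset):
    """Return "in-domain" or "ood" for a known dataset name; raise KeyError otherwise."""
    if dataset in IN_DOMAIN:
        return "in-domain"
    if dataset in OOD:
        return "ood"
    raise KeyError(f"unknown dataset: {dataset}")

def datasets_by_domain(table):
    """Group dataset names by domain, preserving table order within each group."""
    grouped = {}
    for name, domain in table.items():
        grouped.setdefault(domain, []).append(name)
    return grouped

print(split_of("icbhi"), split_of("FluSense"))
print(datasets_by_domain(IN_DOMAIN))
```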
+
+ ---
+
+ ## Citation
+
+ If you use StethoLM or StethoBench in your research, please cite:
+
+ ```bibtex
+ @article{stetholm2025,
+   title   = {StethoLM: An Audio–Language Model for Cardiopulmonary Auscultation},
+   author  = {},
+   journal = {Transactions on Machine Learning Research},
+   year    = {2025},
+   url     = {https://huggingface.co/askyishan/StethoLM}
+ }
+ ```