---
language:
- en
license: apache-2.0
tags:
- audio
- medical
- cardiopulmonary
- auscultation
- instruction-tuning
- lora
- medgemma
base_model: google/medgemma-4b-it
datasets:
- askyishan/StethoBench
---

# StethoLM

**StethoLM** is the first audio–language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. It integrates a cardiopulmonary audio encoder with a medical language model backbone, trained on [StethoBench](https://huggingface.co/datasets/askyishan/StethoBench), a comprehensive benchmark of 77,027 instruction–response pairs from 16,125 labeled recordings.

This work is published in Transactions on Machine Learning Research (TMLR).

---

## Model Description

StethoLM connects a **COLA audio encoder** (EfficientNet-based, pre-trained on cardiopulmonary sounds via [CaReAQA](https://arxiv.org/abs/2505.01199)) to **MedGemma-4B-IT** through a learned MLP prefix projector. The audio is encoded into a short sequence of prefix tokens that are prepended to the text input of the language model. All components (audio encoder, prefix projector, and language model via LoRA) are jointly fine-tuned end-to-end.

**Architecture:**
- **Audio encoder:** COLA (EfficientNet backbone), pre-trained on cardiopulmonary audio, outputs 1280-dim embeddings; **fine-tuned** during StethoLM training
- **Prefix projector:** 3-layer MLP mapping audio features to 4 LM prefix tokens
- **Language model backbone:** [google/medgemma-4b-it](https://huggingface.co/google/medgemma-4b-it), fine-tuned with LoRA (r=8, α=32)
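
The prefix-conditioning pattern described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the MLP hidden widths and the LM embedding width (2560 here) are assumptions made for the sketch.

```python
import numpy as np

AUDIO_DIM = 1280  # COLA encoder output dimension (from the model card)
N_PREFIX = 4      # number of audio prefix tokens (from the model card)
HIDDEN = 2560     # LM embedding width: an illustrative assumption

rng = np.random.default_rng(0)

def mlp_projector(audio_emb: np.ndarray) -> np.ndarray:
    """3-layer MLP mapping a (1280,)-dim audio embedding to 4 prefix tokens.

    Hidden widths (2048) are placeholders, not the trained configuration.
    """
    w1 = rng.standard_normal((AUDIO_DIM, 2048)) * 0.02
    w2 = rng.standard_normal((2048, 2048)) * 0.02
    w3 = rng.standard_normal((2048, N_PREFIX * HIDDEN)) * 0.02
    h = np.maximum(audio_emb @ w1, 0.0)        # ReLU
    h = np.maximum(h @ w2, 0.0)                # ReLU
    return (h @ w3).reshape(N_PREFIX, HIDDEN)  # (4, HIDDEN) prefix tokens

audio_emb = rng.standard_normal(AUDIO_DIM)     # stand-in for a COLA embedding
prefix = mlp_projector(audio_emb)

# The prefix tokens are prepended to the embedded text tokens before the LM.
text_emb = rng.standard_normal((10, HIDDEN))   # stand-in for 10 text tokens
lm_input = np.concatenate([prefix, text_emb], axis=0)
```

The language model then attends over the 4 audio tokens and the text tokens jointly, which is what lets a single text instruction condition the interpretation of the recording.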

**Training:**
- **Stage 1:** Supervised fine-tuning (SFT) on the StethoBench training split
- **Stage 2:** Multimodal Direct Preference Optimization (mDPO) with an audio-degradation-based conditional preference objective
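
For reference, the LoRA setting above (r=8, α=32) corresponds to a low-rank additive update on each adapted weight matrix. A minimal numerical sketch (matrix shapes here are illustrative, not MedGemma's actual layer sizes):

```python
import numpy as np

# LoRA: W_eff = W + (alpha / r) * B @ A, with r=8 and alpha=32 as stated above.
r, alpha = 8, 32
d_out, d_in = 64, 64  # illustrative layer shape

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

scaling = alpha / r                          # = 4.0
W_eff = W + scaling * (B @ A)

# With B zero-initialized, the adapter starts as an exact no-op:
assert np.allclose(W_eff, W)
```

Only `A` and `B` (plus the audio encoder and prefix projector) are updated during fine-tuning, which keeps the adapter checkpoint small relative to the 4B-parameter backbone.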

---

## Intended Use

StethoLM is designed for **research** on AI-assisted cardiopulmonary auscultation. It supports seven clinical task categories:

| | Task | Description | |
| |------|-------------| |
| | **Classification** | Binary normal/abnormal classification | |
| | **Identification** | Identifying specific sound types (e.g., wheezing, crackles) | |
| | **Report** | Generating a structured auscultation report | |
| | **Reasoning** | Explaining clinical findings | |
| | **Differential Diagnosis (DDx)** | Listing possible diagnoses | |
| | **Comparison** | Comparing findings across recordings | |
| | **Location** | Identifying anatomical auscultation site | |

> ⚠️ **Not for clinical use.** This model is intended for research purposes only and has not been validated for clinical decision-making.

---

## How to Use

This repository contains the **adapter weights** (fine-tuned audio encoder + LoRA adapters + prefix projector, ~713 MB). The base MedGemma-4B-IT model is downloaded automatically from Hugging Face on first run.

### 1. Clone the code repository

```bash
git clone https://github.com/askyishan/StethoLM
cd StethoLM
pip install -r requirements.txt
```

### 2. Download the adapter checkpoint

```bash
huggingface-cli download askyishan/StethoLM stetholm_adapter.pt --local-dir checkpoints/
```

### 3. Run inference

```bash
python predict.py \
  --input_jsonl data/stethobench.jsonl \
  --output_jsonl predictions.jsonl \
  --audio_dir /path/to/audio_files \
  --checkpoint checkpoints/stetholm_adapter.pt \
  --model_name google/medgemma-4b-it \
  --audio_encoder cola \
  --split test
```
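
The input JSONL schema is defined by the code repository. As a rough illustration only (the field names below are hypothetical; consult `predict.py` for the actual schema), each line would be one JSON object pairing an audio file with an instruction:

```python
import json

# Hypothetical record layout; field names are NOT confirmed by the repo.
record = {
    "audio": "icbhi_0001.wav",  # resolved relative to --audio_dir
    "instruction": "Is this lung sound normal or abnormal?",
    "task": "classification",
}

line = json.dumps(record)  # one JSON object per line of the JSONL file
```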

---

## Training Data

StethoLM was trained on [StethoBench](https://huggingface.co/datasets/askyishan/StethoBench). The training split comprises recordings from 7 in-domain datasets; 4 additional datasets are held out as out-of-distribution (OOD) test sets.

**In-domain training datasets:**

| | Dataset | Domain | |
| |---------|--------| |
| | CirCor DigiScope (heart-circor) | Heart | |
| | SPRSound (spr) | Lung | |
| | COVID-UK (coviduk) | Cough | |
| | CoughVid (coughvid) | Cough | |
| | ICBHI (icbhi) | Lung | |
| | ZCHSound (heart-zch) | Heart | |
| | KAUH (kauh) | Cardiopulmonary | |

**Out-of-distribution (OOD) test datasets:**

| | Dataset | Domain | |
| |---------|--------| |
| | BMD-HS | Heart | |
| | CINC | Cardiopulmonary | |
| | TR | Lung | |
| | FluSense | Cough | |

---

## Citation

If you use StethoLM or StethoBench in your research, please cite:

```bibtex
@article{stetholm2025,
  title   = {StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks},
  author  = {Wang, Yishan and Wang, Tsai-Ning and Funk, Mathias and Saeed, Aaqib},
  journal = {Transactions on Machine Learning Research},
  year    = {2026},
  url     = {https://huggingface.co/askyishan/StethoLM}
}
```