CoGaze: Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays
Overview
CoGaze is a vision-language pretraining framework designed for chest X-ray understanding, inspired by how radiologists interpret medical images.
It integrates:
- Gaze guidance: gaze information is used only during pretraining; downstream tasks (report generation, classification, and segmentation) do not require gaze data.
- Context-aware reasoning
- Downstream tasks: free-text and structured report generation, supervised and zero-shot classification, segmentation, and image-text retrieval
News
- [2026-03-28] Official code and pretrained models are released on Hugging Face.
- GitHub: https://github.com/mk-runner/CoGaze
Installation
# Create conda environment
conda create -n cogaze python=3.10.16
conda activate cogaze
Core Dependencies
transformers==4.43.3
radgraph==0.09
pytorch-lightning==2.5.1.post0
torch==2.4.1
torchvision==0.19.1
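To confirm that an environment matches the pins, a small check such as the following may help. It is a minimal sketch, not part of the CoGaze repository: `check_pins` is a hypothetical helper, and only a subset of the pinned packages is listed. Local build suffixes such as `+cu121` are ignored, which is an assumption about how strict the comparison should be.

```python
from importlib.metadata import PackageNotFoundError, version

# Subset of the pins listed above; extend as needed.
PINNED = {
    "transformers": "4.43.3",
    "torch": "2.4.1",
    "torchvision": "0.19.1",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            # Drop a local build suffix such as "+cu121" before comparing.
            installed = version(name).split("+")[0]
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every listed package is installed at the pinned version.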
Model Zoo
| Dataset | Pretrained Model | Report Generation Model | Outputs |
|---|---|---|---|
| MIMIC-CXR | CoGaze Pretrained Checkpoint | CoGaze (DistilGPT2) | Generated Reports |
Dataset Preparation
1. MIMIC-CXR Images
Dataset source: PhysioNet
data/
├── p10/
│   └── p10000032/
│       └── s50414267/
│           ├── image1.jpg
│           └── image2.jpg
├── p11/
└── ...
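Before training, it can be useful to verify that the images follow the layout shown above. The sketch below assumes the `pXX/pXXXXXXXX/sXXXXXXXX/*.jpg` nesting from the tree; `index_studies` is a hypothetical helper and not a function from the repository.

```python
from collections import defaultdict
from pathlib import Path


def index_studies(root):
    """Map 'patient/study' to the sorted list of JPEG paths under root."""
    studies = defaultdict(list)
    # Assumed layout: root/pXX/pXXXXXXXX/sXXXXXXXX/*.jpg
    for image in Path(root).glob("p*/p*/s*/*.jpg"):
        study_dir = image.parent
        key = f"{study_dir.parent.name}/{study_dir.name}"
        studies[key].append(image)
    return {key: sorted(paths) for key, paths in studies.items()}
```

Calling `index_studies("data")` on the example tree would yield a single key, `p10000032/s50414267`, holding both images.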
2. Annotations & Reports
Available on Hugging Face:
- Gaze heatmap
- Image-text pairs
- SRRG annotations
Link: https://huggingface.co/MK-runner/CoGaze/tree/main/mimic-annotation
3. Checkpoint Structure
ckpt_zoo_dir/
├── chexbert.pth
├── radgraph/
├── google-bert/
├── microsoft/
└── distilgpt2/
Manual download required:
- chexbert.pth
- radgraph
See: https://github.com/mk-runner/MLRG
Tip: Enable automatic download during training:
--online_ckpt "Yes"
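A quick way to see which entries still need to be downloaded is to compare the directory against the tree above. This is a sketch under the assumption that the five names in the tree are the complete set; `missing_checkpoints` is a hypothetical helper.

```python
from pathlib import Path

# Entry names taken from the checkpoint tree above.
EXPECTED = ["chexbert.pth", "radgraph", "google-bert", "microsoft", "distilgpt2"]


def missing_checkpoints(ckpt_zoo_dir, expected=EXPECTED):
    """Return the expected entries that are absent from ckpt_zoo_dir."""
    root = Path(ckpt_zoo_dir)
    return [name for name in expected if not (root / name).exists()]
```

An empty list means every expected file or directory is present; the check does not validate the contents of each entry.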
4. Additional Datasets
| Task | Dataset |
|---|---|
| Classification | NIH Chest X-rays |
| Detection | RSNA Pneumonia |
| Segmentation | SIIM-ACR |
| Tuberculosis | TBX11K |
| External | Shenzhen Dataset |
Training & Inference
Pretraining
bash script/pretrain.sh
Report Generation
Free-text (Training)
bash script/free-text-report-generation-gpt2.sh
bash script/free-text-report-generation-llm.sh
Free-text (Inference)
bash script/free-text-report-generation-gpt2-inference.sh
Structured Reports
bash script/structured-report-generation-gpt2.sh
Evaluation
Compute Metrics
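import pandas as pd
from tools.metrics.metrics import compute_all_scores

# CSV written by the inference script, with one row per study.
data = pd.read_csv("generated_reports/xxx.csv")
gts = data['reference_report'].tolist()   # ground-truth reports
gens = data['generated_report'].tolist()  # model-generated reports
# NOTE: `args` is not defined in this snippet; pass the arguments object that compute_all_scores expects.
scores = compute_all_scores(gts, gens, args)
print(scores)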
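Before scoring, it may be worth checking that the CSV has the two columns used above and no empty reports. The sketch below uses only the standard library so that it runs without pandas; `load_report_pairs` is a hypothetical helper, and skipping rows with an empty side is an assumption rather than the repository's behavior.

```python
import csv


def load_report_pairs(csv_path):
    """Read (reference, generated) pairs, skipping rows with an empty side."""
    required = {"reference_report", "generated_report"}
    gts, gens = [], []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = required - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        for row in reader:
            ref = (row["reference_report"] or "").strip()
            gen = (row["generated_report"] or "").strip()
            if ref and gen:  # assumption: incomplete pairs are dropped
                gts.append(ref)
                gens.append(gen)
    return gts, gens
```

The two returned lists stay aligned, so they can be passed to a scorer in place of `gts` and `gens`.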
Performance (DistilGPT2)
{
'BertScore': 0.5956377387046814,
'Radgraph-simple': 0.30690433233898795,
'Radgraph-partial': 0.28076371917819565,
'Radgraph-complete': 0.22603009157065043,
'SemScore': 0.45877182483673096,
'1/RadCliQ-V1': 1.082196619824061,
'RATEScore': 0.5787309255637078,
'chexbert_5_micro_f1': 0.5708835341365461,
'chexbert_5_macro_f1': 0.49498245207765257,
'chexbert_all_micro_p': 0.5544458762886598,
'chexbert_all_micro_r': 0.4980706154736639,
'chexbert_all_micro_f1': 0.5247484500457363,
'chexbert_all_macro_p': 0.44258976034375364,
'chexbert_all_macro_r': 0.37672752858687886,
'chexbert_all_macro_f1': 0.3883859770668801,
'BLEU_1': 0.4103171077382396,
'BLEU_2': 0.28970066408787387,
'BLEU_3': 0.22010546378006685,
'BLEU_4': 0.17481171574606008,
'METEOR': 0.19054219748683743,
'ROUGE_L': 0.3257898419599922,
'CIDer': 0.3962696560568994
}
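As a reading aid for the CheXbert rows, the micro-averaged F1 above is the harmonic mean of the reported micro precision and recall. The macro F1 is not the harmonic mean of the macro precision and recall, which is consistent with the common definition of macro F1 as the mean of per-class F1 scores (an assumption about how the evaluation tool computes it). The sketch below checks both using the numbers from the table; `f1_from_pr` is a hypothetical helper.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Values copied from 'chexbert_all_micro_p' / 'chexbert_all_micro_r'.
micro_f1 = f1_from_pr(0.5544458762886598, 0.4980706154736639)

# Values copied from 'chexbert_all_macro_p' / 'chexbert_all_macro_r'.
macro_harmonic = f1_from_pr(0.44258976034375364, 0.37672752858687886)

print(f"micro F1 from P/R: {micro_f1:.4f}")
print(f"harmonic mean of macro P/R: {macro_harmonic:.4f}")
```

The first value agrees with the reported `chexbert_all_micro_f1`, while the second comes out higher than the reported `chexbert_all_macro_f1`.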
Citation
@misc{2026-cogaze,
title={Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays},
author={Kang Liu and Zhuoqi Ma and Siyu Liang and Yunan Li and Xiyue Gao and Chao Liang and Kun Xie and Qiguang Miao},
year={2026},
eprint={2603.26049},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.26049},
}
Acknowledgements
- MLRG: dataset & evaluation tools
- cvt2distilgpt2: text generation initialization
Support
If you find this project useful:
- Star this repository
- Open issues for questions or bugs
- Contact Kang Liu (kangliu422@gmail.com) for collaboration