Upload MIMIC test evaluation results

Files changed (6)

README.md +81 -104
evaluations/mimic_test_findings_only_metrics.json +34 -0
evaluations/mimic_test_findings_only_predictions.csv +0 -0
evaluations/mimic_test_metrics.json +71 -1
evaluations/mimic_test_predictions.csv +0 -0
run_summary.json +71 -0

README.md CHANGED

@@ -29,14 +29,6 @@ metrics:
 ![Layer-Wise Anatomical Attention](assets/AnatomicalAttention.gif)
-## Status
-- Project status: `Training in progress`
-- Release status: `Research preview checkpoint`
-- Current checkpoint status: `Not final`
-- Training completion toward planned run: `100.00%` (`4.000` / `3` epochs)
-- Current published metrics are intermediate and will change as training continues.
 ## Overview
 LAnA is a medical report-generation project for chest X-ray images. The completed project is intended to generate radiology reports with a vision-language model guided by layer-wise anatomical attention built from predicted anatomical masks.
@@ -45,82 +37,6 @@ The architecture combines a DINOv3 vision encoder, lung and heart segmentation h
 ## How to Run
-For local inference instructions, go to the [Inference](#inference) section.
-## Intended Use
-- Input: a chest X-ray image resized to `512x512` and normalized with ImageNet mean/std.
-- Output: a generated radiology report.
-- Best fit: research use, report-generation experiments, and anatomical-attention ablations.
-## Data
-- Full project datasets: CheXpert and MIMIC-CXR.
-- Intended project scope: train on curated chest X-ray/report data from both datasets and evaluate on MIMIC-CXR test studies.
-- Current released checkpoint datasets: `CheXpert, MIMIC-CXR` for training and `CheXpert, MIMIC-CXR` for validation.
-- Current published evaluation: MIMIC-CXR test split, `frontal-only (PA/AP)` studies.
-## Evaluation
-- Text-generation metrics used in this project include BLEU, METEOR, ROUGE, and CIDEr.
-- Medical report metrics implemented in the repository include RadGraph F1 and CheXpert F1 (`14-micro`, `5-micro`, `14-macro`, `5-macro`).
-## Training Snapshot
-- Run: `full_3_epoch_mask_run`
-- This section describes the current public checkpoint, not the final completed project.
-- Method: `lora_adamw`
-- Vision encoder: `facebook/dinov3-vits16-pretrain-lvd1689m`
-- Text decoder: `gpt2`
-- Segmentation encoder: `facebook/dinov3-convnext-small-pretrain-lvd1689m`
-- Image size: `512`
-- Local batch size: `1`
-- Effective global batch size: `8`
-- Scheduler: `cosine`
-- Warmup steps: `5114`
-- Weight decay: `0.01`
-- Steps completed: `102264`
-- Planned total steps: `102276`
-- Images seen: `818196`
-- Total training time: `23.5798` hours
-- Hardware: `NVIDIA GeForce RTX 5070`
-- Final train loss: `1.1683`
-- Validation loss: `1.3692`
-## MIMIC Test Results
-Frontal-only evaluation using `PA/AP` studies only.
-### Current Checkpoint Results
-| Metric | Value |
-| --- | --- |
-| Number of studies | `3041` |
-| RadGraph F1 | `0.0918` |
-| RadGraph entity F1 | `0.1399` |
-| RadGraph relation F1 | `0.1246` |
-| CheXpert F1 14-micro | `0.1829` |
-| CheXpert F1 5-micro | `0.2183` |
-| CheXpert F1 14-macro | `0.1095` |
-| CheXpert F1 5-macro | `0.1634` |
-### Final Completed Training Results
-The final table will be populated when the planned training run is completed. Until then, final-report metrics remain `TBD`.
-| Metric | Value |
-| --- | --- |
-| Number of studies | TBD |
-| RadGraph F1 | TBD |
-| RadGraph entity F1 | TBD |
-| RadGraph relation F1 | TBD |
-| CheXpert F1 14-micro | TBD |
-| CheXpert F1 5-micro | TBD |
-| CheXpert F1 14-macro | TBD |
-| CheXpert F1 5-macro | TBD |
-## Inference
 Standard `AutoModel.from_pretrained(..., trust_remote_code=True)` loading is currently blocked for this repo because the custom model constructor performs nested pretrained submodel loads.
 Use the verified manual load path below instead: download the HF repo snapshot, import the downloaded package, and load the exported `model.safetensors` directly.
@@ -171,27 +87,88 @@ report = model.tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
 print(report)
 ```
 ## Notes
 - `segmenters/` contains the lung and heart segmentation checkpoints used to build anatomical attention masks.
 - `evaluations/mimic_test_metrics.json` contains the latest saved MIMIC test metrics.
-<!-- EVAL_RESULTS_START -->
-## Latest Evaluation
-- Dataset: `MIMIC-CXR test`
-- View filter: `frontal-only (PA/AP)`
-- Number of examples: `3041`
-- CheXpert F1 14-micro: `0.1829`
-- CheXpert F1 5-micro: `0.2183`
-- CheXpert F1 14-macro: `0.1095`
-- CheXpert F1 5-macro: `0.1634`
-- RadGraph F1: `0.0918`
-- RadGraph entity F1: `0.1399`
-- RadGraph relation F1: `0.1246`
-- RadGraph available: `True`
-- RadGraph error: `None`
-- Evaluation file: `evaluations/mimic_test_metrics.json`
-- Predictions file: `evaluations/mimic_test_predictions.csv`
-<!-- EVAL_RESULTS_END -->

 ![Layer-Wise Anatomical Attention](assets/AnatomicalAttention.gif)
 ## Overview
 LAnA is a medical report-generation project for chest X-ray images. The completed project is intended to generate radiology reports with a vision-language model guided by layer-wise anatomical attention built from predicted anatomical masks.
 ## How to Run
 Standard `AutoModel.from_pretrained(..., trust_remote_code=True)` loading is currently blocked for this repo because the custom model constructor performs nested pretrained submodel loads.
 Use the verified manual load path below instead: download the HF repo snapshot, import the downloaded package, and load the exported `model.safetensors` directly.
 print(report)
 ```
+## Intended Use
+- Input: a chest X-ray image resized to `512x512` and normalized with ImageNet mean/std.
+- Output: a generated radiology report.
+- Best fit: research use, report-generation experiments, and anatomical-attention ablations.
+## MIMIC Test Results
+Frontal-only evaluation using `PA/AP` studies only.
+### Final Completed Training Results
+These final-report metrics correspond to the completed training run.
+### All Frontal Test Studies
+| Metric | Value |
+| --- | --- |
+| Number of studies | `3041` |
+| RadGraph F1 | `0.0918` |
+| RadGraph entity F1 | `0.1399` |
+| RadGraph relation F1 | `0.1246` |
+| CheXpert F1 14-micro | `0.1829` |
+| CheXpert F1 5-micro | `0.2183` |
+| CheXpert F1 14-macro | `0.1095` |
+| CheXpert F1 5-macro | `0.1634` |
+### Findings-Only Frontal Test Studies
+| Metric | Value |
+| --- | --- |
+| Number of studies | `2210` |
+| RadGraph F1 | `0.1010` |
+| RadGraph entity F1 | `0.1517` |
+| RadGraph relation F1 | `0.1347` |
+| CheXpert F1 14-micro | `0.1651` |
+| CheXpert F1 5-micro | `0.2152` |
+| CheXpert F1 14-macro | `0.1047` |
+| CheXpert F1 5-macro | `0.1611` |
+## Data
+- Full project datasets: CheXpert and MIMIC-CXR.
+- Intended project scope: train on curated chest X-ray/report data from both datasets and evaluate on MIMIC-CXR test studies.
+- Current released checkpoint datasets: `CheXpert, MIMIC-CXR` for training and `CheXpert, MIMIC-CXR` for validation.
+- Current published evaluation: MIMIC-CXR test split, `frontal-only (PA/AP)` studies.
+## Evaluation
+- Medical report metrics implemented in the repository include RadGraph F1 and CheXpert F1 (`14-micro`, `5-micro`, `14-macro`, `5-macro`).
+## Training Snapshot
+- Run: `full_3_epoch_mask_run`
+- This section describes the completed public training run.
+- Method: `lora_adamw`
+- Vision encoder: `facebook/dinov3-vits16-pretrain-lvd1689m`
+- Text decoder: `gpt2`
+- Segmentation encoder: `facebook/dinov3-convnext-small-pretrain-lvd1689m`
+- Image size: `512`
+- Local batch size: `1`
+- Effective global batch size: `8`
+- Scheduler: `cosine`
+- Warmup steps: `5114`
+- Weight decay: `0.01`
+- Steps completed: `102264`
+- Planned total steps: `102276`
+- Images seen: `818196`
+- Total training time: `23.5798` hours
+- Hardware: `NVIDIA GeForce RTX 5070`
+- Final train loss: `1.1683`
+- Validation loss: `1.3692`
+## Status
+- Project status: `Training completed`
+- Release status: `Completed training run`
+- Current checkpoint status: `Final completed run`
+- Training completion toward planned run: `100.00%` (`3` / `3` epochs)
+- Current published metrics correspond to the completed training run.
 ## Notes
 - `segmenters/` contains the lung and heart segmentation checkpoints used to build anatomical attention masks.
 - `evaluations/mimic_test_metrics.json` contains the latest saved MIMIC test metrics.

evaluations/mimic_test_findings_only_metrics.json ADDED

@@ -0,0 +1,34 @@
+{
+  "split": "test",
+  "subset": "findings-only frontal studies",
+  "dataset": "mimic-cxr",
+  "view_filter": "frontal-only (PA/AP), structured Findings section only",
+  "num_examples": 2210,
+  "chexpert_f1_14_micro": 0.16506270049577138,
+  "chexpert_f1_5_micro": 0.21520692974013475,
+  "chexpert_f1_14_macro": 0.10472446617305661,
+  "chexpert_f1_5_macro": 0.16106779379149633,
+  "chexpert_f1_micro": 0.16506270049577138,
+  "chexpert_f1_macro": 0.10472446617305661,
+  "chexpert_per_label_f1": {
+    "Enlarged Cardiomediastinum": 0.0,
+    "Cardiomegaly": 0.09737827715355805,
+    "Lung Opacity": 0.0,
+    "Lung Lesion": 0.0,
+    "Edema": 0.27852998065764023,
+    "Consolidation": 0.0667384284176534,
+    "Pneumonia": 0.1375796178343949,
+    "Atelectasis": 0.0482897384305835,
+    "Pneumothorax": 0.021455938697318006,
+    "Pleural Effusion": 0.31440254429804637,
+    "Pleural Other": 0.0,
+    "Fracture": 0.06052631578947368,
+    "Support Devices": 0.4412416851441242,
+    "No Finding": 0.0
+  },
+  "radgraph_f1": 0.10102933280223365,
+  "radgraph_f1_entity": 0.15171508935265537,
+  "radgraph_f1_relation": 0.13465579667248295,
+  "radgraph_available": true,
+  "radgraph_error": null
+}
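
The macro scores in this file can be cross-checked against the per-label values. The following is a minimal sketch, assuming `chexpert_f1_14_macro` is the unweighted mean over all 14 labels and `chexpert_f1_5_macro` is the unweighted mean over the five labels commonly used for CheXpert comparisons (Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural Effusion). The diff does not say which five labels the repository uses, so that subset is an assumption, and the helper name `macro_f1` is hypothetical rather than taken from the repository.

```python
import math

# Per-label F1 values copied from
# evaluations/mimic_test_findings_only_metrics.json.
per_label_f1 = {
    "Enlarged Cardiomediastinum": 0.0,
    "Cardiomegaly": 0.09737827715355805,
    "Lung Opacity": 0.0,
    "Lung Lesion": 0.0,
    "Edema": 0.27852998065764023,
    "Consolidation": 0.0667384284176534,
    "Pneumonia": 0.1375796178343949,
    "Atelectasis": 0.0482897384305835,
    "Pneumothorax": 0.021455938697318006,
    "Pleural Effusion": 0.31440254429804637,
    "Pleural Other": 0.0,
    "Fracture": 0.06052631578947368,
    "Support Devices": 0.4412416851441242,
    "No Finding": 0.0,
}

# Assumption: the "5" subset is the usual CheXpert comparison label set.
FIVE_LABELS = [
    "Atelectasis",
    "Cardiomegaly",
    "Consolidation",
    "Edema",
    "Pleural Effusion",
]


def macro_f1(scores, labels=None):
    """Unweighted mean of per-label F1 over `labels` (all labels if None)."""
    names = list(scores) if labels is None else labels
    return sum(scores[name] for name in names) / len(names)


macro_14 = macro_f1(per_label_f1)
macro_5 = macro_f1(per_label_f1, FIVE_LABELS)

# Compare against the macro values reported in the same JSON file.
matches_14 = math.isclose(macro_14, 0.10472446617305661, abs_tol=1e-9)
matches_5 = math.isclose(macro_5, 0.16106779379149633, abs_tol=1e-9)
print(matches_14, matches_5)
```

Under these assumptions both comparisons agree with the reported values, which suggests the macro scores are plain averages in which labels with an F1 of `0.0` count fully.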

evaluations/mimic_test_findings_only_predictions.csv ADDED

The diff for this file is too large to render. See raw diff

evaluations/mimic_test_metrics.json CHANGED

@@ -1,5 +1,6 @@
 {
   "split": "test",
   "dataset": "mimic-cxr",
   "view_filter": "frontal-only (PA/AP)",
   "num_examples": 3041,
@@ -29,5 +30,74 @@
   "radgraph_f1_entity": 0.13993790644379023,
   "radgraph_f1_relation": 0.12464719867951028,
   "radgraph_available": true,
-  "radgraph_error": null
 }

 {
   "split": "test",
+  "subset": "all frontal studies",
   "dataset": "mimic-cxr",
   "view_filter": "frontal-only (PA/AP)",
   "num_examples": 3041,
   "radgraph_f1_entity": 0.13993790644379023,
   "radgraph_f1_relation": 0.12464719867951028,
   "radgraph_available": true,
+  "radgraph_error": null,
+  "evaluation_suite": "mimic_test_dual",
+  "all_test": {
+    "split": "test",
+    "subset": "all frontal studies",
+    "dataset": "mimic-cxr",
+    "view_filter": "frontal-only (PA/AP)",
+    "num_examples": 3041,
+    "chexpert_f1_14_micro": 0.18291666666666664,
+    "chexpert_f1_5_micro": 0.21831082003001773,
+    "chexpert_f1_14_macro": 0.10945797832551928,
+    "chexpert_f1_5_macro": 0.1633553219570594,
+    "chexpert_f1_micro": 0.18291666666666664,
+    "chexpert_f1_macro": 0.10945797832551928,
+    "chexpert_per_label_f1": {
+      "Enlarged Cardiomediastinum": 0.0,
+      "Cardiomegaly": 0.10195227765726682,
+      "Lung Opacity": 0.0020470829068577278,
+      "Lung Lesion": 0.0,
+      "Edema": 0.2789757412398922,
+      "Consolidation": 0.06424344885883347,
+      "Pneumonia": 0.14311926605504585,
+      "Atelectasis": 0.0428380187416332,
+      "Pneumothorax": 0.030358227079538558,
+      "Pleural Effusion": 0.32876712328767127,
+      "Pleural Other": 0.0,
+      "Fracture": 0.0633879781420765,
+      "Support Devices": 0.4767225325884544,
+      "No Finding": 0.0
+    },
+    "radgraph_f1": 0.09181957971495504,
+    "radgraph_f1_entity": 0.13993790644379023,
+    "radgraph_f1_relation": 0.12464719867951028,
+    "radgraph_available": true,
+    "radgraph_error": null
+  },
+  "findings_only_test": {
+    "split": "test",
+    "subset": "findings-only frontal studies",
+    "dataset": "mimic-cxr",
+    "view_filter": "frontal-only (PA/AP), structured Findings section only",
+    "num_examples": 2210,
+    "chexpert_f1_14_micro": 0.16506270049577138,
+    "chexpert_f1_5_micro": 0.21520692974013475,
+    "chexpert_f1_14_macro": 0.10472446617305661,
+    "chexpert_f1_5_macro": 0.16106779379149633,
+    "chexpert_f1_micro": 0.16506270049577138,
+    "chexpert_f1_macro": 0.10472446617305661,
+    "chexpert_per_label_f1": {
+      "Enlarged Cardiomediastinum": 0.0,
+      "Cardiomegaly": 0.09737827715355805,
+      "Lung Opacity": 0.0,
+      "Lung Lesion": 0.0,
+      "Edema": 0.27852998065764023,
+      "Consolidation": 0.0667384284176534,
+      "Pneumonia": 0.1375796178343949,
+      "Atelectasis": 0.0482897384305835,
+      "Pneumothorax": 0.021455938697318006,
+      "Pleural Effusion": 0.31440254429804637,
+      "Pleural Other": 0.0,
+      "Fracture": 0.06052631578947368,
+      "Support Devices": 0.4412416851441242,
+      "No Finding": 0.0
+    },
+    "radgraph_f1": 0.10102933280223365,
+    "radgraph_f1_entity": 0.15171508935265537,
+    "radgraph_f1_relation": 0.13465579667248295,
+    "radgraph_available": true,
+    "radgraph_error": null
+  }
 }
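
The nested `all_test` and `findings_only_test` blocks carry the same numbers that appear, rounded to four decimals, in the README tables. The sketch below shows one way those tables could be produced from the JSON. It is an illustration only: the `ROWS` mapping and the `render_table` helper are hypothetical names, not the repository's own README generator, and the embedded JSON is an abbreviated copy that keeps only the keys shown in the tables.

```python
import json

# Abbreviated copy of the nested layout in evaluations/mimic_test_metrics.json.
SUITE_JSON = """
{
  "evaluation_suite": "mimic_test_dual",
  "all_test": {
    "num_examples": 3041,
    "radgraph_f1": 0.09181957971495504,
    "radgraph_f1_entity": 0.13993790644379023,
    "radgraph_f1_relation": 0.12464719867951028,
    "chexpert_f1_14_micro": 0.18291666666666664,
    "chexpert_f1_5_micro": 0.21831082003001773,
    "chexpert_f1_14_macro": 0.10945797832551928,
    "chexpert_f1_5_macro": 0.1633553219570594
  },
  "findings_only_test": {
    "num_examples": 2210,
    "radgraph_f1": 0.10102933280223365,
    "radgraph_f1_entity": 0.15171508935265537,
    "radgraph_f1_relation": 0.13465579667248295,
    "chexpert_f1_14_micro": 0.16506270049577138,
    "chexpert_f1_5_micro": 0.21520692974013475,
    "chexpert_f1_14_macro": 0.10472446617305661,
    "chexpert_f1_5_macro": 0.16106779379149633
  }
}
"""

# Hypothetical mapping from README row labels to JSON keys.
ROWS = [
    ("RadGraph F1", "radgraph_f1"),
    ("RadGraph entity F1", "radgraph_f1_entity"),
    ("RadGraph relation F1", "radgraph_f1_relation"),
    ("CheXpert F1 14-micro", "chexpert_f1_14_micro"),
    ("CheXpert F1 5-micro", "chexpert_f1_5_micro"),
    ("CheXpert F1 14-macro", "chexpert_f1_14_macro"),
    ("CheXpert F1 5-macro", "chexpert_f1_5_macro"),
]


def render_table(metrics):
    """Render one metrics block as a Markdown table with 4-decimal values."""
    lines = ["| Metric | Value |", "| --- | --- |"]
    lines.append(f"| Number of studies | `{metrics['num_examples']}` |")
    for label, key in ROWS:
        lines.append(f"| {label} | `{metrics[key]:.4f}` |")
    return "\n".join(lines)


suite = json.loads(SUITE_JSON)
tables = {
    name: render_table(suite[name])
    for name in ("all_test", "findings_only_test")
}
for name, table in tables.items():
    print(f"### {name}")
    print(table)
```

Keeping both subsets under one top-level object means a reader can compare the all-frontal and findings-only results from a single file, while the separate `mimic_test_findings_only_metrics.json` remains available for tools that expect the flat layout.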

evaluations/mimic_test_predictions.csv CHANGED

The diff for this file is too large to render. See raw diff

run_summary.json CHANGED

@@ -42,6 +42,7 @@
   "validation_datasets": "CheXpert, MIMIC-CXR",
   "latest_evaluation": {
     "split": "test",
     "dataset": "mimic-cxr",
     "view_filter": "frontal-only (PA/AP)",
     "num_examples": 3041,
@@ -72,5 +73,75 @@
     "radgraph_f1_relation": 0.12464719867951028,
     "radgraph_available": true,
     "radgraph_error": null
   }
 }

   "validation_datasets": "CheXpert, MIMIC-CXR",
   "latest_evaluation": {
     "split": "test",
+    "subset": "all frontal studies",
     "dataset": "mimic-cxr",
     "view_filter": "frontal-only (PA/AP)",
     "num_examples": 3041,
     "radgraph_f1_relation": 0.12464719867951028,
     "radgraph_available": true,
     "radgraph_error": null
+  },
+  "latest_evaluations": {
+    "all_test": {
+      "split": "test",
+      "subset": "all frontal studies",
+      "dataset": "mimic-cxr",
+      "view_filter": "frontal-only (PA/AP)",
+      "num_examples": 3041,
+      "chexpert_f1_14_micro": 0.18291666666666664,
+      "chexpert_f1_5_micro": 0.21831082003001773,
+      "chexpert_f1_14_macro": 0.10945797832551928,
+      "chexpert_f1_5_macro": 0.1633553219570594,
+      "chexpert_f1_micro": 0.18291666666666664,
+      "chexpert_f1_macro": 0.10945797832551928,
+      "chexpert_per_label_f1": {
+        "Enlarged Cardiomediastinum": 0.0,
+        "Cardiomegaly": 0.10195227765726682,
+        "Lung Opacity": 0.0020470829068577278,
+        "Lung Lesion": 0.0,
+        "Edema": 0.2789757412398922,
+        "Consolidation": 0.06424344885883347,
+        "Pneumonia": 0.14311926605504585,
+        "Atelectasis": 0.0428380187416332,
+        "Pneumothorax": 0.030358227079538558,
+        "Pleural Effusion": 0.32876712328767127,
+        "Pleural Other": 0.0,
+        "Fracture": 0.0633879781420765,
+        "Support Devices": 0.4767225325884544,
+        "No Finding": 0.0
+      },
+      "radgraph_f1": 0.09181957971495504,
+      "radgraph_f1_entity": 0.13993790644379023,
+      "radgraph_f1_relation": 0.12464719867951028,
+      "radgraph_available": true,
+      "radgraph_error": null
+    },
+    "findings_only_test": {
+      "split": "test",
+      "subset": "findings-only frontal studies",
+      "dataset": "mimic-cxr",
+      "view_filter": "frontal-only (PA/AP), structured Findings section only",
+      "num_examples": 2210,
+      "chexpert_f1_14_micro": 0.16506270049577138,
+      "chexpert_f1_5_micro": 0.21520692974013475,
+      "chexpert_f1_14_macro": 0.10472446617305661,
+      "chexpert_f1_5_macro": 0.16106779379149633,
+      "chexpert_f1_micro": 0.16506270049577138,
+      "chexpert_f1_macro": 0.10472446617305661,
+      "chexpert_per_label_f1": {
+        "Enlarged Cardiomediastinum": 0.0,
+        "Cardiomegaly": 0.09737827715355805,
+        "Lung Opacity": 0.0,
+        "Lung Lesion": 0.0,
+        "Edema": 0.27852998065764023,
+        "Consolidation": 0.0667384284176534,
+        "Pneumonia": 0.1375796178343949,
+        "Atelectasis": 0.0482897384305835,
+        "Pneumothorax": 0.021455938697318006,
+        "Pleural Effusion": 0.31440254429804637,
+        "Pleural Other": 0.0,
+        "Fracture": 0.06052631578947368,
+        "Support Devices": 0.4412416851441242,
+        "No Finding": 0.0
+      },
+      "radgraph_f1": 0.10102933280223365,
+      "radgraph_f1_entity": 0.15171508935265537,
+      "radgraph_f1_relation": 0.13465579667248295,
+      "radgraph_available": true,
+      "radgraph_error": null
+    }
   }
 }