@@ -35,10 +35,21 @@ other (micro / macro / weighted F1) — useful for evaluating radiology report g
 ## Label space
-137 leaf labels over 9 chest-CT sections (Lungs & Airways, Pleura, Mediastinum & Hila,
-Cardiovascular, Chest Wall, Bones/Spine, Upper Abdomen, Lower Neck, Others). The exact list is
-in `config.json` (`labels`); full definitions and per-split counts are in the dataset's
-[`LABEL_HIERARCHY.md`](https://huggingface.co/datasets/chest2vec/chest2vec_labels/blob/main/LABEL_HIERARCHY.md).
 **Ternary head** — `softmax(logits, dim=-1)` over class indices `[0, 1, 2]`:
@@ -80,15 +91,21 @@ as truth):
 ```python
 res = model.score_reports(gt_reports, pred_reports, tokenizer=tok)   # equal-length lists
-print(res["micro"]["f1"], res["macro"]["f1"], res["weighted"]["f1"])
-print(res["per_label"]["Pleural effusion"])   # {'precision':..,'recall':..,'f1':..,'support_gt':..}
 # or one-liner that loads the model for you:
 from modeling_chest2vec_labeler import report_f1
 report_f1(gt_reports, pred_reports, tokenizer=tok)
 ```
-Returns `micro`, `macro`, `weighted` precision/recall/F1 over the 137 labels, plus `per_label`.
 ## Inputs & conventions
@@ -100,18 +117,36 @@ Returns `micro`, `macro`, `weighted` precision/recall/F1 over the 137 labels, pl
 ## Evaluation
-Direct-paragraph evaluation (`softmax positive-class`, macro-F1 over labels with ≥30 positives —
-the stable headline; the all-labels macro is dragged down by sparse-tail labels):
-| Eval set | reports | macro-F1 (≥30) @0.33 | macro-F1 (≥30) @0.5 | macro-AUC (≥30) |
-|---|--:|--:|--:|--:|
-| CT-RATE revised test (public) | 1,464 | **0.875** | 0.866 | 0.989 |
-| sample1000 (private radiologist-reviewed gold) | 1,000 | **0.766** | 0.761 | 0.972 |
-On the public test set, a radiologist reviewed **966 reports**: **857 fully accepted, 60
-imperfect-but-acceptable, 49 failed** (94.9% acceptable). AUC barely moves public→private
-(0.989 → 0.972), i.e. label ranking transfers to real clinical reports; the F1 gap is mostly
-threshold/labeling-convention, not domain failure.
 ## Caveats

 ## Label space
+The model head predicts **137 leaf labels**. They roll up through the chest-imaging hierarchy
+into **38 upper/container groups** and **10 anatomy sections** (the `label_hierarchy` in
+`config.json`), so predictions and report-comparison F1 can be reported at leaf, upper, or
+anatomy granularity.
+- The model outputs all **137** leaves. In the training data, **136** of them have at least one
+  positive example; the single exception is **`IVC filter`** (kept for taxonomy completeness,
+  but it had no positives, so the model effectively never predicts it).
+- The exact label list is in `config.json` (`labels`). Full definitions and per-split counts are
+  in the **[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)**
+  dataset's [`LABEL_HIERARCHY.md`](https://huggingface.co/datasets/chest2vec/chest2vec_labels/blob/main/LABEL_HIERARCHY.md).
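As a minimal sketch of how the label list can be used, the snippet below reads `labels` from a `config.json` and maps each label name to a column index. It assumes that `labels` is a flat JSON array of strings and that the model's output columns follow the same order; neither is confirmed by this card. `load_label_index` is a hypothetical helper, not part of the repository, and the three-label config is a stand-in for the real 137-leaf file.

```python
import json
import tempfile
from pathlib import Path


def load_label_index(config_path):
    """Read `labels` from a config.json and map each label name to its position."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    labels = config["labels"]  # assumption: a flat JSON array of label names
    if len(set(labels)) != len(labels):
        raise ValueError("config.json contains duplicate label names")
    return {name: i for i, name in enumerate(labels)}


# Stand-in config with three labels; the real file lists all 137 leaves.
# "Emphysema" is an illustrative name, not confirmed to be in the taxonomy.
toy_config = {"labels": ["Pleural effusion", "IVC filter", "Emphysema"]}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps(toy_config), encoding="utf-8")
    label_index = load_label_index(path)

print(label_index["IVC filter"], len(label_index))
```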
+This model was **trained and evaluated on the
+[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
+dataset** (revised CT-RATE, 137-leaf taxonomy).
 **Ternary head** — `softmax(logits, dim=-1)` over class indices `[0, 1, 2]`:
 ```python
 res = model.score_reports(gt_reports, pred_reports, tokenizer=tok)   # equal-length lists
+# scores are reported at three hierarchy levels:
+for level in ("leaf", "upper", "anatomy"):
+    b = res[level]
+    print(level, b["n_labels"], b["micro"]["f1"], b["macro"]["f1"], b["weighted"]["f1"])
+print(res["leaf"]["per_label"]["Pleural effusion"])   # {'precision':..,'recall':..,'f1':..,'support_gt':..}
 # or one-liner that loads the model for you:
 from modeling_chest2vec_labeler import report_f1
 report_f1(gt_reports, pred_reports, tokenizer=tok)
 ```
+Each level (`leaf` = 137 labels, `upper` = 38 container groups, `anatomy` = 10 sections) returns
+`micro` / `macro` / `weighted` precision/recall/F1 plus `per_label`. Upper and anatomy scores are
+computed from a max-over-children roll-up of the leaf predictions
+(`model.aggregate_hierarchy(...)`). Coarser levels are easier to match, so upper and anatomy F1
+scores are typically higher than leaf F1.
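The max-over-children roll-up can be illustrated with a small standalone sketch. This is not the code behind `model.aggregate_hierarchy(...)`, whose signature and internals are not shown here. `rollup_max` is a hypothetical helper, and the two-group hierarchy and the leaf names other than `Pleural effusion` are made up for illustration. The same logic applies whether the leaf values are probabilities or 0/1 predictions.

```python
def rollup_max(leaf_scores, parent_to_children):
    """Give each parent group the maximum score found among its child leaves."""
    rolled = {}
    for parent, children in parent_to_children.items():
        present = [leaf_scores[c] for c in children if c in leaf_scores]
        if present:  # skip groups with no scored children
            rolled[parent] = max(present)
    return rolled


# Hypothetical hierarchy; the real mapping is `label_hierarchy` in config.json.
hierarchy = {
    "Pleura": ["Pleural effusion", "Pneumothorax"],
    "Lungs & Airways": ["Emphysema"],
}
leaf = {"Pleural effusion": 0.91, "Pneumothorax": 0.04, "Emphysema": 0.12}

print(rollup_max(leaf, hierarchy))  # {'Pleura': 0.91, 'Lungs & Airways': 0.12}
```

A group counts as positive as soon as any one of its leaves does, which is why a prediction that picks the wrong leaf inside the right group can still match at the coarser level.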
 ## Inputs & conventions
 ## Evaluation
+**How these numbers were produced:** `run_para_v2_eval.sh` runs the model in **direct-paragraph**
+mode (full report, max_len 512) and writes `eval_ctrate_test_direct.json` (public) and
+`eval_sample1000_private.json` (private). The metric is **macro-F1** of the **positive class**
+(softmax probability of class 2) at **threshold 0.33**. Because the all-labels macro is dragged
+down by sparse-tail labels, the **headline is restricted to leaf labels with ≥30 positive
+examples** in that eval set: **53 of the 131 evaluated leaves on the public set, 29 of the 60 on
+the private set**. Upper and anatomy rows are computed on the hierarchy roll-up.
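The headline metric described above can be sketched in plain Python. This is not the logic of `run_para_v2_eval.sh`, which is not shown here. `macro_f1_at_threshold` and the `label -> list` input layout are hypothetical, and treating a probability exactly equal to the threshold as positive (`>=`) is an assumption. The toy call lowers `min_positives` to 2 so the support filter is visible on four reports.

```python
def f1_binary(y_true, y_pred):
    """F1 of the positive class for 0/1 truth and prediction lists."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1_at_threshold(y_true, pos_prob, threshold=0.33, min_positives=30):
    """Average per-label F1 over labels with at least `min_positives` positives.

    y_true / pos_prob: dict mapping label -> list with one entry per report.
    Returns (macro_f1, number_of_labels_kept).
    """
    scores = []
    for label, truth in y_true.items():
        if sum(truth) < min_positives:
            continue  # sparse-tail label, excluded from the headline
        pred = [1 if p >= threshold else 0 for p in pos_prob[label]]
        scores.append(f1_binary(truth, pred))
    macro = sum(scores) / len(scores) if scores else float("nan")
    return macro, len(scores)


# Synthetic data: label "A" has 2 positives, label "B" has only 1.
truth = {"A": [1, 1, 0, 0], "B": [1, 0, 0, 0]}
probs = {"A": [0.9, 0.4, 0.35, 0.1], "B": [0.2, 0.1, 0.1, 0.1]}
macro, kept = macro_f1_at_threshold(truth, probs, threshold=0.33, min_positives=2)
print(round(macro, 3), kept)  # 0.8 1
```

Only label "A" survives the filter here, so the macro average equals its F1. With `min_positives=1`, label "B" would be included and would pull the average down, which is the sparse-tail effect the headline avoids.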
+**CT-RATE revised test (public, 1,464 reports)** — from `chest2vec/chest2vec_labels` test split:
+| Level | # labels | macro-F1 @0.33 | macro-AUC |
+|---|--:|--:|--:|
+| leaf (≥30 positives) | 53 | **0.875** | 0.989 |
+| leaf (all evaluated) | 131 | 0.749 | — |
+| upper (≥30 positives) | 27 | 0.938 | 0.994 |
+| anatomy | 10 | 0.956 | 0.993 |
+**Private evaluation set (1,000 reports)** — a held-out internal set, not released:
+| Level | # labels | macro-F1 @0.33 | macro-AUC |
+|---|--:|--:|--:|
+| leaf (≥30 positives) | 29 | **0.766** | 0.972 |
+| leaf (all evaluated) | 60 | 0.731 | — |
+| upper (≥30 positives) | 19 | 0.837 | — |
+| anatomy | 10 | 0.869 | — |
+Leaf macro-AUC barely moves public→private (**0.989 → 0.972**), i.e. label ranking transfers to
+the unseen set; the leaf F1 gap (0.875 → 0.766) is mostly a threshold and labeling-convention
+effect, not a domain failure.
+Separately, a radiologist spot-checked the labels of **966** public test reports (857 fully
+accepted, 60 imperfect-but-acceptable, 49 failed; see the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)).
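The distinction between AUC and thresholded F1 drawn above can be shown on synthetic numbers. The sketch below is an illustration only and uses none of the model's real outputs. `auc` and `f1_at` are hypothetical helpers, and the score lists are invented. Halving every score preserves the ranking, so AUC is unchanged, while F1 at a fixed threshold of 0.33 drops.

```python
def auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at(y_true, scores, threshold):
    """Positive-class F1 after thresholding scores (score >= threshold is positive)."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Synthetic example: same ranking, differently scaled scores.
y = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.7, 0.5, 0.3, 0.2, 0.1]
scores_b = [s * 0.5 for s in scores_a]  # every score halved, order unchanged

print(auc(y, scores_a), auc(y, scores_b))
print(f1_at(y, scores_a, 0.33), f1_at(y, scores_b, 0.33))
```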
 ## Caveats