lukeingawesome committed on
Commit
10679da
·
verified ·
1 Parent(s): ac5c585

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +54 -19
README.md CHANGED
@@ -35,10 +35,21 @@ other (micro / macro / weighted F1) — useful for evaluating radiology report g
35
 
36
  ## Label space
37
 
38
- 137 leaf labels over 9 chest-CT sections (Lungs & Airways, Pleura, Mediastinum & Hila,
39
- Cardiovascular, Chest Wall, Bones/Spine, Upper Abdomen, Lower Neck, Others). The exact list is
40
- in `config.json` (`labels`); full definitions and per-split counts are in the dataset's
41
- [`LABEL_HIERARCHY.md`](https://huggingface.co/datasets/chest2vec/chest2vec_labels/blob/main/LABEL_HIERARCHY.md).
42
 
43
  **Ternary head** — `softmax(logits, dim=-1)` over class indices `[0, 1, 2]`:
44
 
@@ -80,15 +91,21 @@ as truth):
80
 
81
  ```python
82
  res = model.score_reports(gt_reports, pred_reports, tokenizer=tok) # equal-length lists
83
- print(res["micro"]["f1"], res["macro"]["f1"], res["weighted"]["f1"])
84
- print(res["per_label"]["Pleural effusion"]) # {'precision':..,'recall':..,'f1':..,'support_gt':..}
85
 
86
  # or one-liner that loads the model for you:
87
  from modeling_chest2vec_labeler import report_f1
88
  report_f1(gt_reports, pred_reports, tokenizer=tok)
89
  ```
90
 
91
- Returns `micro`, `macro`, `weighted` precision/recall/F1 over the 137 labels, plus `per_label`.
92
 
93
  ## Inputs & conventions
94
 
@@ -100,18 +117,36 @@ Returns `micro`, `macro`, `weighted` precision/recall/F1 over the 137 labels, pl
100
 
101
  ## Evaluation
102
 
103
- Direct-paragraph evaluation (`softmax positive-class`, macro-F1 over labels with ≥30 positives
104
- is the stable headline; the all-labels macro is dragged down by sparse-tail labels):
105
-
106
- | Eval set | reports | macro-F1 (≥30) @0.33 | macro-F1 (≥30) @0.5 | macro-AUC (≥30) |
107
- |---|--:|--:|--:|--:|
108
- | CT-RATE revised test (public) | 1,464 | **0.875** | 0.866 | 0.989 |
109
- | sample1000 (private radiologist-reviewed gold) | 1,000 | **0.766** | 0.761 | 0.972 |
110
-
111
- On the public test set, a radiologist reviewed **966 reports**: **857 fully accepted, 60
112
- imperfect-but-acceptable, 49 failed** (94.9% acceptable). AUC barely moves public→private
113
- (0.989 → 0.972), i.e. label ranking transfers to real clinical reports; the F1 gap is mostly
114
- threshold/labeling-convention, not domain failure.
115
 
116
  ## Caveats
117
 
 
35
 
36
  ## Label space
37
 
38
+ The model head predicts **137 leaf labels**. They roll up through the chest-imaging hierarchy
39
+ into **38 upper/container groups** and **10 anatomy sections** (the `label_hierarchy` in
40
+ `config.json`), so predictions and report-comparison F1 can be reported at leaf, upper, or
41
+ anatomy granularity.
42
+
43
+ - The model outputs all **137** leaves. In the training data, **136** of them have at least one
44
+ positive example; the single exception is **`IVC filter`** (kept for taxonomy completeness,
45
+ but it had no positives, so the model effectively never predicts it).
46
+ - The exact label list is in `config.json` (`labels`). Full definitions and per-split counts are
47
+ in the **[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)**
48
+ dataset's [`LABEL_HIERARCHY.md`](https://huggingface.co/datasets/chest2vec/chest2vec_labels/blob/main/LABEL_HIERARCHY.md).
49
+
50
+ This model was **trained and evaluated on the
51
+ [chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
52
+ dataset** (revised CT-RATE, 137-leaf taxonomy).
53
 
54
  **Ternary head** — `softmax(logits, dim=-1)` over class indices `[0, 1, 2]`:
55
 
 
91
 
92
  ```python
93
  res = model.score_reports(gt_reports, pred_reports, tokenizer=tok) # equal-length lists
94
+ # scores are reported at three hierarchy levels:
95
+ for level in ("leaf", "upper", "anatomy"):
96
+ b = res[level]
97
+ print(level, b["n_labels"], b["micro"]["f1"], b["macro"]["f1"], b["weighted"]["f1"])
98
+ print(res["leaf"]["per_label"]["Pleural effusion"]) # {'precision':..,'recall':..,'f1':..,'support_gt':..}
99
 
100
  # or one-liner that loads the model for you:
101
  from modeling_chest2vec_labeler import report_f1
102
  report_f1(gt_reports, pred_reports, tokenizer=tok)
103
  ```
104
 
105
+ Each level (`leaf` = 137 labels, `upper` = 38 container groups, `anatomy` = 10 sections) returns
106
+ `micro` / `macro` / `weighted` precision/recall/F1 plus `per_label`. Upper/anatomy scores are the
107
+ max-over-children roll-up of the leaf predictions (`model.aggregate_hierarchy(...)`). Coarser
108
+ levels are easier to match, so upper/anatomy F1 are typically higher than leaf.
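
The max-over-children roll-up described above can be sketched as follows. This is a minimal illustration and not the model's `aggregate_hierarchy` implementation: the helper name `roll_up_max`, the toy two-group hierarchy, and the scores are assumptions made for the example, and apart from `Pleural effusion` the label names may not match the actual taxonomy.

```python
def roll_up_max(leaf_scores, children_of):
    """Hypothetical helper: a parent's score is the max over its child leaves."""
    return {
        parent: max(leaf_scores[child] for child in children)
        for parent, children in children_of.items()
    }


# Toy positive-class scores for a single report (illustrative values only).
leaf_scores = {
    "Pleural effusion": 0.90,
    "Pleural thickening": 0.20,
    "Cardiomegaly": 0.10,
}

# Toy hierarchy: parent group -> child leaf labels (assumed, not read from config.json).
children_of = {
    "Pleura": ["Pleural effusion", "Pleural thickening"],
    "Cardiovascular": ["Cardiomegaly"],
}

rolled = roll_up_max(leaf_scores, children_of)
print(rolled)  # {'Pleura': 0.9, 'Cardiovascular': 0.1}
```

Because a parent is positive as soon as any one child is, a prediction that picks the wrong leaf inside the right group still counts as a match at the coarser level, which is consistent with upper/anatomy F1 typically being higher than leaf F1.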
109
 
110
  ## Inputs & conventions
111
 
 
117
 
118
  ## Evaluation
119
 
120
+ **How these numbers were produced:** `run_para_v2_eval.sh` runs the model in **direct-paragraph**
121
+ mode (full report, max_len 512) and writes `eval_ctrate_test_direct.json` (public) and
122
+ `eval_sample1000_private.json` (private). Metric = **macro-F1** of the **positive class** (softmax
123
+ probability of class 2) at **threshold 0.33**. Because the all-labels macro is dragged down by
124
+ sparse-tail labels, the **headline restricts to leaf labels with ≥30 positive examples** in that
125
+ eval set: **53 of the evaluated leaves on the public set, 29 on the private set**. Upper/anatomy
126
+ rows are the hierarchy roll-up.
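
The headline metric described above (per-label F1 of the positive class at a fixed threshold, macro-averaged over labels with enough positives) can be sketched in plain Python. This is not the code in `run_para_v2_eval.sh`. The function names, the dictionary layout, and the use of `>=` at the threshold are assumptions, and the toy data uses `min_positives=2` so the filter is visible on four reports.

```python
def f1_binary(y_true, y_pred):
    """F1 for one label from parallel lists of 0/1 (or bool) values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1_at_threshold(gold, prob_pos, threshold=0.33, min_positives=30):
    """Macro-F1 over labels that have at least `min_positives` gold positives.

    gold:     {label: [0/1 per report]}
    prob_pos: {label: [positive-class probability per report]}
    """
    kept = [label for label, y in gold.items() if sum(y) >= min_positives]
    scores = []
    for label in kept:
        # Assumption: a probability exactly at the threshold counts as positive.
        pred = [p >= threshold for p in prob_pos[label]]
        scores.append(f1_binary(gold[label], pred))
    macro = sum(scores) / len(scores) if scores else float("nan")
    return macro, kept


# Toy data: label "B" has a single positive and is filtered out.
gold = {"A": [1, 1, 0, 0], "B": [1, 0, 0, 0]}
prob_pos = {"A": [0.9, 0.4, 0.35, 0.1], "B": [0.2, 0.1, 0.1, 0.1]}

macro, kept = macro_f1_at_threshold(gold, prob_pos, threshold=0.33, min_positives=2)
print(kept, round(macro, 3))  # ['A'] 0.8
```

Restricting the average to well-supported labels keeps a handful of labels with one or two positives, whose F1 swings between 0 and 1 on a single report, from dominating the macro average.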
127
+
128
+ **CT-RATE revised test (public, 1,464 reports)**, from the `chest2vec/chest2vec_labels` test split:
129
+
130
+ | Level | # labels | macro-F1 @0.33 | macro-AUC |
131
+ |---|--:|--:|--:|
132
+ | leaf (≥30 positives) | 53 | **0.875** | 0.989 |
133
+ | leaf (all evaluated) | 131 | 0.749 | — |
134
+ | upper (≥30 positives) | 27 | 0.938 | 0.994 |
135
+ | anatomy | 10 | 0.956 | 0.993 |
136
+
137
+ **Private evaluation set (1,000 reports)** — a held-out internal set, not released:
138
+
139
+ | Level | # labels | macro-F1 @0.33 | macro-AUC |
140
+ |---|--:|--:|--:|
141
+ | leaf (≥30 positives) | 29 | **0.766** | 0.972 |
142
+ | leaf (all evaluated) | 60 | 0.731 | — |
143
+ | upper (≥30 positives) | 19 | 0.837 | — |
144
+ | anatomy | 10 | 0.869 | — |
145
+
146
+ Leaf macro-AUC barely moves public→private (**0.989 → 0.972**), i.e. label ranking transfers to
147
+ the unseen set; the F1 gap is mostly threshold / labeling-convention, not a domain failure.
148
+ Separately, a radiologist spot-checked **966** reports of the public test labels (857 fully
149
+ accepted / 60 imperfect-but-acceptable / 49 failed; see the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)).
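
As a quick arithmetic check on the spot-check counts above, the three categories sum to the 966 reviewed reports, and counting "fully accepted" plus "imperfect-but-acceptable" as acceptable gives the 94.9% rate quoted in the previous version of this README. The variable names below are illustrative.

```python
fully_accepted, imperfect_ok, failed = 857, 60, 49

total = fully_accepted + imperfect_ok + failed
acceptable_rate = (fully_accepted + imperfect_ok) / total

print(total, f"{acceptable_rate:.1%}")  # 966 94.9%
```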
150
 
151
  ## Caveats
152