lukeingawesome committed on
Commit
b743033
·
verified ·
1 Parent(s): 33781fc

Honest fix: chest2err-score uses K_total (severity head not reliably trained in v0.1)

Files changed (1)
  1. README.md +18 -14
README.md CHANGED
@@ -30,24 +30,24 @@ Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datas
30
  ## The chest2err-score
31
 
32
  ```
33
- chest2err_score = exp(−K_w)
34
- K_w = K_critical + 0.25 × K_minor
35
  ```
36
 
37
- where `K_critical` and `K_minor` are the counts of Critical and Minor errors emitted by the decoder.
38
 
39
- | chest2err-score | K_w | interpretation |
40
  |---:|---:|---|
41
- | **1.00** | 0 | perfect: no errors |
42
- | 0.78 | 0.25 | one Minor error |
43
- | 0.37 | 1 | one Critical error |
44
- | 0.14 | 2 | two Critical (or 1 Critical + 4 Minor) |
45
  | 0.05 | 3 | substantial errors |
46
  | < 0.01 | ≥ 5 | severely degraded |
47
 
48
  Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
49
 
50
- The score is rank-equivalent to `−K_w`, so all Kendall τ_b benchmarks transfer unchanged from the count form.
 
 
51
 
52
  ## Headline metrics
53
 
@@ -55,14 +55,16 @@ Evaluated on the 400-pair `chest2error-bench` gold set:
55
 
56
  | metric | value |
57
  |---|---|
58
- | **Kendall τ_b vs Critical errors** | **+0.763** |
59
  | Kendall τ_b vs total errors | +0.665 |
 
60
  | Kendall τ_b vs severity-weighted | +0.734 |
61
  | **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
62
  | Critical-error AUROC | 0.963 |
63
  | MAE of K_total | 1.12 |
64
  | **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
65
 
 
 
66
  For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.
67
 
68
  ### CXR/CT generalization
@@ -135,20 +137,20 @@ A self-contained HF `from_pretrained` loader is on the roadmap. Until then, infe
135
 
136
  ## Output schema
137
 
138
- The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_w)` as above). The score is backed by a sequence of structured error tuples; each generated tuple is:
139
 
140
  ```python
141
  {
142
  "cat": int, # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
143
  "anat": int, # 0..8 (Lungs & Airways, Pleura, ... Others)
144
  "concept": int, # leaf concept id (clinical finding vocabulary)
145
- "severity": int, # 0 = Minor, 1 = Critical
146
  "ref_seg_idx": int, # -1 = NULL_REF, otherwise sentence index in reference report
147
  "cand_seg_idx": int, # -1 = NULL_CAND, otherwise sentence index in candidate report
148
  }
149
  ```
150
 
151
- `cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`. Then `K_critical = sum(severity == 1)`, `K_minor = sum(severity == 0)`, and `score = exp(−(K_critical + 0.25 × K_minor))`.
152
 
153
  ## Training data
154
 
@@ -170,7 +172,8 @@ Supervised teacher-forced training on the LLM-labeled error sequences:
170
 
171
  - **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
172
  - **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
173
- - **Severity loss** on the `severity` head (Critical / Minor; added on the radiologist-labeled validation subset)
 
174
 
175
  Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the chest2vec contrastive adapter; both adapters compose at inference).
176
 
@@ -182,6 +185,7 @@ Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the
182
 
183
  ## Limitations
184
 
 
185
  - **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference (use `chest2vec/candidate_only` for that case).
186
  - **English only.** Trained on English chest CT reports from CT-RATE.
187
  - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
 
30
  ## The chest2err-score
31
 
32
  ```
33
+ chest2err_score = exp(−K_total)
 
34
  ```
35
 
36
+ where `K_total` is the total number of error tuples emitted by the decoder.
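
As a quick numerical check of this mapping, the sketch below evaluates `exp(−K_total)` for the counts listed in the interpretation table. The function name `chest2err_score` is a hypothetical helper used only for illustration; it is not taken from a released API.

```python
import math


def chest2err_score(k_total: int) -> float:
    """Map a non-negative error count to a score in (0, 1]."""
    if k_total < 0:
        raise ValueError("k_total must be non-negative")
    return math.exp(-k_total)


# Evaluate the score at the error counts used in the interpretation table.
for k in (0, 1, 2, 3, 5):
    print(f"K_total={k}  score={chest2err_score(k):.4f}")
```

Because the exponential decays quickly, a single additional error removes roughly 63% of the remaining score, which is why counts of 5 or more fall below 0.01.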
37
 
38
+ | chest2err-score | K_total | interpretation |
39
  |---:|---:|---|
40
+ | **1.00** | 0 | perfect: no errors detected |
41
+ | 0.37 | 1 | one error |
42
+ | 0.14 | 2 | two errors |
 
43
  | 0.05 | 3 | substantial errors |
44
  | < 0.01 | ≥ 5 | severely degraded |
45
 
46
  Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
47
 
48
+ The score is rank-equivalent to `−K_total`, so all Kendall τ_b benchmarks transfer unchanged from the count form.
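
The rank-equivalence claim can be illustrated with a small, self-contained sketch. It uses a plain O(n²) Kendall τ_b written for this example (not the benchmark's evaluation code), and the predicted and gold counts are made-up toy values. Since `exp` is strictly increasing, the score and `−K_total` order every pair identically and share the same ties, so both give the same τ_b.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b with tie correction (simple O(n^2) reference version)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data (hypothetical): predicted error counts and gold error counts.
pred_k = [0, 1, 1, 3, 5, 2]
gold_k = [0, 1, 2, 2, 6, 1]

neg_k = [-k for k in pred_k]              # count form
scores = [math.exp(-k) for k in pred_k]   # score form

tau_from_counts = kendall_tau_b(neg_k, gold_k)
tau_from_scores = kendall_tau_b(scores, gold_k)
print(tau_from_counts, tau_from_scores)
```

The two printed values coincide because the concordant, discordant, and tie counts are identical for the two inputs.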
49
+
50
+ > **Note on severity weighting.** The decoder also emits a `severity ∈ {Minor, Critical}` field per error tuple. However, the LLM-generated training corpus does **not** include severity labels (only the 200-variant radiologist-labeled validation slice does), so the severity head is **not currently reliably trained**. Until a severity-labeled training set is released, the canonical chest2err-score uses **`K_total` directly** (every emitted error weighted equally). A severity-weighted variant of the form `K_w = K_critical + 0.25 × K_minor` will become the recommended formulation once the severity head is properly fine-tuned.
51
 
52
  ## Headline metrics
53
 
 
55
 
56
  | metric | value |
57
  |---|---|
 
58
  | Kendall τ_b vs total errors | +0.665 |
59
+ | **Kendall τ_b vs Critical errors** | **+0.763** |
60
  | Kendall τ_b vs severity-weighted | +0.734 |
61
  | **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
62
  | Critical-error AUROC | 0.963 |
63
  | MAE of K_total | 1.12 |
64
  | **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
65
 
66
+ The Critical and severity-weighted τ_b numbers are computed using the **radiologist's severity labels** in the gold set (not the model's severity output). They show that the predicted K_total correlates strongly with the human Critical-error count even without explicit severity supervision; once a severity-labeled training corpus is added, these numbers should improve further.
67
+
68
  For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.
69
 
70
  ### CXR/CT generalization
 
137
 
138
  ## Output schema
139
 
140
+ The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total)` as above). The score is backed by a sequence of structured error tuples; each generated tuple is:
141
 
142
  ```python
143
  {
144
  "cat": int, # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
145
  "anat": int, # 0..8 (Lungs & Airways, Pleura, ... Others)
146
  "concept": int, # leaf concept id (clinical finding vocabulary)
147
+ "severity": int, # 0 = Minor, 1 = Critical (not reliably trained in v0.1; see severity-weighting note above)
148
  "ref_seg_idx": int, # -1 = NULL_REF, otherwise sentence index in reference report
149
  "cand_seg_idx": int, # -1 = NULL_CAND, otherwise sentence index in candidate report
150
  }
151
  ```
152
 
153
+ `cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`, and `chest2err_score = exp(−K_total)`.
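
The counting rule can be sketched as follows. This is a minimal illustration, assuming the decoded sequence is a list of dicts in the schema above that ends with a single EOS tuple; `score_from_tuples` and the example field values are hypothetical, and the exact contents of the EOS tuple beyond `cat == 0` are an assumption.

```python
import math

EOS_CAT = 0  # `cat == 0` marks end of sequence, per the schema above


def score_from_tuples(tuples):
    """Return (K_total, score), counting error tuples before the first EOS marker."""
    k_total = 0
    for t in tuples:
        if t["cat"] == EOS_CAT:
            break  # stop at EOS; anything after it is ignored
        k_total += 1
    return k_total, math.exp(-k_total)


# Hypothetical decoder output: two error tuples followed by the EOS marker.
decoded = [
    {"cat": 2, "anat": 0, "concept": 17, "severity": 0, "ref_seg_idx": 3, "cand_seg_idx": -1},
    {"cat": 1, "anat": 1, "concept": 42, "severity": 1, "ref_seg_idx": -1, "cand_seg_idx": 5},
    {"cat": EOS_CAT},  # assumed minimal EOS tuple
]

k_total, score = score_from_tuples(decoded)
print(k_total, round(score, 2))
```

When the sequence ends with exactly one EOS tuple, the count returned here equals `len(tuples) − 1`. The `severity` field is deliberately ignored, in line with the v0.1 recommendation.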
154
 
155
  ## Training data
156
 
 
172
 
173
  - **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
174
  - **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
175
+
176
+ Note: a `severity` head exists in the architecture but is **not reliably trained in v0.1**: GPT-4o-mini's variant labels don't include Critical/Minor severity, and the 200-row radiologist subset is too small a signal on its own. Severity output is therefore not part of the canonical chest2err-score in this release. Adding a severity-labeled training set is the headline item on the roadmap.
177
 
178
  Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the chest2vec contrastive adapter; both adapters compose at inference).
179
 
 
185
 
186
  ## Limitations
187
 
188
+ - **Severity output not reliable in v0.1.** The decoder emits a Critical / Minor severity per error tuple, but its training signal is too thin (GPT-4o-mini's variant labels don't include severity). Use the canonical `chest2err_score = exp(−K_total)` and ignore the severity field until a severity-labeled training set is released.
189
  - **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference (use `chest2vec/candidate_only` for that case).
190
  - **English only.** Trained on English chest CT reports from CT-RATE.
191
  - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.