Honest fix: chest2err-score uses K_total (severity head not reliably trained in v0.1)
README.md
CHANGED
|
@@ -30,24 +30,24 @@ Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datas
|
|
| 30 |
## The chest2err-score
|
| 31 |
|
| 32 |
```
|
| 33 |
-
chest2err_score = exp(−
|
| 34 |
-
K_w = K_critical + 0.25 × K_minor
|
| 35 |
```
|
| 36 |
|
| 37 |
-
where `
|
| 38 |
|
| 39 |
-
| chest2err-score |
|
| 40 |
|---:|---:|---|
|
| 41 |
-
| **1.00** | 0 | perfect (no errors) |
|
| 42 |
-
| 0.
|
| 43 |
-
| 0.
|
| 44 |
-
| 0.14 | 2 | two Critical (or 1 Critical + 4 Minor) |
|
| 45 |
| 0.05 | 3 | substantial errors |
|
| 46 |
| < 0.01 | ≥ 5 | severely degraded |
|
| 47 |
|
| 48 |
Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
|
| 49 |
|
| 50 |
-
The score is rank-equivalent to `−
|
|
|
|
|
|
|
| 51 |
|
| 52 |
## Headline metrics
|
| 53 |
|
|
@@ -55,14 +55,16 @@ Evaluated on the 400-pair `chest2error-bench` gold set:
|
|
| 55 |
|
| 56 |
| metric | value |
|
| 57 |
|---|---|
|
| 58 |
-
| **Kendall τ_b vs Critical errors** | **+0.763** |
|
| 59 |
| Kendall τ_b vs total errors | +0.665 |
|
|
|
|
| 60 |
| Kendall τ_b vs severity-weighted | +0.734 |
|
| 61 |
| **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
|
| 62 |
| Critical-error AUROC | 0.963 |
|
| 63 |
| MAE of K_total | 1.12 |
|
| 64 |
| **chest2err-score on GT-S ≡ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
|
| 65 |
|
|
|
|
|
|
|
| 66 |
For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.
|
| 67 |
|
| 68 |
### CXR/CT generalization
|
|
@@ -135,20 +137,20 @@ A self-contained HF `from_pretrained` loader is on the roadmap. Until then, infe
|
|
| 135 |
|
| 136 |
## Output schema
|
| 137 |
|
| 138 |
-
The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−
|
| 139 |
|
| 140 |
```python
|
| 141 |
{
|
| 142 |
"cat": int, # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
|
| 143 |
"anat": int, # 0..8 (Lungs & Airways, Pleura, ... Others)
|
| 144 |
"concept": int, # leaf concept id (clinical finding vocabulary)
|
| 145 |
-
"severity": int, # 0 = Minor, 1 = Critical
|
| 146 |
"ref_seg_idx": int, # -1 = NULL_REF, otherwise sentence index in reference report
|
| 147 |
"cand_seg_idx": int, # -1 = NULL_CAND, otherwise sentence index in candidate report
|
| 148 |
}
|
| 149 |
```
|
| 150 |
|
| 151 |
-
`cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`
|
| 152 |
|
| 153 |
## Training data
|
| 154 |
|
|
@@ -170,7 +172,8 @@ Supervised teacher-forced training on the LLM-labeled error sequences:
|
|
| 170 |
|
| 171 |
- **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
|
| 172 |
- **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
|
| 173 |
-
|
|
|
|
| 174 |
|
| 175 |
Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the chest2vec contrastive adapter; both adapters compose at inference).
|
| 176 |
|
|
@@ -182,6 +185,7 @@ Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the
|
|
| 182 |
|
| 183 |
## Limitations
|
| 184 |
|
|
|
|
| 185 |
- **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference (use `chest2vec/candidate_only` for that case).
|
| 186 |
- **English only.** Trained on English chest CT reports from CT-RATE.
|
| 187 |
- **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
|
|
|
|
| 30 |
## The chest2err-score
|
| 31 |
|
| 32 |
```
|
| 33 |
+
chest2err_score = exp(−K_total)
|
|
|
|
| 34 |
```
|
| 35 |
|
| 36 |
+
where `K_total` is the total number of error tuples emitted by the decoder.
|
| 37 |
|
| 38 |
+
| chest2err-score | K_total | interpretation |
|
| 39 |
|---:|---:|---|
|
| 40 |
+
| **1.00** | 0 | perfect (no errors detected) |
|
| 41 |
+
| 0.37 | 1 | one error |
|
| 42 |
+
| 0.14 | 2 | two errors |
|
|
|
|
| 43 |
| 0.05 | 3 | substantial errors |
|
| 44 |
| < 0.01 | ≥ 5 | severely degraded |
|
| 45 |
|
| 46 |
Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
|
| 47 |
|
| 48 |
+
The score is rank-equivalent to `−K_total`, so all Kendall τ_b benchmarks transfer unchanged from the count form.
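
A minimal sketch of the mapping and of the rank-equivalence claim, in plain Python. The helper name `chest2err_score` is chosen here for illustration and is not a published API; the example counts are made up.

```python
import math

def chest2err_score(k_total: int) -> float:
    """Map an error count to a score in (0, 1] via exp(-K_total)."""
    if k_total < 0:
        raise ValueError("k_total must be non-negative")
    return math.exp(-k_total)

# Reproduce the first rows of the interpretation table (two decimals).
table = {k: round(chest2err_score(k), 2) for k in range(4)}

# exp(-K) is strictly decreasing in K, so sorting by score (descending)
# yields the same order as sorting by K_total (ascending).
counts = [3, 0, 5, 1, 2]  # illustrative counts, not benchmark data
by_score = sorted(counts, key=chest2err_score, reverse=True)
by_count = sorted(counts)
```

Because the transform is strictly monotone, it preserves both the ordering and the ties of `K_total`, which is why a rank statistic such as Kendall τ_b is unaffected.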
|
| 49 |
+
|
| 50 |
+
> **Note on severity weighting.** The decoder also emits a `severity ∈ {Minor, Critical}` field per error tuple. However, the LLM-generated training corpus does **not** include severity labels (only the 200-variant radiologist-labeled validation slice does), so the severity head is **not yet reliably trained**. Until a severity-labeled training set is released, the canonical chest2err-score uses **`K_total` directly** (every emitted error weighted equally). A severity-weighted variant of the form `K_w = K_critical + 0.25 × K_minor` will become the recommended formulation once the severity head is properly fine-tuned.
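
For reference, the severity-weighted form in the note could be sketched as follows. This assumes the variant keeps the same exponential shape, `exp(−K_w)`, and uses the `0 = Minor, 1 = Critical` encoding from the output schema; the function name is hypothetical and this is not the recommended score in v0.1.

```python
import math

MINOR_WEIGHT = 0.25  # weight on Minor errors, taken from the K_w formula

def severity_weighted_score(severities: list[int]) -> float:
    """Return exp(-K_w) with K_w = K_critical + 0.25 * K_minor.

    `severities` holds one entry per error tuple: 0 = Minor, 1 = Critical.
    Assumption: the weighted variant uses the same exp(-K) shape.
    """
    k_critical = sum(1 for s in severities if s == 1)
    k_minor = sum(1 for s in severities if s == 0)
    k_w = k_critical + MINOR_WEIGHT * k_minor
    return math.exp(-k_w)
```

Under this weighting, one Critical plus four Minor errors gives `K_w = 2`, the same score as two Critical errors.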
|
| 51 |
|
| 52 |
## Headline metrics
|
| 53 |
|
|
|
|
| 55 |
|
| 56 |
| metric | value |
|
| 57 |
|---|---|
|
|
|
|
| 58 |
| Kendall τ_b vs total errors | +0.665 |
|
| 59 |
+
| **Kendall τ_b vs Critical errors** | **+0.763** |
|
| 60 |
| Kendall τ_b vs severity-weighted | +0.734 |
|
| 61 |
| **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
|
| 62 |
| Critical-error AUROC | 0.963 |
|
| 63 |
| MAE of K_total | 1.12 |
|
| 64 |
| **chest2err-score on GT-S ≡ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
|
| 65 |
|
| 66 |
+
The Critical and severity-weighted τ_b numbers are computed using the **radiologist's severity labels** in the gold set (not the model's severity output). They show that the predicted K_total correlates strongly with the human Critical-error count even without explicit severity supervision. Once a severity-labeled training corpus is added, we expect these numbers to improve further.
|
| 67 |
+
|
| 68 |
For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.
|
| 69 |
|
| 70 |
### CXR/CT generalization
|
|
|
|
| 137 |
|
| 138 |
## Output schema
|
| 139 |
|
| 140 |
+
The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total)` as above). The score is backed by a sequence of structured error tuples; each generated tuple is:
|
| 141 |
|
| 142 |
```python
|
| 143 |
{
|
| 144 |
"cat": int, # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
|
| 145 |
"anat": int, # 0..8 (Lungs & Airways, Pleura, ... Others)
|
| 146 |
"concept": int, # leaf concept id (clinical finding vocabulary)
|
| 147 |
+
"severity": int, # 0 = Minor, 1 = Critical (not reliably trained in v0.1; see severity-weighting note above)
|
| 148 |
"ref_seg_idx": int, # -1 = NULL_REF, otherwise sentence index in reference report
|
| 149 |
"cand_seg_idx": int, # -1 = NULL_CAND, otherwise sentence index in candidate report
|
| 150 |
}
|
| 151 |
```
|
| 152 |
|
| 153 |
+
`cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1` (the EOS tuple is not counted), and `chest2err_score = exp(−K_total)`.
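
The counting rule above can be sketched as follows. This is an illustration under assumptions, not the released inference code: `score_from_tuples` is a hypothetical helper, the decoded sequence is assumed to be a list of dicts ending with the EOS tuple, and the field values in the example are made up.

```python
import math

EOS_CAT = 0  # `cat == 0` marks the end of the sequence

def score_from_tuples(tuples: list[dict]) -> float:
    """Compute exp(-K_total) from a decoded tuple sequence.

    Assumes the sequence ends with the EOS tuple (cat == 0), so that
    K_total = len(tuples) - 1.
    """
    if not tuples or tuples[-1]["cat"] != EOS_CAT:
        raise ValueError("expected a sequence terminated by an EOS tuple")
    k_total = len(tuples) - 1
    return math.exp(-k_total)

# Hypothetical decoder output: one error tuple followed by EOS.
# Assumption: the EOS tuple may carry only the `cat` field.
decoded = [
    {"cat": 2, "anat": 0, "concept": 17, "severity": 0,
     "ref_seg_idx": 3, "cand_seg_idx": -1},
    {"cat": 0},
]
score = score_from_tuples(decoded)  # K_total = 1
```

The `severity` field is carried along but deliberately unused, matching the v0.1 guidance to ignore it.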
|
| 154 |
|
| 155 |
## Training data
|
| 156 |
|
|
|
|
| 172 |
|
| 173 |
- **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
|
| 174 |
- **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
|
| 175 |
+
|
| 176 |
+
Note: a `severity` head exists in the architecture but is **not reliably trained in v0.1**: GPT-4o-mini's variant labels don't include Critical/Minor severity, and the 200-row radiologist subset is too small a signal on its own. Severity output is therefore not part of the canonical chest2err-score in this release. Adding a severity-labeled training set is the headline item on the roadmap.
|
| 177 |
|
| 178 |
Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the chest2vec contrastive adapter; both adapters compose at inference).
|
| 179 |
|
|
|
|
| 185 |
|
| 186 |
## Limitations
|
| 187 |
|
| 188 |
+
- **Severity output not reliable in v0.1.** The decoder emits a Critical / Minor severity per error tuple, but its training signal is too thin (GPT-4o-mini's variant labels don't include severity). Use the canonical `chest2err_score = exp(−K_total)` and ignore the severity field until a severity-labeled training set is released.
|
| 189 |
- **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference (use `chest2vec/candidate_only` for that case).
|
| 190 |
- **English only.** Trained on English chest CT reports from CT-RATE.
|
| 191 |
- **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
|