Soften chest2err-score: add display temperature tau=3.0 (default)
exp(-K_total/tau) with tau=3.0; one error -> 0.72 instead of 0.37. Rank-equivalent, so Kendall tau_b benchmarks unchanged. tau=1.0 recovers original scale.
- README.md +12 -12
- chest2err.py +5 -1
- chest2err_config.json +1 -0
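As a quick check of the numbers in the commit message, the sketch below evaluates exp(-K_total/tau) for a few error counts at tau=3.0 and tau=1.0. The helper name `score_from_count` and its input validation are illustrative assumptions, not code from `chest2err.py`.

```python
import math


def score_from_count(k_total: int, tau: float = 3.0) -> float:
    """Map an error count to a score in (0, 1] via exp(-K_total / tau).

    Hypothetical helper for illustration; the validation is an assumption.
    """
    if k_total < 0:
        raise ValueError("k_total must be non-negative")
    if tau <= 0:
        raise ValueError("tau must be positive")
    return math.exp(-k_total / tau)


for k in (0, 1, 2, 3, 5):
    new = score_from_count(k)           # tau = 3.0, the new default
    old = score_from_count(k, tau=1.0)  # original exp(-K_total) scale
    print(f"K_total={k}: tau=3.0 -> {new:.2f}, tau=1.0 -> {old:.2f}")
```

With one error this gives 0.72 at tau=3.0 and 0.37 at tau=1.0, matching the figures quoted in the commit message.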
README.md
CHANGED
@@ -18,7 +18,7 @@ pipeline_tag: text-classification
  # chest2err - Sentence-grounded Error Score for Chest CT Reports

- **chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means the candidate report is perfect; 0.

  The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.

@@ -29,22 +29,22 @@ Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datas
  ## The chest2err-score

  ```
- chest2err_score = exp(−K_total)
  ```

- where `K_total` is the total number of error tuples emitted by the decoder.

  | chest2err-score | K_total | interpretation |
  |---:|---:|---|
  | **1.00** | 0 | perfect - no errors detected |
- | 0.
- | 0.
- | 0.
- |

  Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**

- The score is rank-equivalent to `−K_total`, so all Kendall τ_b benchmarks transfer unchanged from the count form.

  ## Headline metrics

@@ -111,7 +111,7 @@ ref = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
  cand = "[Lungs] Several pulmonary nodules in the left upper lobe."

  score = chest2err_score(ref, cand)
- # 0.

  detail = chest2err_detail(ref, cand)
  # detail["score"] - chest2err-score in (0, 1]

@@ -125,7 +125,7 @@ The loader picks up the bundled weights automatically; no extra setup beyond `pi

  ## Output schema

- The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total)` as above). The score is backed by a sequence of structured error tuples:

  ```python
  {

@@ -137,7 +137,7 @@ The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−
  }
  ```

- `cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`, and `chest2err_score = exp(−K_total)`.

  ## Training data

@@ -170,7 +170,7 @@ Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive

  ## Limitations

- - **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total)` treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
  - **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
  - **English only.** Trained on English chest CT reports from CT-RATE.
  - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.

@@ -18,7 +18,7 @@
  # chest2err - Sentence-grounded Error Score for Chest CT Reports

+ **chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means the candidate report is perfect; 0.72 means one error; below 0.20 means substantial errors.

  The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.

@@ -29,22 +29,22 @@
  ## The chest2err-score

  ```
+ chest2err_score = exp(−K_total / τ)   # τ = 3.0 (default)
  ```

+ where `K_total` is the total number of error tuples emitted by the decoder and `τ` is a display temperature (`score_temperature` in `chest2err_config.json`).

  | chest2err-score | K_total | interpretation |
  |---:|---:|---|
  | **1.00** | 0 | perfect - no errors detected |
+ | 0.72 | 1 | one error |
+ | 0.51 | 2 | two errors |
+ | 0.37 | 3 | substantial errors |
+ | 0.19 | ≥ 5 | severely degraded |

  Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**

+ The temperature `τ` only rescales the displayed number for human readability - a single error no longer collapses the score. Set `τ=1.0` to recover the original `exp(−K_total)` scale (1 → 0.37, 2 → 0.14). Because `exp(−K_total/τ)` is a strictly monotone function of `K_total` for any `τ>0`, the score is **rank-equivalent to `−K_total`**, so all Kendall τ_b benchmarks transfer unchanged from the count form regardless of `τ`.

  ## Headline metrics

@@ -111,7 +111,7 @@
  cand = "[Lungs] Several pulmonary nodules in the left upper lobe."

  score = chest2err_score(ref, cand)
+ # 0.37 - substantial errors (K_total = 3, τ = 3.0)

  detail = chest2err_detail(ref, cand)
  # detail["score"] - chest2err-score in (0, 1]

@@ -125,7 +125,7 @@

  ## Output schema

+ The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total / τ)` with `τ = 3.0` as above). The score is backed by a sequence of structured error tuples:

  ```python
  {

@@ -137,7 +137,7 @@
  }
  ```

+ `cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`, and `chest2err_score = exp(−K_total / τ)` with `τ = 3.0`.

  ## Training data

@@ -170,7 +170,7 @@

  ## Limitations

+ - **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total / τ)` (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
  - **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
  - **English only.** Trained on English chest CT reports from CT-RATE.
  - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.

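The rank-equivalence claim in the README change can be checked without a Kendall tau_b implementation: tau_b depends only on which pairs are concordant, discordant, or tied, and a strictly decreasing map of K_total preserves all three. A minimal sketch, using made-up error counts and hypothetical helper names:

```python
import math
from itertools import combinations

# Hypothetical K_total values for seven candidate reports (includes one tie).
counts = [0, 3, 1, 5, 2, 2, 7]


def scores(ks, tau):
    """Apply exp(-K_total / tau) to every count."""
    return [math.exp(-k / tau) for k in ks]


def pair_orderings(values):
    """Sign of (a - b) for every pair: +1, 0, or -1."""
    return [(a > b) - (a < b) for a, b in combinations(values, 2)]


# Ordering induced by -K_total, the count form of the metric.
reference = pair_orderings([-k for k in counts])

# Every positive temperature should induce exactly the same pair orderings.
same_order = all(
    pair_orderings(scores(counts, tau)) == reference
    for tau in (0.5, 1.0, 3.0, 10.0)
)
print(same_order)
```

Since every pair keeps its sign (ties stay ties), any rank correlation computed against human judgments is identical for all four temperatures in this example.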
chest2err.py
CHANGED
@@ -50,6 +50,10 @@ class Chest2Err:
          self.cfg = cfg
          self.device = device
          self.max_length = cfg["max_length"]

          # Concept vocab (size determines decoder output head dim)
          with open(PACKAGE_DIR / "concept2id.json") as f:

@@ -114,7 +118,7 @@ class Chest2Err:
          )
          seq = seqs[0]
          K_total = len(seq)
-         score = math.exp(-K_total)

          cat_counts = [0] * self.cfg["n_cat"]
          anat_counts = [0] * self.cfg["n_anat"]

@@ -50,6 +50,10 @@
          self.cfg = cfg
          self.device = device
          self.max_length = cfg["max_length"]
+         # Display temperature τ for the score exp(-K_total/τ). τ=3.0 is the
+         # default gentle setting (one error → 0.72); τ=1.0 reproduces the
+         # original exp(-K_total). Rank-equivalent, so τ never affects τ_b.
+         self.score_temperature = float(cfg.get("score_temperature", 3.0))

          # Concept vocab (size determines decoder output head dim)
          with open(PACKAGE_DIR / "concept2id.json") as f:

@@ -114,7 +118,7 @@
          )
          seq = seqs[0]
          K_total = len(seq)
+         score = math.exp(-K_total / self.score_temperature)

          cat_counts = [0] * self.cfg["n_cat"]
          anat_counts = [0] * self.cfg["n_anat"]

chest2err_config.json
CHANGED
@@ -11,5 +11,6 @@
    "decoder_ff": 2048,
    "decoder_dropout": 0.1,
    "max_decode_steps": 24,
    "input_template": "[REF] {reference_report}\n\n[PRED] {candidate_report}"
  }

@@ -11,5 +11,6 @@
    "decoder_ff": 2048,
    "decoder_dropout": 0.1,
    "max_decode_steps": 24,
+   "score_temperature": 3.0,
    "input_template": "[REF] {reference_report}\n\n[PRED] {candidate_report}"
  }
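Because `chest2err.py` reads the new key with `cfg.get("score_temperature", 3.0)`, a config file written before this commit should still load and pick up the new default. The sketch below illustrates that fallback on inline JSON strings; `load_score_temperature` and the trimmed configs are assumptions for illustration, not part of the package.

```python
import json


def load_score_temperature(config_text: str) -> float:
    """Read the display temperature, falling back to 3.0 when the key is absent.

    Hypothetical helper mirroring the cfg.get(...) call in the diff above.
    """
    cfg = json.loads(config_text)
    return float(cfg.get("score_temperature", 3.0))


# Trimmed, made-up configs: with the new key, without it, and set to the old scale.
new_cfg = '{"max_decode_steps": 24, "score_temperature": 3.0}'
old_cfg = '{"max_decode_steps": 24}'
legacy_scale_cfg = '{"max_decode_steps": 24, "score_temperature": 1.0}'

for name, text in [("new", new_cfg), ("old", old_cfg), ("legacy", legacy_scale_cfg)]:
    print(name, load_score_temperature(text))
```

Setting `"score_temperature": 1.0` is the route the commit message describes for recovering the original `exp(-K_total)` scale.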