---
license: cc-by-nc-4.0
language:
- en
library_name: pytorch
tags:
- radiology
- chest-ct
- report-evaluation
- score
- medical
- rexval
datasets:
- chest2vec/chest2error-bench
base_model: chest2vec/chest2vec_0.6b
pipeline_tag: text-classification
---

# chest2err: Sentence-grounded Error Score for Chest CT Reports

**chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means no errors were detected in the candidate; 0.72 means one error; 0.19 or below means five or more errors.

The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.

Built on the [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) backbone with LoRA fine-tuning + a 4-layer Transformer decoder. **All backbone and decoder weights are bundled in this repository**, so no further downloads are required at inference time.

Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).

## The chest2err-score

```
chest2err_score = exp(−K_total / τ)        # τ = 3.0 (default)
```

where `K_total` is the total number of error tuples emitted by the decoder and `τ` is a display temperature (`score_temperature` in `chest2err_config.json`).

| chest2err-score | K_total | interpretation |
|---:|---:|---|
| **1.00** | 0 | perfect: no errors detected |
| 0.72 | 1 | one error |
| 0.51 | 2 | two errors |
| 0.37 | 3 | substantial errors |
| ≤ 0.19 | ≥ 5 | severely degraded |

Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**

The temperature `τ` only rescales the displayed number for human readability, so that a single error does not collapse the score. Set `τ=1.0` to recover the original `exp(−K_total)` scale (1 → 0.37, 2 → 0.14). Because `exp(−K_total/τ)` is a strictly decreasing function of `K_total` for any `τ>0`, the score is **rank-equivalent to `−K_total`**, so all Kendall τ_b benchmarks transfer unchanged from the count form regardless of `τ`.
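
As a quick check of the table and of the rank-equivalence claim, the score can be reproduced with standard-library Python. This is a minimal sketch of the formula above, not the bundled loader; the helper name `score_from_count` is hypothetical.

```python
import math

def score_from_count(k_total: int, tau: float = 3.0) -> float:
    """chest2err-score for a given error count: exp(-K_total / tau)."""
    if k_total < 0 or tau <= 0:
        raise ValueError("k_total must be >= 0 and tau must be > 0")
    return math.exp(-k_total / tau)

# Reproduce the table rows for the default tau = 3.0.
table = {k: round(score_from_count(k), 2) for k in (0, 1, 2, 3, 5)}
print(table)

# For any tau > 0 the score is strictly decreasing in K_total, so sorting
# by descending score gives the same order as sorting by ascending count.
counts = [4, 0, 2, 7, 1]
for tau in (1.0, 3.0, 10.0):
    by_score = sorted(counts, key=lambda k: -score_from_count(k, tau))
    assert by_score == sorted(counts)
```

Changing `tau` moves the displayed values but never reorders reports, which is why rank-based benchmarks do not depend on it.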

## Headline metrics

Evaluated on the 400-pair `chest2error-bench` gold set:

| metric | value |
|---|---|
| Kendall τ_b vs total errors | +0.665 |
| **Kendall τ_b vs Critical errors** (radiologist labels) | **+0.763** |
| Kendall τ_b vs severity-weighted errors (radiologist labels) | +0.734 |
| **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
| Critical-error AUROC | 0.963 |
| MAE of K_total | 1.12 |
| **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |

The τ_b numbers against Critical / severity-weighted errors use the **radiologist's** severity labels in the gold set (the model itself does not output severity in v0.1; see Limitations). They demonstrate that the predicted `K_total` correlates strongly with the human Critical-error count even without an explicit severity head.
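
For readers who want to recompute the correlations, the sketch below implements Kendall τ_b (the tie-corrected variant) in plain Python and checks that the count form and the score form give the same value. The function name `kendall_tau_b` and the toy data are hypothetical; they are not taken from the benchmark or from the repository's evaluation code.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall tau-b with tie correction (simple O(n^2) pairwise version)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied in both variables: ignored
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x_only)
                      * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")

pred_k = [0, 1, 1, 3, 5, 2]   # hypothetical predicted K_total values
gold   = [0, 0, 1, 2, 4, 2]   # hypothetical radiologist error counts

tau_count = kendall_tau_b(pred_k, gold)
# Negating the score restores the "more errors = larger value" direction,
# so the correlation matches the count form for any display temperature.
for temp in (1.0, 3.0):
    neg_scores = [-math.exp(-k / temp) for k in pred_k]
    assert math.isclose(kendall_tau_b(neg_scores, gold), tau_count)
```

For the 400-pair gold set the quadratic loop is fast enough; a library routine such as SciPy's `kendalltau` would be the usual choice for larger sets.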

For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.

### CXR/CT generalization

| corpus | τ_b vs Critical |
|---|---|
| ReXVal (CXR, n=200) | +0.682 |
| Chest CT (this benchmark, n=400) | **+0.763** |

Most prior metrics lose 0.4–0.7 τ_b crossing from CXR to CT. chest2err is the only metric that *gains* on CT, because it was trained on CT.

## Architecture

| component | spec |
|---|---|
| Backbone | [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) (596 M params, bf16), fully merged into this repo |
| chest2err LoRA | rank 32, α 64, dropout 0.05; merged into the backbone weights shipped here |
| Decoder | 4-layer Transformer, 8 heads, FFN 2048 |
| Max decode steps | 24 (hard cap; suffices for max-K=17 observed in radiologist gold) |
| Output tuple | `(cat 1-5, anat 0-8, concept, ref_seg_idx, cand_seg_idx)` |
| Pooling | mean-pool tokens within each sentence; prepend learnable NULL_REF and NULL_CAND vectors per side |

The decoder **cross-attends** over the concatenated reference + candidate sentence-pool memory `M`. At each step it predicts a tuple, where `cat = 0` is the EOS token. The error count is `len(seq) − 1`, since the emitted sequence ends with the EOS tuple.

Mean-pooling sentences before the decoder makes the encoder **paraphrase-robust** (inherits chest2vec's contrastive properties) and the decoder **permutation-invariant** with respect to sentence order.
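
The pooling row of the table can be illustrated with a small pure-Python sketch. It is not the code in `chest2err_modeling.py`: the function names are hypothetical, plain lists stand in for tensors, and the zero vectors used for NULL_REF and NULL_CAND are placeholders for parameters that are learned in the real model.

```python
from typing import List, Sequence

Vector = List[float]

def mean_pool_sentences(token_vecs: Sequence[Vector],
                        sentence_ids: Sequence[int]) -> List[Vector]:
    """Average token vectors per sentence; ids must cover 0..S-1."""
    n_sent = max(sentence_ids) + 1
    dim = len(token_vecs[0])
    sums = [[0.0] * dim for _ in range(n_sent)]
    counts = [0] * n_sent
    for vec, sid in zip(token_vecs, sentence_ids):
        counts[sid] += 1
        for j, value in enumerate(vec):
            sums[sid][j] += value
    return [[s / c for s in row] for row, c in zip(sums, counts)]

def build_memory(ref_sents: List[Vector], cand_sents: List[Vector],
                 null_ref: Vector, null_cand: Vector) -> List[Vector]:
    """Prepend one NULL slot per side, then concatenate both sides."""
    return [null_ref] + ref_sents + [null_cand] + cand_sents

# Toy 2-dimensional token embeddings (hypothetical values).
ref_tokens, ref_ids = [[1.0, 3.0], [3.0, 5.0], [10.0, 0.0]], [0, 0, 1]
cand_tokens, cand_ids = [[2.0, 2.0]], [0]

ref_pooled = mean_pool_sentences(ref_tokens, ref_ids)
cand_pooled = mean_pool_sentences(cand_tokens, cand_ids)
memory = build_memory(ref_pooled, cand_pooled, [0.0, 0.0], [0.0, 0.0])
print(len(memory))  # 2 NULL slots + 2 reference + 1 candidate sentence
```

The NULL slots give the pointer heads something to select when an error has no counterpart sentence on one side, which matches the `-1` index convention in the output schema.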

## Files

| file | size | purpose |
|---|---|---|
| `model.safetensors` | ~1.1 GB | merged backbone weights (chest2vec_0.6b + chest2err LoRA, fused) |
| `config.json` | <1 KB | backbone architecture config |
| `decoder.safetensors` | ~207 MB | decoder + null embeddings + heads |
| `chest2err_modeling.py` | 14 KB | decoder architecture (the `CADAD` class) |
| `chest2err.py` | 6 KB | self-contained loader (`chest2err_score`, `chest2err_detail`) |
| `chest2err_config.json` | <1 KB | chest2err model meta-config |
| `tokenizer.json`, `vocab.json`, etc. | ~14 MB | tokenizer files |

Total: ~1.36 GB. Everything required to run chest2err is in this repository.

## Quick start

```python
from chest2err import chest2err_score, chest2err_detail

ref  = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
cand = "[Lungs] Several pulmonary nodules in the left upper lobe."

score = chest2err_score(ref, cand)
# 0.37: substantial errors (K_total = 3, τ = 3.0)

detail = chest2err_detail(ref, cand)
# detail["score"]           : chest2err-score in (0, 1]
# detail["K_total"]         : integer total error count
# detail["tuples"]          : list of {cat, anat, ref_seg_idx, cand_seg_idx, ...}
# detail["category_counts"] : per-category breakdown
# detail["anatomy_counts"]  : per-anatomy breakdown
```

The loader picks up the bundled weights automatically; no extra setup beyond `pip install transformers torch peft safetensors` is needed.

## Output schema

The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total / τ)` with `τ = 3.0` as above). The score is backed by a sequence of structured error tuples:

```python
{
    "cat":          int,  # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
    "anat":         int,  # 0..8 (Lungs & Airways, Pleura, ... Others)
    "concept":      int,  # leaf concept id (clinical finding vocabulary)
    "ref_seg_idx":  int,  # -1 = NULL_REF, otherwise sentence index in reference report
    "cand_seg_idx": int,  # -1 = NULL_CAND, otherwise sentence index in candidate report
}
```

`cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1` (the emitted sequence ends with the EOS tuple), and `chest2err_score = exp(−K_total / τ)` with `τ = 3.0`.
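
To make the relation between the tuple sequence and the score concrete, here is a minimal sketch that aggregates a decoded sequence by hand. The helper `summarize` and the example tuples are hypothetical, and the id-to-name mapping assumes that ids 1..5 follow the order listed in the schema comment; the bundled `chest2err_detail` may organize its output differently.

```python
import math
from collections import Counter

# Assumption: ids 1..5 follow the order given in the schema comment.
CATEGORY_NAMES = {1: "false_prediction", 2: "omission", 3: "location",
                  4: "severity", 5: "comparison"}

def summarize(seq, tau=3.0):
    """Aggregate a decoded sequence that ends with an EOS tuple (cat == 0)."""
    errors = []
    for tup in seq:
        if tup["cat"] == 0:      # EOS: stop reading
            break
        errors.append(tup)
    k_total = len(errors)
    return {
        "K_total": k_total,
        "score": math.exp(-k_total / tau),
        "category_counts": dict(Counter(CATEGORY_NAMES[t["cat"]] for t in errors)),
        "anatomy_counts": dict(Counter(t["anat"] for t in errors)),
    }

# Hypothetical decoder output: two errors followed by EOS.
seq = [
    {"cat": 1, "anat": 0, "concept": 12, "ref_seg_idx": 0, "cand_seg_idx": 0},
    {"cat": 2, "anat": 1, "concept": 40, "ref_seg_idx": 1, "cand_seg_idx": -1},
    {"cat": 0, "anat": 0, "concept": 0, "ref_seg_idx": -1, "cand_seg_idx": -1},
]
summary = summarize(seq)
print(summary["K_total"], round(summary["score"], 2))
```

In the second tuple, `cand_seg_idx = -1` points at NULL_CAND: an omission has a source sentence in the reference but no matching sentence in the candidate.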

## Training data

Trained on `chest2vec/chest2err-train` (in preparation): **53,881 (reference, candidate, labeled_errors)** triples spanning 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (radiologist gold).

### Variant generation (LLM-injected errors)

Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus. For each reference report we prompted **GPT-4o-mini** to produce four candidate variants that **deliberately insert a controlled number of errors** drawn from the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) 6-category error taxonomy. The LLM was instructed to also output, for every inserted error, a structured label:

- **error category** (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison)
- **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
- **target finding concept** (leaf finding from the chest CT vocabulary)

Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model is supervised to *reproduce* this structured error trace given only the (reference, candidate) input.
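
One way to picture such a record is the sketch below. The class and field names are hypothetical, as are the example texts and concept strings. The 6-to-5 mapping is an assumption: it supposes that the two comparison categories collapse into the single `comparison` output category of the schema, which this card does not spell out.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: categories 5 (spurious_comparison) and 6 (omitted_comparison)
# both map to output category 5 ("comparison"); 1..4 are unchanged.
MERGE_6_TO_5 = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 5}

@dataclass
class ErrorLabel:
    category: int   # 1..6, taxonomy used when injecting errors
    anatomy: str    # one of the nine anatomy sections
    concept: str    # leaf finding concept

@dataclass
class TrainingExample:
    reference: str
    candidate: str
    errors: List[ErrorLabel] = field(default_factory=list)

    def target_categories(self) -> List[int]:
        """Per-error category targets in the merged 1..5 scheme."""
        return [MERGE_6_TO_5[e.category] for e in self.errors]

# Hypothetical example, not drawn from the training set.
example = TrainingExample(
    reference="[Lungs] No pulmonary nodules. [Pleura] No effusion.",
    candidate="[Lungs] Several pulmonary nodules. [Pleura] No effusion, unchanged.",
    errors=[
        ErrorLabel(category=1, anatomy="Lungs & Airways", concept="pulmonary nodule"),
        ErrorLabel(category=5, anatomy="Pleura", concept="pleural effusion"),
    ],
)
print(example.target_categories(), len(example.errors))
```

The number of labeled errors in a record is the training-time `K_total` for that pair.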

### Training objective

Supervised teacher-forced training on the LLM-labeled error sequences:

- **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
- **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)

Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive adapter and the chest2err LoRA are merged into the bundled weights here.
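
The teacher-forced objective described above can be sketched in plain Python as a sum of cross-entropy terms over the heads at each decoder step. This is an illustration under stated assumptions, not the training code: the unweighted sum, the averaging over steps, and the convention that pointer targets are already shifted so that index 0 means the NULL slot are all assumptions.

```python
import math
from typing import Dict, List, Sequence

HEADS = ("cat", "anat", "concept", "ref_seg_idx", "cand_seg_idx")

def cross_entropy(logits: Sequence[float], target: int) -> float:
    """-log softmax(logits)[target], using the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def sequence_loss(step_logits: List[Dict[str, Sequence[float]]],
                  step_targets: List[Dict[str, int]]) -> float:
    """Mean over decoder steps of the summed per-head cross-entropies."""
    total = 0.0
    for logits, targets in zip(step_logits, step_targets):
        for head in HEADS:
            if head in targets:   # e.g. the EOS step may supervise only "cat"
                total += cross_entropy(logits[head], targets[head])
    return total / max(len(step_targets), 1)

# Toy example: one error step, then an EOS step that supervises only "cat".
uniform4 = [0.0, 0.0, 0.0, 0.0]
step_logits = [
    {"cat": uniform4, "anat": uniform4, "concept": uniform4,
     "ref_seg_idx": uniform4, "cand_seg_idx": uniform4},
    {"cat": uniform4},
]
step_targets = [
    {"cat": 1, "anat": 0, "concept": 2, "ref_seg_idx": 1, "cand_seg_idx": 0},
    {"cat": 0},
]
loss = sequence_loss(step_logits, step_targets)
print(round(loss, 4))
```

With uniform logits over four classes every term equals `log 4`, so the toy loss is six such terms averaged over two steps.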

### Why this works

- GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us **noiseless K** at training time.
- The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment** with τ_b vs Critical = +0.763.
- Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable**: every emitted error tuple cites its source sentences.

## Limitations

- **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total / τ)` (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
- **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
- **English only.** Trained on English chest CT reports from CT-RATE.
- **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
- **24-error hard cap.** Reports with > 24 errors are clipped (rare; max observed in gold = 17).
- **Single-radiologist gold.** Inter-rater calibration is in progress.

## Citations

If you use chest2err, please cite ReXVal (basis for the taxonomy and endpoint), CT-RATE (source of chest CT reports), and this model:

```bibtex
@misc{rexval2023,
  title     = {{ReXVal}: Radiologist-Verified Evaluation of Automated Radiology Report Metrics},
  author    = {Yu, F. and Endo, M. and Krishnan, R. and others},
  year      = {2023},
  publisher = {PhysioNet},
  url       = {https://physionet.org/content/rexval-dataset/1.0.0/}
}

@misc{hamamci2024ctrate,
  title         = {A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities},
  author        = {Hamamci, Ibrahim Ethem and Er, Sezgin and Almas, Furkan and others},
  year          = {2024},
  eprint        = {2403.17834},
  archivePrefix = {arXiv},
  url           = {https://huggingface.co/datasets/ibrahimhamamci/CT-RATE}
}

@misc{chest2err2026,
  title  = {chest2err: Sentence-grounded Error Score for Chest CT Reports},
  author = {chest2vec contributors},
  year   = {2026},
  url    = {https://huggingface.co/chest2vec/chest2err}
}
```

## Related

- **Backbone:** [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b), the chest2vec encoder this model is built on
- **Eval benchmark:** [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench), the radiologist-labeled 400-pair gold set
- **CXR analogue (taxonomy basis):** [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/), Radiologist-Verified Evaluation, chest X-ray (n=200)
- **Source of reference reports:** [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE), a corpus of chest CT volumes + radiology reports

## License

CC-BY-NC-4.0. Released for research use.