---
license: cc-by-nc-4.0
language:
- en
library_name: pytorch
tags:
- radiology
- chest-ct
- report-evaluation
- score
- medical
- rexval
datasets:
- chest2vec/chest2error-bench
base_model: chest2vec/chest2vec_0.6b
pipeline_tag: text-classification
---
# chest2err: Sentence-grounded Error Score for Chest CT Reports
**chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means no errors were detected in the candidate report; 0.72 means one error; 0.19 or below means five or more errors.
The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.
Built on the [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) backbone with LoRA fine-tuning and a 4-layer Transformer decoder. **All backbone and decoder weights are bundled in this repository**; no further downloads are required at inference time.
Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).
## The chest2err-score
```
chest2err_score = exp(−K_total / τ)   # τ = 3.0 (default)
```
where `K_total` is the total number of error tuples emitted by the decoder and `τ` is a display temperature (`score_temperature` in `chest2err_config.json`).
| chest2err-score | K_total | interpretation |
|---:|---:|---|
| **1.00** | 0 | perfect: no errors detected |
| 0.72 | 1 | one error |
| 0.51 | 2 | two errors |
| 0.37 | 3 | substantial errors |
| ≤ 0.19 | ≥ 5 | severely degraded |
Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
The temperature `τ` only rescales the displayed number for human readability, so that a single error does not collapse the score. Set `τ=1.0` to recover the original `exp(−K_total)` scale (one error → 0.37, two errors → 0.14). Because `exp(−K_total/τ)` is a strictly decreasing function of `K_total` for any `τ>0`, the score is **rank-equivalent to `−K_total`**, so all Kendall τ_b benchmarks transfer unchanged from the count form regardless of `τ`.
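As a quick check of the table above, the mapping from error count to score can be reproduced in a few lines. This is a minimal sketch: the function name `chest2err_score_from_count` is hypothetical and is not part of the bundled loader.

```python
import math


def chest2err_score_from_count(k_total: int, tau: float = 3.0) -> float:
    """Map an error count to a score in (0, 1] via exp(-K_total / tau)."""
    if k_total < 0:
        raise ValueError("k_total must be non-negative")
    if tau <= 0:
        raise ValueError("tau must be positive")
    return math.exp(-k_total / tau)


# Reproduce the rows of the interpretation table (default tau = 3.0).
for k in (0, 1, 2, 3, 5):
    print(k, round(chest2err_score_from_count(k), 2))

# With tau = 1.0 the original exp(-K_total) scale is recovered.
print(round(chest2err_score_from_count(1, tau=1.0), 2))
```

Because the exponential is strictly decreasing in `k_total`, changing `tau` changes the printed values but never the ordering of two reports.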
## Headline metrics
Evaluated on the 400-pair `chest2error-bench` gold set:
| metric | value |
|---|---|
| Kendall τ_b vs total errors | +0.665 |
| **Kendall τ_b vs Critical errors** (radiologist labels) | **+0.763** |
| Kendall τ_b vs severity-weighted errors (radiologist labels) | +0.734 |
| **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
| Critical-error AUROC | 0.963 |
| MAE of K_total | 1.12 |
| **chest2err-score on GT-S / GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
The τ_b numbers against Critical / severity-weighted errors use the **radiologist's** severity labels in the gold set (the model itself does not output severity in v0.1; see Limitations). They demonstrate that the predicted `K_total` correlates strongly with the human Critical-error count even without an explicit severity head.
For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.
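For readers who want to see how a τ_b figure of this kind is computed, the sketch below implements Kendall τ_b in plain Python and checks on toy data that the display temperature does not change the correlation. The counts are made up for illustration and are not taken from the benchmark; the evaluation script actually used for the numbers above is not shown in this card.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b with tie correction, O(n^2) over all pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both sides: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data (hypothetical): predicted K_total and gold error counts.
pred_k = [0, 1, 1, 3, 5]
gold_k = [0, 1, 2, 2, 6]

tau_count = kendall_tau_b(pred_k, gold_k)
# Negated scores at two temperatures rank the reports exactly like the raw count.
tau_t1 = kendall_tau_b([-math.exp(-k / 1.0) for k in pred_k], gold_k)
tau_t3 = kendall_tau_b([-math.exp(-k / 3.0) for k in pred_k], gold_k)
print(tau_count, tau_t1, tau_t3)
```

All three values agree, which is the rank-equivalence property stated for the score. For large evaluations, `scipy.stats.kendalltau` (whose default variant is τ_b) is a faster alternative.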
### CXR/CT generalization
| corpus | Ο_b vs Critical |
|---|---|
| ReXVal (CXR, n=200) | +0.682 |
| Chest CT (this benchmark, n=400) | **+0.763** |
Most prior metrics lose 0.4–0.7 τ_b crossing from CXR to CT. chest2err is the only metric that *gains* on CT, because it was trained on CT.
## Architecture
| component | spec |
|---|---|
| Backbone | [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) (596 M params, bf16), fully merged into this repo |
| chest2err LoRA | rank 32, α 64, dropout 0.05; merged into the backbone weights shipped here |
| Decoder | 4-layer Transformer, 8 heads, FFN 2048 |
| Max decode steps | 24 (hard cap; suffices for max-K=17 observed in radiologist gold) |
| Output tuple | `(cat 1-5, anat 0-8, concept, ref_seg_idx, cand_seg_idx)` |
| Pooling | mean-pool tokens within each sentence; prepend learnable NULL_REF and NULL_CAND vectors per side |
The decoder is **cross-attended** over the concatenated reference + candidate sentence-pool memory `M`. At each step it predicts a tuple where `cat = 0` is the EOS token. Counts emerge as `len(seq) − 1`.
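The stopping rule can be illustrated with a small greedy loop. This is a sketch under assumptions, not the shipped decoder: `step_fn` is a hypothetical stand-in for one decoder step, and the loop simply collects tuples until EOS (`cat == 0`) or the 24-step cap is reached.

```python
from typing import Callable, Dict, List

EOS_CAT = 0            # cat = 0 marks end of sequence
MAX_DECODE_STEPS = 24  # hard cap from the architecture table

ErrorTuple = Dict[str, int]


def greedy_decode(
    step_fn: Callable[[List[ErrorTuple]], ErrorTuple],
    max_steps: int = MAX_DECODE_STEPS,
) -> List[ErrorTuple]:
    """Collect error tuples until EOS is emitted or the step cap is hit."""
    errors: List[ErrorTuple] = []
    for _ in range(max_steps):
        tup = step_fn(errors)  # hypothetical: predicts the next tuple from the prefix
        if tup["cat"] == EOS_CAT:
            break
        errors.append(tup)
    return errors


def two_errors_then_eos(prefix: List[ErrorTuple]) -> ErrorTuple:
    """Stub step function for illustration: two errors, then EOS."""
    cat = 1 if len(prefix) < 2 else EOS_CAT
    return {"cat": cat, "anat": 0, "concept": 0, "ref_seg_idx": -1, "cand_seg_idx": 0}


decoded = greedy_decode(two_errors_then_eos)
print(len(decoded))  # K_total for the stub
```

Here the EOS tuple is dropped before counting, so `len(decoded)` is the error count directly; counting it and subtracting one gives the same number.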
Mean-pooling sentences before the decoder makes the encoder **paraphrase-robust** (inherits chest2vec's contrastive properties) and the decoder **permutation-invariant** with respect to sentence order.
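The pooling step can be sketched as follows. Plain Python lists stand in for tensors here, and the function names are hypothetical; the real implementation lives in `chest2err_modeling.py` and may differ in detail.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_vectors: Sequence[Vector]) -> Vector:
    """Average the token vectors of one sentence into a single vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


def build_side_memory(sentences: Sequence[Sequence[Vector]], null_vector: Vector) -> List[Vector]:
    """Prepend the side's NULL vector, then one pooled vector per sentence."""
    return [list(null_vector)] + [mean_pool(tokens) for tokens in sentences]


def build_memory(ref_sentences, cand_sentences, null_ref: Vector, null_cand: Vector) -> List[Vector]:
    """Concatenate reference and candidate sides into the memory M."""
    return build_side_memory(ref_sentences, null_ref) + build_side_memory(cand_sentences, null_cand)


# Toy 2-dimensional token embeddings: two reference sentences, one candidate sentence.
ref = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0]]]
cand = [[[2.0, 2.0], [4.0, 6.0]]]
memory = build_memory(ref, cand, null_ref=[0.5, 0.5], null_cand=[-0.5, -0.5])
print(len(memory))  # (1 + 2) reference slots + (1 + 1) candidate slots
print(memory[1])    # pooled first reference sentence
```

Each side contributes one slot per sentence plus a NULL slot, which gives the pointer heads something to select when an error has no counterpart sentence on that side.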
## Files
| file | size | purpose |
|---|---|---|
| `model.safetensors` | ~1.1 GB | merged backbone weights (chest2vec_0.6b + chest2err LoRA, fused) |
| `config.json` | <1 KB | backbone architecture config |
| `decoder.safetensors` | ~207 MB | decoder + null embeddings + heads |
| `chest2err_modeling.py` | 14 KB | decoder architecture (the `CADAD` class) |
| `chest2err.py` | 6 KB | self-contained loader (`chest2err_score`, `chest2err_detail`) |
| `chest2err_config.json` | <1 KB | chest2err model meta-config |
| `tokenizer.json`, `vocab.json`, etc. | ~14 MB | tokenizer files |
Total: ~1.36 GB. Everything required to run chest2err is in this repository.
## Quick start
```python
from chest2err import chest2err_score, chest2err_detail
ref = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
cand = "[Lungs] Several pulmonary nodules in the left upper lobe."
score = chest2err_score(ref, cand)
# 0.37 → substantial errors (K_total = 3, τ = 3.0)
detail = chest2err_detail(ref, cand)
# detail["score"] → chest2err-score in (0, 1]
# detail["K_total"] → integer total error count
# detail["tuples"] → list of {cat, anat, ref_seg_idx, cand_seg_idx, …}
# detail["category_counts"] → per-category breakdown
# detail["anatomy_counts"] → per-anatomy breakdown
```
The loader picks up the bundled weights automatically; no extra setup beyond `pip install transformers torch peft safetensors` is needed.
## Output schema
The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total / τ)` with `τ = 3.0` as above). The score is backed by a sequence of structured error tuples:
```python
{
"cat": int, # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
"anat": int, # 0..8 (Lungs & Airways, Pleura, ... Others)
"concept": int, # leaf concept id (clinical finding vocabulary)
"ref_seg_idx": int, # -1 = NULL_REF, otherwise sentence index in reference report
"cand_seg_idx": int, # -1 = NULL_CAND, otherwise sentence index in candidate report
}
```
`cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1` (the terminating EOS tuple is not counted), and `chest2err_score = exp(−K_total / τ)` with `τ = 3.0`.
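To show how the schema turns into a score and a breakdown, here is a small post-processing sketch. The tuples are hand-written for illustration (not real model output), `summarize` is a hypothetical helper, and the id-to-name mapping assumes that category ids follow the order in which the five merged categories are listed above.

```python
import math
from collections import Counter

# Assumption: ids 1..5 follow the listing order of the merged categories.
CATEGORY_NAMES = {
    1: "false_prediction",
    2: "omission",
    3: "location",
    4: "severity",
    5: "comparison",
}


def summarize(tuples, tau=3.0):
    """Count non-EOS tuples and derive the score plus simple breakdowns."""
    errors = [t for t in tuples if t["cat"] != 0]  # drop the EOS marker if present
    k_total = len(errors)
    return {
        "K_total": k_total,
        "score": math.exp(-k_total / tau),
        "category_counts": dict(Counter(CATEGORY_NAMES.get(t["cat"], "unknown") for t in errors)),
        "no_reference_sentence": sum(t["ref_seg_idx"] == -1 for t in errors),
        "no_candidate_sentence": sum(t["cand_seg_idx"] == -1 for t in errors),
    }


# Hand-written example sequence: two errors followed by EOS.
example = [
    {"cat": 1, "anat": 0, "concept": 12, "ref_seg_idx": 0, "cand_seg_idx": 0},
    {"cat": 2, "anat": 1, "concept": 40, "ref_seg_idx": 1, "cand_seg_idx": -1},
    {"cat": 0, "anat": 0, "concept": 0, "ref_seg_idx": -1, "cand_seg_idx": -1},
]
summary = summarize(example)
print(summary["K_total"], round(summary["score"], 2), summary["category_counts"])
```

Filtering on `cat != 0` makes the helper give the same count whether or not the input list still contains the EOS tuple.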
## Training data
Trained on `chest2vec/chest2err-train` (in preparation): **53,881 (reference, candidate, labeled_errors)** triples spanning 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (radiologist gold).
### Variant generation (LLM-injected errors)
Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus. For each reference report we prompted **GPT-4o-mini** to produce four candidate variants that **deliberately insert a controlled number of errors** drawn from the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) 6-category error taxonomy. The LLM was instructed to also output, for every inserted error, a structured label:
- **error category** (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison)
- **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
- **target finding concept** (leaf finding from the chest CT vocabulary)
Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model is supervised to *reproduce* this structured error trace given only the (reference, candidate) input.
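One way to picture such a record, and the step from the 6-category labels used at generation time to the 5-category merged output described in the schema, is sketched below. The class names, the numeric ids, and the rule that both comparison categories collapse into a single `comparison` id are assumptions made for illustration; the actual preprocessing code is not part of this card.

```python
from dataclasses import dataclass
from typing import List

# Assumed mapping: ids follow listing order; 5 (spurious_comparison) and
# 6 (omitted_comparison) are merged into the single "comparison" id 5.
SIX_TO_FIVE = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 5}


@dataclass(frozen=True)
class LabeledError:
    category: int  # error category id
    anatomy: int   # anatomy section id
    concept: int   # leaf finding concept id


@dataclass
class TrainingRecord:
    reference: str
    candidate: str
    errors: List[LabeledError]


def merge_categories(record: TrainingRecord) -> TrainingRecord:
    """Return a copy of the record with 6-category ids mapped to 5-category ids."""
    merged = [
        LabeledError(SIX_TO_FIVE[e.category], e.anatomy, e.concept)
        for e in record.errors
    ]
    return TrainingRecord(record.reference, record.candidate, merged)


# Hypothetical record with one omitted-comparison error (6-category id 6).
record = TrainingRecord(
    reference="[Lungs] Stable 4 mm nodule compared with prior.",
    candidate="[Lungs] 4 mm nodule.",
    errors=[LabeledError(category=6, anatomy=0, concept=7)],
)
print(merge_categories(record).errors[0].category)
```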
### Training objective
Supervised teacher-forced training on the LLM-labeled error sequences:
- **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
- **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive adapter and the chest2err LoRA are merged into the bundled weights here.
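The per-step objective can be sketched as a sum of cross-entropy terms, one per head. This is an illustration in plain Python rather than the training code: equal weighting of the five terms and the convention that pointer slot 0 is the NULL vector (so a segment index of -1 maps to slot 0) are assumptions.

```python
import math
from typing import Dict, List


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-softmax of the target class (numerically stable)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]


def pointer_target(seg_idx: int) -> int:
    """Assumed slot layout: -1 (NULL) -> slot 0, sentence i -> slot i + 1."""
    return seg_idx + 1


def step_loss(head_logits: Dict[str, List[float]], gold: Dict[str, int]) -> float:
    """Teacher-forced loss for one decoder step, summed over all heads."""
    loss = 0.0
    for head in ("cat", "anat", "concept"):
        loss += cross_entropy(head_logits[head], gold[head])
    for head in ("ref_seg_idx", "cand_seg_idx"):
        loss += cross_entropy(head_logits[head], pointer_target(gold[head]))
    return loss


# Toy step with uniform logits: each head contributes log(number of classes).
logits = {
    "cat": [0.0] * 6,           # EOS + 5 categories
    "anat": [0.0] * 9,          # 9 anatomy sections
    "concept": [0.0] * 4,       # toy concept vocabulary
    "ref_seg_idx": [0.0] * 3,   # NULL_REF + 2 reference sentences
    "cand_seg_idx": [0.0] * 2,  # NULL_CAND + 1 candidate sentence
}
gold = {"cat": 2, "anat": 1, "concept": 3, "ref_seg_idx": 1, "cand_seg_idx": -1}
print(step_loss(logits, gold))
```

With uniform logits the loss equals the log of the product of the head sizes, which is a convenient sanity check before plugging in real logits.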
### Why this works
- GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us **noiseless K** at training time.
- The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment** with τ_b vs Critical = +0.763.
- Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable**: every emitted error tuple cites its source sentences.
## Limitations
- **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total / τ)` (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
- **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
- **English only.** Trained on English chest CT reports from CT-RATE.
- **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
- **24-error hard cap.** Reports with > 24 errors are clipped (rare; max observed in gold = 17).
- **Single-radiologist gold.** Inter-rater calibration is in progress.
## Citations
If you use chest2err, please cite ReXVal (basis for the taxonomy and endpoint), CT-RATE (source of chest CT reports), and this model:
```bibtex
@misc{rexval2023,
title = {{ReXVal}: Radiologist-Verified Evaluation of Automated Radiology Report Metrics},
author = {Yu, F. and Endo, M. and Krishnan, R. and others},
year = {2023},
publisher = {PhysioNet},
url = {https://physionet.org/content/rexval-dataset/1.0.0/}
}
@misc{hamamci2024ctrate,
title = {A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities},
author = {Hamamci, Ibrahim Ethem and Er, Sezgin and Almas, Furkan and others},
year = {2024},
eprint = {2403.17834},
archivePrefix = {arXiv},
url = {https://huggingface.co/datasets/ibrahimhamamci/CT-RATE}
}
@misc{chest2err2026,
title = {chest2err: Sentence-grounded Error Score for Chest CT Reports},
author = {chest2vec contributors},
year = {2026},
url = {https://huggingface.co/chest2vec/chest2err}
}
```
## Related
- **Backbone:** [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) – the chest2vec encoder this model is built on
- **Eval benchmark:** [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) – radiologist-labeled 400-pair gold set
- **CXR analogue (taxonomy basis):** [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) – Radiologist-Verified Evaluation, chest X-ray (n=200)
- **Source of reference reports:** [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) – chest CT volumes + radiology reports corpus
## License
CC-BY-NC-4.0. Released for research use.