---
license: cc-by-nc-4.0
language:
- en
library_name: pytorch
tags:
- radiology
- chest-ct
- report-evaluation
- score
- medical
- rexval
datasets:
- chest2vec/chest2error-bench
base_model: chest2vec/chest2vec_0.6b
pipeline_tag: text-classification
---
# chest2err: Sentence-grounded Error Score for Chest CT Reports
**chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means no errors were detected, 0.72 corresponds to one error, and values below 0.20 correspond to five or more errors.
The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.
Built on the [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) backbone with LoRA fine-tuning + a 4-layer Transformer decoder. **All backbone and decoder weights are bundled in this repository**, so no further downloads are required at inference time.
Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).
## The chest2err-score
```
chest2err_score = exp(−K_total / τ)   # τ = 3.0 (default)
```
where `K_total` is the total number of error tuples emitted by the decoder and `τ` is a display temperature (`score_temperature` in `chest2err_config.json`).
| chest2err-score | K_total | interpretation |
|---:|---:|---|
| **1.00** | 0 | perfect: no errors detected |
| 0.72 | 1 | one error |
| 0.51 | 2 | two errors |
| 0.37 | 3 | substantial errors |
| ≤ 0.19 | ≥ 5 | severely degraded |
Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
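
The table rows follow directly from the formula. A minimal sketch of the mapping from error count to score is shown below; the function name `chest2err_score_from_count` is hypothetical and is not part of the shipped `chest2err.py` loader.

```python
import math


def chest2err_score_from_count(k_total: int, tau: float = 3.0) -> float:
    """Map an error count to a score in (0, 1] via exp(-K_total / tau).

    Hypothetical helper for illustration; not the loader's own API.
    """
    if k_total < 0:
        raise ValueError("k_total must be non-negative")
    if tau <= 0:
        raise ValueError("tau must be positive")
    return math.exp(-k_total / tau)


# Reproduce the rows of the table above with the default temperature.
for k in (0, 1, 2, 3, 5):
    print(k, round(chest2err_score_from_count(k), 2))
```

With `tau=1.0` the same function gives the original `exp(−K_total)` scale.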
The temperature `τ` only rescales the displayed number for human readability, so a single error no longer collapses the score (it is unrelated to Kendall's τ_b, which is a rank correlation). Set `τ=1.0` to recover the original `exp(−K_total)` scale (1 → 0.37, 2 → 0.14). Because `exp(−K_total/τ)` is a strictly monotone function of `K_total` for any `τ>0`, the score is **rank-equivalent to `−K_total`**, so all Kendall τ_b benchmarks transfer unchanged from the count form regardless of `τ`.
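
The rank-equivalence argument can be checked numerically. The sketch below uses a small pure-Python Kendall τ_b (an O(n²) illustration, not the evaluation code behind the reported numbers) and made-up toy counts. It shows that correlating the negated count or the score at any temperature against the same gold ordering gives the same value.

```python
import math


def kendall_tau_b(x, y):
    """Kendall tau-b with tie correction (simple O(n^2) implementation)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_x = concordant + discordant + only_y_tied  # pairs not tied in x
    untied_y = concordant + discordant + only_x_tied  # pairs not tied in y
    denom = math.sqrt(untied_x * untied_y)
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data (made up for illustration): gold error counts and predicted K_total.
gold_errors = [0, 1, 1, 3, 5, 2]
pred_k = [0, 1, 2, 2, 6, 3]
neg_gold = [-g for g in gold_errors]

tau_from_counts = kendall_tau_b([-k for k in pred_k], neg_gold)
tau_from_score_3 = kendall_tau_b([math.exp(-k / 3.0) for k in pred_k], neg_gold)
tau_from_score_1 = kendall_tau_b([math.exp(-k / 1.0) for k in pred_k], neg_gold)
print(tau_from_counts == tau_from_score_3 == tau_from_score_1)
```

Only the signs of pairwise differences enter τ_b, which is why the display temperature cannot change it.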
## Headline metrics
Evaluated on the 400-pair `chest2error-bench` gold set:
| metric | value |
|---|---|
| Kendall τ_b vs total errors | +0.665 |
| **Kendall τ_b vs Critical errors** (radiologist labels) | **+0.763** |
| Kendall τ_b vs severity-weighted errors (radiologist labels) | +0.734 |
| **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
| Critical-error AUROC | 0.963 |
| MAE of K_total | 1.12 |
| **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
The τ_b numbers against Critical / severity-weighted errors use the **radiologist's** severity labels in the gold set (the model itself does not output severity in v0.1; see Limitations). They demonstrate that the predicted `K_total` correlates strongly with the human Critical-error count even without an explicit severity head.
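
The card does not spell out how pairwise within-anchor accuracy is computed. One plausible reading, sketched below purely as an assumption, is this: among candidates that share the same reference (the anchor), take every pair with different gold error counts and count the pair as correct when the candidate with fewer gold errors receives the strictly higher score. The function name and record layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_within_anchor_accuracy(records):
    """records: iterable of (anchor_id, gold_errors, score) triples.

    Returns (accuracy, number_of_pairs). Assumed definition, for illustration.
    """
    by_anchor = defaultdict(list)
    for anchor_id, gold_errors, score in records:
        by_anchor[anchor_id].append((gold_errors, score))

    correct = total = 0
    for items in by_anchor.values():
        for (g1, s1), (g2, s2) in combinations(items, 2):
            if g1 == g2:
                continue  # gold expresses no preference for this pair
            total += 1
            # The candidate with fewer gold errors should score strictly higher;
            # a tie in the predicted score counts as incorrect.
            if (g1 < g2 and s1 > s2) or (g1 > g2 and s1 < s2):
                correct += 1
    accuracy = correct / total if total else float("nan")
    return accuracy, total


# Toy records (made up): two anchors, one tied prediction in anchor "b".
toy_records = [
    ("a", 0, 1.00), ("a", 1, 0.72), ("a", 3, 0.51),
    ("b", 2, 0.51), ("b", 4, 0.51),
]
toy_accuracy, toy_pairs = pairwise_within_anchor_accuracy(toy_records)
print(toy_accuracy, toy_pairs)
```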
For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats each of these metrics on chest CT by **≥ +0.23 τ_b**.
### CXR/CT generalization
| corpus | τ_b vs Critical |
|---|---|
| ReXVal (CXR, n=200) | +0.682 |
| Chest CT (this benchmark, n=400) | **+0.763** |
Most prior metrics lose 0.4–0.7 τ_b crossing from CXR to CT. chest2err is the only metric that *gains* on CT, because it was trained on CT.
## Architecture
| component | spec |
|---|---|
| Backbone | [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) (596 M params, bf16), fully merged into this repo |
| chest2err LoRA | rank 32, α 64, dropout 0.05; merged into the backbone weights shipped here |
| Decoder | 4-layer Transformer, 8 heads, FFN 2048 |
| Max decode steps | 24 (hard cap; suffices for max-K=17 observed in radiologist gold) |
| Output tuple | `(cat 1-5, anat 0-8, concept, ref_seg_idx, cand_seg_idx)` |
| Pooling | mean-pool tokens within each sentence; prepend learnable NULL_REF and NULL_CAND vectors per side |
The decoder is **cross-attended** over the concatenated reference + candidate sentence-pool memory `M`. At each step it predicts a tuple where `cat = 0` is the EOS token. The error count is `len(seq) − 1`, where `seq` includes the final EOS step.
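
The stop rule and the step cap can be sketched as a small greedy loop. This is an illustration only: `step_fn` is a hypothetical stand-in for one decoder step, and the real loop lives in `chest2err_modeling.py`, which may differ in detail.

```python
from typing import Callable, Dict, List

EOS_CAT = 0            # cat == 0 marks end of sequence
MAX_DECODE_STEPS = 24  # hard cap from the architecture table


def greedy_decode(step_fn: Callable[[List[Dict]], Dict],
                  max_steps: int = MAX_DECODE_STEPS) -> List[Dict]:
    """Call step_fn(prefix) until it returns cat == 0 or the cap is reached.

    Returns only the error tuples; the EOS step is not stored, so
    K_total == len(result).
    """
    errors: List[Dict] = []
    for _ in range(max_steps):
        tup = step_fn(errors)
        if tup["cat"] == EOS_CAT:
            break
        errors.append(tup)
    return errors


def make_stub(n_errors: int) -> Callable[[List[Dict]], Dict]:
    """Hypothetical stand-in decoder: emits n_errors tuples, then EOS."""
    def step_fn(prefix: List[Dict]) -> Dict:
        if len(prefix) < n_errors:
            return {"cat": 1, "anat": 0, "concept": 0,
                    "ref_seg_idx": -1, "cand_seg_idx": 0}
        return {"cat": EOS_CAT}
    return step_fn


k_total = len(greedy_decode(make_stub(2)))     # decoder stops at EOS
k_capped = len(greedy_decode(make_stub(100)))  # decoder is clipped by the cap
print(k_total, k_capped)
```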
Mean-pooling sentences before the decoder makes the encoder **paraphrase-robust** (inherits chest2vec's contrastive properties) and the decoder **permutation-invariant** with respect to sentence order.
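
A minimal sketch of how such a sentence-pool memory could be assembled, using plain Python lists in place of tensors. The function names are hypothetical, and the NULL vectors (learned parameters in the real model) are plain constants here.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_vectors: Sequence[Vector]) -> Vector:
    """Average the token vectors of one sentence into a single vector."""
    if not token_vectors:
        raise ValueError("a sentence needs at least one token vector")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


def build_side_memory(sentences: Sequence[Sequence[Vector]],
                      null_vector: Vector) -> List[Vector]:
    """Prepend the side's NULL vector, then one pooled vector per sentence."""
    return [list(null_vector)] + [mean_pool(tokens) for tokens in sentences]


def build_memory(ref_sentences, cand_sentences, null_ref, null_cand):
    """Concatenate reference-side and candidate-side memories into M."""
    return (build_side_memory(ref_sentences, null_ref)
            + build_side_memory(cand_sentences, null_cand))


# Toy 2-dimensional token vectors: two reference sentences, one candidate sentence.
ref_sents = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 2.0]]]
cand_sents = [[[4.0, 0.0], [0.0, 4.0]]]
memory = build_memory(ref_sents, cand_sents,
                      null_ref=[0.0, 0.0], null_cand=[1.0, 1.0])
print(len(memory))  # (1 + 2) + (1 + 1) slots
```

Under this layout, slot 0 on each side is the NULL entry that a pointer can select when an error has no supporting sentence on that side.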
## Files
| file | size | purpose |
|---|---|---|
| `model.safetensors` | ~1.1 GB | merged backbone weights (chest2vec_0.6b + chest2err LoRA, fused) |
| `config.json` | <1 KB | backbone architecture config |
| `decoder.safetensors` | ~207 MB | decoder + null embeddings + heads |
| `chest2err_modeling.py` | 14 KB | decoder architecture (the `CADAD` class) |
| `chest2err.py` | 6 KB | self-contained loader (`chest2err_score`, `chest2err_detail`) |
| `chest2err_config.json` | <1 KB | chest2err model meta-config |
| `tokenizer.json`, `vocab.json`, etc. | ~14 MB | tokenizer files |
Total: ~1.36 GB. Everything required to run chest2err is in this repository.
## Quick start
```python
from chest2err import chest2err_score, chest2err_detail
ref = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
cand = "[Lungs] Several pulmonary nodules in the left upper lobe."
score = chest2err_score(ref, cand)
# 0.37: substantial errors (K_total = 3, τ = 3.0)
detail = chest2err_detail(ref, cand)
# detail["score"]: chest2err-score in (0, 1]
# detail["K_total"]: integer total error count
# detail["tuples"]: list of {cat, anat, ref_seg_idx, cand_seg_idx, …}
# detail["category_counts"]: per-category breakdown
# detail["anatomy_counts"]: per-anatomy breakdown
```
The loader picks up the bundled weights automatically; no extra setup beyond `pip install transformers torch peft safetensors` is needed.
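
A common use is ranking several candidate reports against one reference. The helper below is a hypothetical sketch that takes the scorer as an argument, so it runs here with a toy stand-in. In practice one would pass `chest2err_score` from the Quick start, assuming it keeps the `(ref, cand) -> float` call shape shown above.

```python
from typing import Callable, List, Tuple


def rank_candidates(reference: str,
                    candidates: List[str],
                    score_fn: Callable[[str, str], float]) -> List[Tuple[float, str]]:
    """Score each candidate against the reference; best (highest) first."""
    scored = [(score_fn(reference, cand), cand) for cand in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(ref: str, cand: str) -> float:
    """Stand-in scorer for illustration only; this is NOT chest2err."""
    return 1.0 if ref == cand else 0.5


reference = "[Pleura] No effusion."
candidates = ["[Pleura] Small effusion.", "[Pleura] No effusion."]
ranked = rank_candidates(reference, candidates, toy_score)
print(ranked[0][1])
```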
## Output schema
The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total / τ)` with `τ = 3.0` as above). The score is backed by a sequence of structured error tuples:
```python
{
"cat": int, # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
"anat": int, # 0..8 (Lungs & Airways, Pleura, ... Others)
"concept": int, # leaf concept id (clinical finding vocabulary)
"ref_seg_idx": int, # -1 = NULL_REF, otherwise sentence index in reference report
"cand_seg_idx": int, # -1 = NULL_CAND, otherwise sentence index in candidate report
}
```
`cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1` when the sequence includes the EOS step, and `chest2err_score = exp(−K_total / τ)` with `τ = 3.0`.
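
The sketch below shows how a tuple sequence could be turned into a count, a score, and human-readable explanations. It makes several assumptions that the card does not confirm: category ids 1..5 follow the order listed in the schema comment, sentence indices are 0-based, and the example tuples are hand-written rather than real model output.

```python
import math
from collections import Counter

# Assumed id order, taken from the order of names in the schema comment.
CATEGORY_NAMES = {1: "false_prediction", 2: "omission", 3: "location",
                  4: "severity", 5: "comparison"}


def explain(tuples, ref_sentences, cand_sentences, tau=3.0):
    """Summarise error tuples; an EOS tuple (cat == 0) is ignored if present."""
    errors = [t for t in tuples if t["cat"] != 0]
    lines = []
    for t in errors:
        ref = ("NULL_REF" if t["ref_seg_idx"] == -1
               else ref_sentences[t["ref_seg_idx"]])
        cand = ("NULL_CAND" if t["cand_seg_idx"] == -1
                else cand_sentences[t["cand_seg_idx"]])
        lines.append(f"{CATEGORY_NAMES[t['cat']]}: ref={ref!r} cand={cand!r}")
    return {
        "K_total": len(errors),
        "score": math.exp(-len(errors) / tau),
        "category_counts": dict(Counter(CATEGORY_NAMES[t["cat"]] for t in errors)),
        "explanations": lines,
    }


ref_sentences = ["[Lungs] No pulmonary nodules.", "[Pleura] No effusion."]
cand_sentences = ["[Lungs] Several pulmonary nodules in the left upper lobe."]
# Hand-written tuples for illustration (not produced by the model).
example_tuples = [
    {"cat": 1, "anat": 0, "concept": 0, "ref_seg_idx": 0, "cand_seg_idx": 0},
    {"cat": 2, "anat": 1, "concept": 0, "ref_seg_idx": 1, "cand_seg_idx": -1},
    {"cat": 0},  # EOS
]
summary = explain(example_tuples, ref_sentences, cand_sentences)
for line in summary["explanations"]:
    print(line)
```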
## Training data
Trained on `chest2vec/chest2err-train` (in preparation): **53,881 (reference, candidate, labeled_errors)** triples spanning 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (radiologist gold).
### Variant generation (LLM-injected errors)
Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus. For each reference report we prompted **GPT-4o-mini** to produce four candidate variants that **deliberately insert a controlled number of errors** drawn from the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) 6-category error taxonomy. The LLM was instructed to also output, for every inserted error, a structured label:
- **error category** (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison)
- **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
- **target finding concept** (leaf finding from the chest CT vocabulary)
Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model is supervised to *reproduce* this structured error trace given only the (reference, candidate) input.
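
A training record of this shape might be represented as below. The class and field names are hypothetical, since the training set is still in preparation. The 6-to-5 category mapping is an assumption inferred from the "5-category merged" note in the output schema, namely that the two comparison categories collapse into one.

```python
from dataclasses import dataclass
from typing import List

REXVAL_6 = {1: "false_prediction", 2: "omission", 3: "location", 4: "severity",
            5: "spurious_comparison", 6: "omitted_comparison"}

# Assumed merge: both comparison categories map to the single output class 5.
MERGE_6_TO_5 = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 5}


@dataclass
class LabeledError:
    category: int  # 1..6, ReXVal taxonomy as labeled by the LLM
    anatomy: str   # e.g. "Pleura"
    concept: str   # leaf finding from the chest CT vocabulary


@dataclass
class TrainingExample:
    reference: str
    candidate: str
    errors: List[LabeledError]

    @property
    def k_total(self) -> int:
        """Number of injected errors, i.e. the supervised K_total."""
        return len(self.errors)

    def merged_categories(self) -> List[int]:
        """Per-error categories in the assumed 5-class output space."""
        return [MERGE_6_TO_5[e.category] for e in self.errors]


# Illustrative record (made up, not drawn from the training set).
example = TrainingExample(
    reference="[Pleura] No effusion. [Lungs] Stable 4 mm nodule.",
    candidate="[Pleura] New small effusion.",
    errors=[LabeledError(5, "Pleura", "pleural effusion"),
            LabeledError(6, "Lungs & Airways", "pulmonary nodule")],
)
print(example.k_total, example.merged_categories())
```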
### Training objective
Supervised teacher-forced training on the LLM-labeled error sequences:
- **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
- **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive adapter and the chest2err LoRA are merged into the bundled weights here.
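
The two loss families above can be sketched as a sum of cross-entropies per decoder step. This is a pure-Python illustration under stated assumptions: all heads use softmax cross-entropy with equal weight, each pointer head scores the NULL slot followed by the side's sentences (so `-1` maps to slot 0), and the sequence loss is the mean over steps. The actual weighting and reduction are not given in this card.

```python
import math
from typing import Dict, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def pointer_target(seg_idx: int) -> int:
    """Map -1 (NULL) to slot 0 and sentence i to slot i + 1 (assumed layout)."""
    return seg_idx + 1


def step_loss(logits: Dict[str, Sequence[float]], gold: Dict[str, int]) -> float:
    """Token losses on cat/anat/concept plus pointer losses on both sides."""
    loss = 0.0
    for head in ("cat", "anat", "concept"):
        loss += cross_entropy(logits[head], gold[head])
    loss += cross_entropy(logits["ref_ptr"], pointer_target(gold["ref_seg_idx"]))
    loss += cross_entropy(logits["cand_ptr"], pointer_target(gold["cand_seg_idx"]))
    return loss


def sequence_loss(step_logits, gold_steps) -> float:
    """Mean of the per-step losses over a teacher-forced sequence."""
    return sum(step_loss(l, g) for l, g in zip(step_logits, gold_steps)) / len(gold_steps)


# Uniform logits give log(num_classes) per head, which makes the sum easy to check.
uniform_logits = {
    "cat": [0.0] * 6,       # EOS + 5 categories
    "anat": [0.0] * 9,      # 9 anatomy sections
    "concept": [0.0] * 4,   # toy concept vocabulary of size 4
    "ref_ptr": [0.0] * 3,   # NULL_REF + 2 reference sentences
    "cand_ptr": [0.0] * 2,  # NULL_CAND + 1 candidate sentence
}
gold_step = {"cat": 2, "anat": 1, "concept": 3, "ref_seg_idx": 1, "cand_seg_idx": -1}
toy_loss = sequence_loss([uniform_logits], [gold_step])
print(toy_loss)
```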
### Why this works
- GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us **noiseless K** at training time.
- The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment** with τ_b vs Critical = +0.763.
- Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable**: every emitted error tuple cites its source sentences.
## Limitations
- **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total / τ)` (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
- **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
- **English only.** Trained on English chest CT reports from CT-RATE.
- **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
- **24-error hard cap.** Reports with > 24 errors are clipped (rare; max observed in gold = 17).
- **Single-radiologist gold.** Inter-rater calibration is in progress.
## Citations
If you use chest2err, please cite ReXVal (basis for the taxonomy and endpoint), CT-RATE (source of chest CT reports), and this model:
```bibtex
@misc{rexval2023,
title = {{ReXVal}: Radiologist-Verified Evaluation of Automated Radiology Report Metrics},
author = {Yu, F. and Endo, M. and Krishnan, R. and others},
year = {2023},
publisher = {PhysioNet},
url = {https://physionet.org/content/rexval-dataset/1.0.0/}
}
@misc{hamamci2024ctrate,
title = {A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities},
author = {Hamamci, Ibrahim Ethem and Er, Sezgin and Almas, Furkan and others},
year = {2024},
eprint = {2403.17834},
archivePrefix = {arXiv},
url = {https://huggingface.co/datasets/ibrahimhamamci/CT-RATE}
}
@misc{chest2err2026,
title = {chest2err: Sentence-grounded Error Score for Chest CT Reports},
author = {chest2vec contributors},
year = {2026},
url = {https://huggingface.co/chest2vec/chest2err}
}
```
## Related
- **Backbone:** [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b), the chest2vec encoder this model is built on
- **Eval benchmark:** [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench), radiologist-labeled 400-pair gold set
- **CXR analogue (taxonomy basis):** [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/), Radiologist-Verified Evaluation, chest X-ray (n=200)
- **Source of reference reports:** [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE), chest CT volumes + radiology reports corpus
## License
CC-BY-NC-4.0. Released for research use.