---
license: cc-by-nc-4.0
language:
- en
library_name: pytorch
tags:
- radiology
- chest-ct
- report-evaluation
- score
- medical
- rexval
datasets:
- chest2vec/chest2error-bench
base_model: chest2vec/chest2vec_0.6b
pipeline_tag: text-classification
---

# chest2err — Sentence-grounded Error Score for Chest CT Reports

**chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means no errors were detected in the candidate report; 0.72 means one error; 0.19 or below means five or more errors.

The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.

Built on the [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) backbone with LoRA fine-tuning + a 4-layer Transformer decoder. **All backbone and decoder weights are bundled in this repository** — no further downloads are required at inference time.

Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).

## The chest2err-score

```
chest2err_score = exp(−K_total / τ)        # τ = 3.0 (default)
```

where `K_total` is the total number of error tuples emitted by the decoder (the terminating EOS step is not counted) and `τ` is a display temperature (`score_temperature` in `chest2err_config.json`).

| chest2err-score | K_total | interpretation |
|---:|---:|---|
| **1.00** | 0 | perfect — no errors detected |
| 0.72 | 1 | one error |
| 0.51 | 2 | two errors |
| 0.37 | 3 | substantial errors |
| ≤ 0.19 | ≥ 5 | severely degraded |

Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**

The temperature `τ` only rescales the displayed number for human readability: at `τ = 3.0` a single error yields 0.72 instead of collapsing to 0.37. Set `τ=1.0` to recover the original `exp(−K_total)` scale (1 → 0.37, 2 → 0.14). Because `exp(−K_total/τ)` is strictly decreasing in `K_total` for any `τ>0`, the score is **rank-equivalent to `−K_total`**, so all Kendall τ_b results transfer unchanged from the count form regardless of `τ`. (The temperature `τ` is unrelated to Kendall's τ_b; the two only share a letter.)
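
As a quick check of the table and of the rank-equivalence claim, the sketch below recomputes the score from a bare error count. `score_from_count` is a hypothetical helper written for this illustration; it is not part of `chest2err.py`, whose functions take report text rather than a count.

```python
import math

def score_from_count(k_total: int, tau: float = 3.0) -> float:
    """Map an error count to a score in (0, 1] via exp(-k_total / tau)."""
    if k_total < 0:
        raise ValueError("k_total must be non-negative")
    if tau <= 0:
        raise ValueError("tau must be positive")
    return math.exp(-k_total / tau)

# Scores at the default temperature, rounded as in the table above.
table = {k: round(score_from_count(k), 2) for k in (0, 1, 2, 3, 5)}

# Toy error counts for five candidates (illustrative values only).
counts = [4, 0, 2, 7, 1]
best_first_by_count = sorted(range(len(counts)), key=lambda i: counts[i])

# Sorting by descending score gives the same order for every tau > 0.
orders = [
    sorted(range(len(counts)), key=lambda i: -score_from_count(counts[i], tau))
    for tau in (0.5, 1.0, 3.0, 10.0)
]
same_ranking = all(order == best_first_by_count for order in orders)
```

Changing `tau` moves the displayed values but never reorders candidates, which is why rank-based statistics are unaffected by the temperature.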

## Headline metrics

Evaluated on the 400-pair `chest2error-bench` gold set:

| metric | value |
|---|---|
| Kendall τ_b vs total errors | +0.665 |
| **Kendall τ_b vs Critical errors** (radiologist labels) | **+0.763** |
| Kendall τ_b vs severity-weighted errors (radiologist labels) | +0.734 |
| **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
| Critical-error AUROC | 0.963 |
| MAE of K_total | 1.12 |
| **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |

The τ_b numbers against Critical / severity-weighted errors use the **radiologist's** severity labels in the gold set (the model itself does not output severity in v0.1; see Limitations). They demonstrate that the predicted `K_total` correlates strongly with the human Critical-error count even without an explicit severity head.
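
For readers who want to compute the same kind of correlation on their own labels, here is a minimal pure-Python sketch of Kendall τ_b with the usual tie correction. The arrays are made-up toy values, not benchmark data, and `kendall_tau_b` is a helper written for this example (a library routine such as `scipy.stats.kendalltau` computes the same statistic).

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)), n1/n2 = tied pairs."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            tied_x += 1          # pair tied in x (possibly also in y)
        if dy == 0:
            tied_y += 1          # pair tied in y (possibly also in x)
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n0 = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n0 - tied_x) * (n0 - tied_y))
    return (concordant - discordant) / denom if denom else float("nan")

# Toy data: predicted K_total and a human error count for six report pairs.
pred_k = [0, 1, 1, 3, 5, 2]
gold_k = [0, 1, 2, 2, 6, 2]

tau_from_counts = kendall_tau_b(pred_k, gold_k)

# Same statistic from the score: higher score should pair with fewer gold errors.
scores = [math.exp(-k / 3.0) for k in pred_k]
tau_from_scores = kendall_tau_b(scores, [-g for g in gold_k])
```

Both calls see the same concordant, discordant, and tied pairs, so the two values coincide.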

For comparison, τ_b on the same benchmark: BLEU +0.235, BERTScore +0.254, RadGraph +0.232, RadCliQ +0.239, GREEN +0.047, CRIMSON-GPT (gpt-5.2) +0.530. Against its Critical-error τ_b of +0.763, chest2err leads every metric listed here by **≥ +0.23 τ_b** (the closest is CRIMSON-GPT, at 0.763 − 0.530 = 0.233).

### CXR/CT generalization

| corpus | τ_b vs Critical |
|---|---|
| ReXVal (CXR, n=200) | +0.682 |
| Chest CT (this benchmark, n=400) | **+0.763** |

Most prior metrics lose 0.4–0.7 τ_b crossing from CXR to CT. chest2err is the only metric that *gains* on CT — because it was trained on CT.

## Architecture

| component | spec |
|---|---|
| Backbone | [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) (596 M params, bf16) — fully merged into this repo |
| chest2err LoRA | rank 32, α 64, dropout 0.05 — merged into the backbone weights shipped here |
| Decoder | 4-layer Transformer, 8 heads, FFN 2048 |
| Max decode steps | 24 (hard cap; suffices for max-K=17 observed in radiologist gold) |
| Output tuple | `(cat 1-5, anat 0-8, concept, ref_seg_idx, cand_seg_idx)` |
| Pooling | mean-pool tokens within each sentence; prepend learnable NULL_REF and NULL_CAND vectors per side |

The decoder **cross-attends** over the concatenated reference + candidate sentence-pool memory `M`. At each step it predicts a tuple, where `cat = 0` is the EOS token. The error count is `len(seq) − 1`: every decoded step except the final EOS.
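
The stopping rule can be sketched as a plain greedy loop. The real decoder lives in `chest2err_modeling.py`; here `step_fn` is a hypothetical stand-in that returns the next tuple given the prefix, so only the EOS and hard-cap logic is illustrated.

```python
from typing import Callable, Dict, List

EOS_CAT = 0            # cat == 0 terminates decoding
MAX_DECODE_STEPS = 24  # hard cap from the architecture table

ErrorTuple = Dict[str, int]

def greedy_decode(step_fn: Callable[[List[ErrorTuple]], ErrorTuple],
                  max_steps: int = MAX_DECODE_STEPS) -> List[ErrorTuple]:
    """Call step_fn on the running prefix until it emits EOS or the cap is hit."""
    seq: List[ErrorTuple] = []
    for _ in range(max_steps):
        tup = step_fn(seq)
        seq.append(tup)
        if tup["cat"] == EOS_CAT:
            break
    return seq

def count_errors(seq: List[ErrorTuple]) -> int:
    """Number of non-EOS tuples; equals len(seq) - 1 when decoding ended on EOS."""
    return sum(1 for t in seq if t["cat"] != EOS_CAT)

def toy_step(prefix: List[ErrorTuple]) -> ErrorTuple:
    """Scripted stand-in for the decoder: two errors, then EOS."""
    scripted = [
        {"cat": 1, "anat": 0, "concept": 12, "ref_seg_idx": 0, "cand_seg_idx": 0},
        {"cat": 2, "anat": 1, "concept": 40, "ref_seg_idx": 1, "cand_seg_idx": -1},
    ]
    if len(prefix) < len(scripted):
        return scripted[len(prefix)]
    return {"cat": EOS_CAT, "anat": 0, "concept": 0, "ref_seg_idx": -1, "cand_seg_idx": -1}

seq = greedy_decode(toy_step)
k_total = count_errors(seq)
```

Counting non-EOS tuples, rather than subtracting one blindly, keeps the count correct in the rare case where the cap is reached before EOS is emitted.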

Mean-pooling sentences before the decoder makes the encoder **paraphrase-robust** (inherits chest2vec's contrastive properties) and the decoder **permutation-invariant** with respect to sentence order.
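
A framework-free sketch of the pooling step, assuming token vectors are already available as lists of floats. The function names and the constant NULL vectors are assumptions for illustration; in the model the NULL vectors are learned parameters and the token vectors come from the backbone.

```python
from typing import List, Sequence

Vector = List[float]

def mean_pool(token_vectors: Sequence[Vector]) -> Vector:
    """Average the token vectors of one sentence, dimension by dimension."""
    if not token_vectors:
        raise ValueError("a sentence needs at least one token vector")
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / len(token_vectors) for d in range(dim)]

def build_memory(ref_sents, cand_sents, null_ref: Vector, null_cand: Vector) -> List[Vector]:
    """Memory M = [NULL_REF, ref sentence pools..., NULL_CAND, cand sentence pools...]."""
    memory = [list(null_ref)]
    memory += [mean_pool(sent) for sent in ref_sents]
    memory.append(list(null_cand))
    memory += [mean_pool(sent) for sent in cand_sents]
    return memory

# Toy 2-dimensional token vectors: two reference sentences, one candidate sentence.
ref_sents = [[[1.0, 0.0], [3.0, 2.0]], [[0.0, 4.0]]]
cand_sents = [[[2.0, 2.0], [4.0, 0.0], [0.0, 1.0]]]

memory = build_memory(ref_sents, cand_sents, null_ref=[0.5, 0.5], null_cand=[-0.5, -0.5])
```

Each sentence contributes exactly one memory slot regardless of its length, which is what lets the pointer heads address whole sentences.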

## Files

| file | size | purpose |
|---|---|---|
| `model.safetensors` | ~1.1 GB | merged backbone weights (chest2vec_0.6b + chest2err LoRA, fused) |
| `config.json` | <1 KB | backbone architecture config |
| `decoder.safetensors` | ~207 MB | decoder + null embeddings + heads |
| `chest2err_modeling.py` | 14 KB | decoder architecture (the `CADAD` class) |
| `chest2err.py` | 6 KB | self-contained loader (`chest2err_score`, `chest2err_detail`) |
| `chest2err_config.json` | <1 KB | chest2err model meta-config |
| `tokenizer.json`, `vocab.json`, etc. | ~14 MB | tokenizer files |

Total: ~1.36 GB. Everything required to run chest2err is in this repository.

## Quick start

```python
from chest2err import chest2err_score, chest2err_detail

ref  = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
cand = "[Lungs] Several pulmonary nodules in the left upper lobe."

score = chest2err_score(ref, cand)
# 0.37 — substantial errors (K_total = 3, τ = 3.0)

detail = chest2err_detail(ref, cand)
# detail["score"]           — chest2err-score in (0, 1]
# detail["K_total"]         — integer total error count
# detail["tuples"]          — list of {cat, anat, ref_seg_idx, cand_seg_idx, …}
# detail["category_counts"] — per-category breakdown
# detail["anatomy_counts"]  — per-anatomy breakdown
```

The loader picks up the bundled weights automatically; no extra setup beyond `pip install transformers torch peft safetensors` is needed.

## Output schema

The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total / τ)` with `τ = 3.0` as above). The score is backed by a sequence of structured error tuples:

```python
{
    "cat":          int,  # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
    "anat":         int,  # 0..8 (Lungs & Airways, Pleura, ... Others)
    "concept":      int,  # leaf concept id (clinical finding vocabulary)
    "ref_seg_idx":  int,  # -1 = NULL_REF, otherwise sentence index in reference report
    "cand_seg_idx": int,  # -1 = NULL_CAND, otherwise sentence index in candidate report
}
```

`cat == 0` is the EOS marker; the model stops when it emits it. `K_total` is the number of tuples emitted before EOS (`len(seq) − 1` when the decoded sequence `seq` includes the EOS step), and `chest2err_score = exp(−K_total / τ)` with `τ = 3.0`.
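
To show how the pointers turn tuples into explanations, the sketch below renders each tuple as a readable line and tallies categories. The tuples are hand-written toy values, not model output, and the id-to-name mapping assumes the categories are numbered in the order the schema comment lists them.

```python
from collections import Counter

# Id-to-name order follows the schema comment above; treat it as an assumption.
CATEGORY_NAMES = {
    1: "false_prediction", 2: "omission", 3: "location",
    4: "severity", 5: "comparison",
}

def cite(idx, sentences, null_name):
    """Resolve a pointer: -1 means the NULL slot, otherwise a sentence index."""
    return null_name if idx == -1 else f"[{idx}] {sentences[idx]}"

def explain(tuples, ref_sents, cand_sents):
    """One readable line per error tuple, skipping the EOS marker (cat == 0)."""
    lines = []
    for t in tuples:
        if t["cat"] == 0:
            continue
        ref = cite(t["ref_seg_idx"], ref_sents, "NULL_REF")
        cand = cite(t["cand_seg_idx"], cand_sents, "NULL_CAND")
        lines.append(f"{CATEGORY_NAMES[t['cat']]}: ref={ref} | cand={cand}")
    return lines

def category_counts(tuples):
    """Per-category tally of the non-EOS tuples."""
    return Counter(CATEGORY_NAMES[t["cat"]] for t in tuples if t["cat"] != 0)

# Hand-written toy example (not model output).
ref_sents = ["No pleural effusion.", "Stable 4 mm nodule in the right upper lobe."]
cand_sents = ["Small left pleural effusion."]
tuples = [
    {"cat": 1, "anat": 1, "concept": 0, "ref_seg_idx": 0, "cand_seg_idx": 0},
    {"cat": 2, "anat": 0, "concept": 1, "ref_seg_idx": 1, "cand_seg_idx": -1},
]
report = explain(tuples, ref_sents, cand_sents)
counts = category_counts(tuples)
```

An omission naturally points at `NULL_CAND`, since no candidate sentence corresponds to the missing finding.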

## Training data

Trained on `chest2vec/chest2err-train` (in preparation): **53,881 (reference, candidate, labeled_errors)** triples spanning 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (radiologist gold).

### Variant generation (LLM-injected errors)

Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus. For each reference report we prompted **GPT-4o-mini** to produce four candidate variants that **deliberately insert a controlled number of errors** drawn from the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) 6-category error taxonomy. The LLM was instructed to also output, for every inserted error, a structured label:

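- **error category** (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison; the two comparison categories are merged into the single `comparison` class of the model's 5-category output)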
- **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
- **target finding concept** (leaf finding from the chest CT vocabulary)

Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model is supervised to *reproduce* this structured error trace given only the (reference, candidate) input.
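
A sketch of what one such record and its decoder targets might look like. The field names, the toy report text, and both id mappings are assumptions for illustration (the training set is not yet released); the mappings simply follow the order in which the lists above name the categories and anatomy sections.

```python
# Assumed merge: both ReXVal comparison categories map to the single class 5.
REXVAL6_TO_CAT5 = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 5}

# Assumed anatomy ids 0..8, in the order listed above.
ANATOMY_IDS = {name: i for i, name in enumerate([
    "Lungs & Airways", "Pleura", "Mediastinum & Hila", "Cardiovascular",
    "Chest Wall", "Bones / Spine", "Upper Abdomen", "Lower Neck", "Others",
])}

# Hypothetical record layout with toy text.
record = {
    "reference": "No pleural effusion. Nodule unchanged from prior study.",
    "candidate": "Small left pleural effusion. Nodule in the right upper lobe.",
    "labeled_errors": [
        {"category": 1, "anatomy": "Pleura", "concept": "pleural effusion"},
        {"category": 6, "anatomy": "Lungs & Airways", "concept": "pulmonary nodule"},
    ],
}

def to_targets(rec):
    """Convert labeled errors into decoder targets, ending with the EOS step."""
    targets = [
        {
            "cat": REXVAL6_TO_CAT5[err["category"]],
            "anat": ANATOMY_IDS[err["anatomy"]],
            "concept": err["concept"],
        }
        for err in rec["labeled_errors"]
    ]
    targets.append({"cat": 0})  # cat == 0 marks EOS
    return targets

targets = to_targets(record)
k_total = len(targets) - 1  # number of labeled errors
```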

### Training objective

Supervised teacher-forced training on the LLM-labeled error sequences:

- **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
- **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
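
The two loss families above can be sketched for a single decoder step as a sum of cross-entropies. This is a pure-Python illustration under stated assumptions: equal weighting of all five heads, a pointer layout with the NULL slot at index 0, and a toy concept vocabulary of size 10. None of these details are confirmed by the model card.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index (stable form)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]

def pointer_target(seg_idx):
    """Assumed layout: slot 0 is the NULL vector, sentence i sits at slot i + 1."""
    return seg_idx + 1

def step_loss(logits_by_head, gold):
    """Unweighted sum of the three token losses and the two pointer losses."""
    loss = 0.0
    for head in ("cat", "anat", "concept"):
        loss += cross_entropy(logits_by_head[head], gold[head])
    for head in ("ref_seg_idx", "cand_seg_idx"):
        loss += cross_entropy(logits_by_head[head], pointer_target(gold[head]))
    return loss

# One step with uniform logits, so each head contributes log(num_classes).
logits_by_head = {
    "cat": [0.0] * 6,           # EOS + 5 categories
    "anat": [0.0] * 9,          # 9 anatomy sections
    "concept": [0.0] * 10,      # toy concept vocabulary size
    "ref_seg_idx": [0.0] * 3,   # NULL_REF + 2 reference sentences
    "cand_seg_idx": [0.0] * 2,  # NULL_CAND + 1 candidate sentence
}
gold = {"cat": 2, "anat": 1, "concept": 7, "ref_seg_idx": 1, "cand_seg_idx": -1}
loss = step_loss(logits_by_head, gold)
```

Under teacher forcing, the sequence loss would be this quantity accumulated over every gold step, with the gold prefix fed to the decoder at each position.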

Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive adapter and the chest2err LoRA are merged into the bundled weights here.
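
For intuition on what merging means, the sketch below folds a low-rank update into a weight matrix using the standard LoRA rule `W' = W + (α / r) · B · A`, which is assumed here rather than confirmed for this model. With the listed hyperparameters (rank 32, α 64) the scale would be 2.0; the toy example uses rank 2 and α 4 to keep the same ratio with tiny matrices.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the fused weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

# Toy shapes: W is (out=2, in=2), A is (rank=2, in=2), B is (out=2, rank=2).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0], [0.0, 1.0]]
lora_b = [[1.0, 0.0], [1.0, 1.0]]
merged = merge_lora(w, lora_a, lora_b, alpha=4.0, rank=2)

# The fused weight gives the same output as base path + scaled adapter path.
x = [[1.0], [1.0]]
fused_out = matmul(merged, x)
base_out = matmul(w, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
split_out = [[base_out[i][0] + 2.0 * adapter_out[i][0]] for i in range(2)]
```

Because the fused matrix reproduces the adapter's effect exactly, a merged checkpoint needs no adapter files at load time.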

### Why this works

- GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us **noiseless K** at training time.
- The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment** with τ_b vs Critical = +0.763.
- Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable** — every emitted error tuple cites its source sentences.

## Limitations

- **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor errors. (This clinical-significance grading is separate from the `severity` error category, which flags a finding whose severity is misstated.) GPT-4o-mini's variant labels do not include Critical/Minor grades, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total / τ)` (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
- **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
- **English only.** Trained on English chest CT reports from CT-RATE.
- **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
- **24-error hard cap.** Reports with > 24 errors are clipped (rare; max observed in gold = 17).
- **Single-radiologist gold.** Inter-rater calibration is in progress.

## Citations

If you use chest2err, please cite ReXVal (basis for the taxonomy and endpoint), CT-RATE (source of chest CT reports), and this model:

```bibtex
@misc{rexval2023,
  title     = {Radiology Report Expert Evaluation ({ReXVal}) Dataset},
  author    = {Yu, F. and Endo, M. and Krishnan, R. and others},
  year      = {2023},
  publisher = {PhysioNet},
  url       = {https://physionet.org/content/rexval-dataset/1.0.0/}
}

@misc{hamamci2024ctrate,
  title         = {A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities},
  author        = {Hamamci, Ibrahim Ethem and Er, Sezgin and Almas, Furkan and others},
  year          = {2024},
  eprint        = {2403.17834},
  archivePrefix = {arXiv},
  url           = {https://huggingface.co/datasets/ibrahimhamamci/CT-RATE}
}

@misc{chest2err2026,
  title  = {chest2err: Sentence-grounded Error Score for Chest CT Reports},
  author = {chest2vec contributors},
  year   = {2026},
  url    = {https://huggingface.co/chest2vec/chest2err}
}
```

## Related

- **Backbone:** [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) — the chest2vec encoder this model is built on
- **Eval benchmark:** [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) — radiologist-labeled 400-pair gold set
- **CXR analogue (taxonomy basis):** [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/), the Radiology Report Expert Evaluation dataset, chest X-ray (n=200)
- **Source of reference reports:** [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) — chest CT volumes + radiology reports corpus

## License

CC-BY-NC-4.0. Released for research use.