File size: 12,358 Bytes

8a9746d
 
 
 
 
 
 
 
 
47f2d5a
8a9746d
 
 
 
47f2d5a
8a9746d
 
 
b7103ad
8a9746d
06de0a9
8a9746d
47f2d5a
b7103ad
47f2d5a
8a9746d
 
 
b7103ad
 
 
06de0a9
b7103ad
 
06de0a9
b7103ad
b743033
b7103ad
b743033
06de0a9
 
 
 
b7103ad
 
 
06de0a9
b743033
8a9746d
 
 
 
 
 
 
47f2d5a
 
8a9746d
 
b7103ad
 
8a9746d
47f2d5a
b743033
8a9746d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
47f2d5a
 
8a9746d
47f2d5a
 
8a9746d
 
 
 
 
 
 
 
47f2d5a
 
 
 
 
 
 
 
 
 
 
8a9746d
 
 
b7103ad
47f2d5a
b7103ad
 
 
 
 
06de0a9
b7103ad
 
47f2d5a
 
 
 
 
8a9746d
 
47f2d5a
8a9746d
 
 
06de0a9
8a9746d
 
 
 
 
 
 
 
 
 
 
06de0a9
8a9746d
 
 
33781fc
8a9746d
33781fc
 
 
 
 
 
 
 
47f2d5a
33781fc
 
 
 
 
 
 
b743033
47f2d5a
33781fc
 
 
47f2d5a
33781fc
 
8a9746d
 
 
06de0a9
47f2d5a
8a9746d
 
 
 
 
 
 
47f2d5a
8a9746d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
47f2d5a
8a9746d
 
 
 
 
 
 
 
47f2d5a
8a9746d

---
license: cc-by-nc-4.0
language:
- en
library_name: pytorch
tags:
- radiology
- chest-ct
- report-evaluation
- score
- medical
- rexval
datasets:
- chest2vec/chest2error-bench
base_model: chest2vec/chest2vec_0.6b
pipeline_tag: text-classification
---

# chest2err — Sentence-grounded Error Score for Chest CT Reports

**chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means the candidate report is perfect; 0.72 means one error; below 0.20 means substantial errors.

The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.

Built on the [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) backbone with LoRA fine-tuning + a 4-layer Transformer decoder. **All backbone and decoder weights are bundled in this repository** — no further downloads are required at inference time.

Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).

## The chest2err-score

```
chest2err_score = exp(−K_total / τ)        # τ = 3.0 (default)
```

where `K_total` is the total number of error tuples emitted by the decoder and `τ` is a display temperature (`score_temperature` in `chest2err_config.json`).

| chest2err-score | K_total | interpretation |
|---:|---:|---|
| **1.00** | 0 | perfect — no errors detected |
| 0.72 | 1 | one error |
| 0.51 | 2 | two errors |
| 0.37 | 3 | substantial errors |
| 0.19 | ≥ 5 | severely degraded |

Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**

The temperature `τ` only rescales the displayed number for human readability — a single error no longer collapses the score. Set `τ=1.0` to recover the original `exp(−K_total)` scale (1 → 0.37, 2 → 0.14). Because `exp(−K_total/τ)` is a strictly monotone function of `K_total` for any `τ>0`, the score is **rank-equivalent to `−K_total`**, so all Kendall τ_b benchmarks transfer unchanged from the count form regardless of `τ`.

## Headline metrics

Evaluated on the 400-pair `chest2error-bench` gold set:

| metric | value |
|---|---|
| Kendall τ_b vs total errors | +0.665 |
| **Kendall τ_b vs Critical errors** (radiologist labels) | **+0.763** |
| Kendall τ_b vs severity-weighted errors (radiologist labels) | +0.734 |
| **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
| Critical-error AUROC | 0.963 |
| MAE of K_total | 1.12 |
| **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |

The τ_b numbers against Critical / severity-weighted errors use the **radiologist's** severity labels in the gold set (the model itself does not output severity in v0.1; see Limitations). They demonstrate that the predicted `K_total` correlates strongly with the human Critical-error count even without an explicit severity head.

For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.

### CXR/CT generalization

| corpus | τ_b vs Critical |
|---|---|
| ReXVal (CXR, n=200) | +0.682 |
| Chest CT (this benchmark, n=400) | **+0.763** |

Most prior metrics lose 0.4–0.7 τ_b crossing from CXR to CT. chest2err is the only metric that *gains* on CT — because it was trained on CT.

## Architecture

| component | spec |
|---|---|
| Backbone | [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) (596 M params, bf16) — fully merged into this repo |
| chest2err LoRA | rank 32, α 64, dropout 0.05 — merged into the backbone weights shipped here |
| Decoder | 4-layer Transformer, 8 heads, FFN 2048 |
| Max decode steps | 24 (hard cap; suffices for max-K=17 observed in radiologist gold) |
| Output tuple | `(cat 1-5, anat 0-8, concept, ref_seg_idx, cand_seg_idx)` |
| Pooling | mean-pool tokens within each sentence; prepend learnable NULL_REF and NULL_CAND vectors per side |

The decoder is **cross-attended** over the concatenated reference + candidate sentence-pool memory `M`. At each step it predicts a tuple where `cat = 0` is the EOS token. Counts emerge as `len(seq) − 1`.

Mean-pooling sentences before the decoder makes the encoder **paraphrase-robust** (inherits chest2vec's contrastive properties) and the decoder **permutation-invariant** with respect to sentence order.

## Files

| file | size | purpose |
|---|---|---|
| `model.safetensors` | ~1.1 GB | merged backbone weights (chest2vec_0.6b + chest2err LoRA, fused) |
| `config.json` | <1 KB | backbone architecture config |
| `decoder.safetensors` | ~207 MB | decoder + null embeddings + heads |
| `chest2err_modeling.py` | 14 KB | decoder architecture (the `CADAD` class) |
| `chest2err.py` | 6 KB | self-contained loader (`chest2err_score`, `chest2err_detail`) |
| `chest2err_config.json` | <1 KB | chest2err model meta-config |
| `tokenizer.json`, `vocab.json`, etc. | ~14 MB | tokenizer files |

Total: ~1.36 GB. Everything required to run chest2err is in this repository.

## Quick start

```python
from chest2err import chest2err_score, chest2err_detail

ref  = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
cand = "[Lungs] Several pulmonary nodules in the left upper lobe."

score = chest2err_score(ref, cand)
# 0.37 — substantial errors (K_total = 3, τ = 3.0)

detail = chest2err_detail(ref, cand)
# detail["score"]           — chest2err-score in (0, 1]
# detail["K_total"]         — integer total error count
# detail["tuples"]          — list of {cat, anat, ref_seg_idx, cand_seg_idx, …}
# detail["category_counts"] — per-category breakdown
# detail["anatomy_counts"]  — per-anatomy breakdown
```

The loader picks up the bundled weights automatically; no extra setup beyond `pip install transformers torch peft safetensors` is needed.

## Output schema

The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total / τ)` with `τ = 3.0` as above). The score is backed by a sequence of structured error tuples:

```python
{
    "cat":          int,  # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
    "anat":         int,  # 0..8 (Lungs & Airways, Pleura, ... Others)
    "concept":      int,  # leaf concept id (clinical finding vocabulary)
    "ref_seg_idx":  int,  # -1 = NULL_REF, otherwise sentence index in reference report
    "cand_seg_idx": int,  # -1 = NULL_CAND, otherwise sentence index in candidate report
}
```

`cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`, and `chest2err_score = exp(−K_total / τ)` with `τ = 3.0`.

## Training data

Trained on `chest2vec/chest2err-train` (in preparation): **53,881 (reference, candidate, labeled_errors)** triples spanning 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (radiologist gold).

### Variant generation (LLM-injected errors)

Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus. For each reference report we prompted **GPT-4o-mini** to produce four candidate variants that **deliberately insert a controlled number of errors** drawn from the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) 6-category error taxonomy. The LLM was instructed to also output, for every inserted error, a structured label:

- **error category** (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison)
- **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
- **target finding concept** (leaf finding from the chest CT vocabulary)

Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model is supervised to *reproduce* this structured error trace given only the (reference, candidate) input.

### Training objective

Supervised teacher-forced training on the LLM-labeled error sequences:

- **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
- **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)

Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive adapter and the chest2err LoRA are merged into the bundled weights here.

### Why this works

- GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us **noiseless K** at training time.
- The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment** with τ_b vs Critical = +0.763.
- Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable** — every emitted error tuple cites its source sentences.

## Limitations

- **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total / τ)` (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
- **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
- **English only.** Trained on English chest CT reports from CT-RATE.
- **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
- **24-error hard cap.** Reports with > 24 errors are clipped (rare; max observed in gold = 17).
- **Single-radiologist gold.** Inter-rater calibration is in progress.

## Citations

If you use chest2err, please cite ReXVal (basis for the taxonomy and endpoint), CT-RATE (source of chest CT reports), and this model:

```bibtex
@misc{rexval2023,
  title     = {{ReXVal}: Radiologist-Verified Evaluation of Automated Radiology Report Metrics},
  author    = {Yu, F. and Endo, M. and Krishnan, R. and others},
  year      = {2023},
  publisher = {PhysioNet},
  url       = {https://physionet.org/content/rexval-dataset/1.0.0/}
}

@misc{hamamci2024ctrate,
  title         = {A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities},
  author        = {Hamamci, Ibrahim Ethem and Er, Sezgin and Almas, Furkan and others},
  year          = {2024},
  eprint        = {2403.17834},
  archivePrefix = {arXiv},
  url           = {https://huggingface.co/datasets/ibrahimhamamci/CT-RATE}
}

@misc{chest2err2026,
  title  = {chest2err: Sentence-grounded Error Score for Chest CT Reports},
  author = {chest2vec contributors},
  year   = {2026},
  url    = {https://huggingface.co/chest2vec/chest2err}
}
```

## Related

- **Backbone:** [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) — the chest2vec encoder this model is built on
- **Eval benchmark:** [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) — radiologist-labeled 400-pair gold set
- **CXR analogue (taxonomy basis):** [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) — Radiologist-Verified Evaluation, chest X-ray (n=200)
- **Source of reference reports:** [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) — chest CT volumes + radiology reports corpus

## License

CC-BY-NC-4.0. Released for research use.