---
license: cc-by-nc-4.0
language:
- en
library_name: pytorch
tags:
- radiology
- chest-ct
- report-evaluation
- score
- medical
- rexval
datasets:
- chest2vec/chest2error-bench
base_model: chest2vec/chest2vec_0.6b
pipeline_tag: text-classification
---

# chest2err: Sentence-grounded Error Score for Chest CT Reports

**chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]**, where higher is better. The score is interpretable: 1.0 means no errors were detected in the candidate report, 0.72 means one error, and 0.19 or below means five or more errors.

The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.

Built on the [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) backbone with LoRA fine-tuning and a 4-layer Transformer decoder. **All backbone and decoder weights are bundled in this repository**, so no further downloads are required at inference time.

Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).

## The chest2err-score

```
chest2err_score = exp(−K_total / τ)     # τ = 3.0 (default)
```

where `K_total` is the total number of error tuples emitted by the decoder and `τ` is a display temperature (`score_temperature` in `chest2err_config.json`).

| chest2err-score | K_total | interpretation |
|---:|---:|---|
| **1.00** | 0 | perfect: no errors detected |
| 0.72 | 1 | one error |
| 0.51 | 2 | two errors |
| 0.37 | 3 | substantial errors |
| ≤ 0.19 | ≥ 5 | severely degraded |

Higher = better.
**Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**

The temperature `τ` only rescales the displayed number for human readability, so that a single error does not collapse the score. Set `τ=1.0` to recover the original `exp(−K_total)` scale (1 error → 0.37, 2 errors → 0.14). Because `exp(−K_total/τ)` is a strictly decreasing function of `K_total` for any `τ>0`, the score is **rank-equivalent to `−K_total`**, so all Kendall τ_b benchmarks transfer unchanged from the count form regardless of `τ`.

## Headline metrics

Evaluated on the 400-pair `chest2error-bench` gold set:

| metric | value |
|---|---|
| Kendall τ_b vs total errors | +0.665 |
| **Kendall τ_b vs Critical errors** (radiologist labels) | **+0.763** |
| Kendall τ_b vs severity-weighted errors (radiologist labels) | +0.734 |
| **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
| Critical-error AUROC | 0.963 |
| MAE of K_total | 1.12 |
| **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |

The τ_b numbers against Critical / severity-weighted errors use the **radiologist's** severity labels in the gold set (the model itself does not output severity in v0.1; see Limitations). They show that the predicted `K_total` correlates strongly with the human Critical-error count even without an explicit severity head.

For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every comparison metric listed here on chest CT by **≥ +0.23 τ_b**.

### CXR/CT generalization

| corpus | τ_b vs Critical |
|---|---|
| ReXVal (CXR, n=200) | +0.682 |
| Chest CT (this benchmark, n=400) | **+0.763** |

Most prior metrics lose 0.4–0.7 τ_b crossing from CXR to CT. chest2err is the only metric that *gains* on CT, because it was trained on CT.
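
The rank-equivalence claim from the score section can be checked numerically. The sketch below is a minimal illustration and is not part of the shipped loader: `score_from_count` and `kendall_tau_b` are hypothetical helpers written for this example, and the predicted and gold counts are made-up toy values, not benchmark data. It reproduces the table values for `τ = 3.0` and compares τ_b computed from `−K_total` with τ_b computed from the score at two temperatures.

```python
import math
from itertools import combinations


def score_from_count(k_total: int, tau: float = 3.0) -> float:
    """exp(-K_total / tau), the formula stated in the score section."""
    return math.exp(-k_total / tau)


def kendall_tau_b(x, y) -> float:
    """Plain O(n^2) Kendall tau-b with tie correction (illustrative helper)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        ties_x += dx == 0          # pair tied on the first variable
        ties_y += dy == 0          # pair tied on the second variable
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Score values for the K_total rows of the table (tau = 3.0).
print([round(score_from_count(k), 2) for k in (0, 1, 2, 3, 5)])
# [1.0, 0.72, 0.51, 0.37, 0.19]

# Toy data (invented for illustration): predicted counts and gold error counts.
pred_k = [0, 1, 1, 3, 5, 2]
gold_errors = [0, 1, 2, 2, 6, 1]
gold_quality = [-g for g in gold_errors]   # higher = better, like the score

tau_counts = kendall_tau_b([-k for k in pred_k], gold_quality)
tau_s3 = kendall_tau_b([score_from_count(k, 3.0) for k in pred_k], gold_quality)
tau_s1 = kendall_tau_b([score_from_count(k, 1.0) for k in pred_k], gold_quality)

# The three values coincide because the score is strictly decreasing in K_total.
print(tau_counts == tau_s3 == tau_s1)  # True
```

The hand-written τ_b is used only to keep the example dependency-free; a library implementation of Kendall's τ_b would serve equally well.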
## Architecture

| component | spec |
|---|---|
| Backbone | [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) (596 M params, bf16), fully merged into this repo |
| chest2err LoRA | rank 32, α 64, dropout 0.05; merged into the backbone weights shipped here |
| Decoder | 4-layer Transformer, 8 heads, FFN 2048 |
| Max decode steps | 24 (hard cap; suffices for max-K=17 observed in radiologist gold) |
| Output tuple | `(cat 1-5, anat 0-8, concept, ref_seg_idx, cand_seg_idx)` |
| Pooling | mean-pool tokens within each sentence; prepend learnable NULL_REF and NULL_CAND vectors per side |

The decoder **cross-attends** over the concatenated reference + candidate sentence-pool memory `M`. At each step it predicts a tuple; `cat = 0` is the EOS token. Counts emerge as `len(seq) − 1`.

Mean-pooling sentences before the decoder makes the encoder **paraphrase-robust** (it inherits chest2vec's contrastive properties) and the decoder **permutation-invariant** with respect to sentence order.

## Files

| file | size | purpose |
|---|---|---|
| `model.safetensors` | ~1.1 GB | merged backbone weights (chest2vec_0.6b + chest2err LoRA, fused) |
| `config.json` | <1 KB | backbone architecture config |
| `decoder.safetensors` | ~207 MB | decoder + null embeddings + heads |
| `chest2err_modeling.py` | 14 KB | decoder architecture (the `CADAD` class) |
| `chest2err.py` | 6 KB | self-contained loader (`chest2err_score`, `chest2err_detail`) |
| `chest2err_config.json` | <1 KB | chest2err model meta-config |
| `tokenizer.json`, `vocab.json`, etc. | ~14 MB | tokenizer files |

Total: ~1.36 GB. Everything required to run chest2err is in this repository.

## Quick start

```python
from chest2err import chest2err_score, chest2err_detail

ref  = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
cand = "[Lungs] Several pulmonary nodules in the left upper lobe."

score = chest2err_score(ref, cand)   # 0.37: substantial errors (K_total = 3, τ = 3.0)

detail = chest2err_detail(ref, cand)
# detail["score"]            : chest2err-score in (0, 1]
# detail["K_total"]          : integer total error count
# detail["tuples"]           : list of {cat, anat, ref_seg_idx, cand_seg_idx, …}
# detail["category_counts"]  : per-category breakdown
# detail["anatomy_counts"]   : per-anatomy breakdown
```

The loader picks up the bundled weights automatically; no extra setup beyond `pip install transformers torch peft safetensors` is needed.

## Output schema

The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total / τ)` with `τ = 3.0` as above). The score is backed by a sequence of structured error tuples:

```python
{
    "cat": int,           # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
    "anat": int,          # 0..8 (Lungs & Airways, Pleura, ... Others)
    "concept": int,       # leaf concept id (clinical finding vocabulary)
    "ref_seg_idx": int,   # -1 = NULL_REF, otherwise sentence index in reference report
    "cand_seg_idx": int,  # -1 = NULL_CAND, otherwise sentence index in candidate report
}
```

`cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`, where the `− 1` discounts the terminating EOS tuple, and `chest2err_score = exp(−K_total / τ)` with `τ = 3.0`.

## Training data

Trained on `chest2vec/chest2err-train` (in preparation): **53,881 (reference, candidate, labeled_errors)** triples spanning 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (radiologist gold).

### Variant generation (LLM-injected errors)

Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus.
For each reference report we prompted **GPT-4o-mini** to produce four candidate variants that **deliberately insert a controlled number of errors** drawn from the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) 6-category error taxonomy. The LLM was instructed to also output, for every inserted error, a structured label:

- **error category** (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison)
- **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
- **target finding concept** (leaf finding from the chest CT vocabulary)

Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model is supervised to *reproduce* this structured error trace given only the (reference, candidate) input.

### Training objective

Supervised teacher-forced training on the LLM-labeled error sequences:

- **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
- **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)

Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive adapter and the chest2err LoRA are merged into the bundled weights here.

### Why this works

- GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us **noiseless K** at training time.
- The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment** with τ_b vs Critical = +0.763.
- Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable**: every emitted error tuple cites its source sentences.
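
The teacher-forced supervision described in the training objective can be pictured as turning each labeled record into a target tuple sequence that ends with the EOS marker (`cat = 0`), then asking the decoder to predict each tuple given the gold prefix before it. The sketch below is a plain-Python illustration under stated assumptions, not the training code: `ErrorTuple`, `build_target_sequence`, and `teacher_forcing_pairs` are hypothetical names, the EOS tuple's fields other than `cat` are placeholders, and the category, anatomy, concept, and pointer values are invented for the example.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

EOS_CAT = 0    # cat == 0 marks end of sequence (per the output schema)
NULL_IDX = -1  # -1 stands for NULL_REF / NULL_CAND (per the output schema)


@dataclass(frozen=True)
class ErrorTuple:
    cat: int
    anat: int
    concept: int
    ref_seg_idx: int
    cand_seg_idx: int


def build_target_sequence(errors: List[ErrorTuple]) -> List[ErrorTuple]:
    """Append an EOS tuple to the labeled errors.

    Assumption: the non-`cat` fields of the EOS tuple are filler values.
    """
    eos = ErrorTuple(EOS_CAT, 0, 0, NULL_IDX, NULL_IDX)
    return list(errors) + [eos]


def teacher_forcing_pairs(
    seq: List[ErrorTuple],
) -> Iterator[Tuple[List[ErrorTuple], ErrorTuple]]:
    """Yield (gold prefix, next tuple) for every decoder step, EOS included."""
    for step in range(len(seq)):
        yield seq[:step], seq[step]


# Invented labels for illustration; ids and pointer choices are assumptions.
labeled_errors = [
    ErrorTuple(cat=1, anat=0, concept=17, ref_seg_idx=0, cand_seg_idx=0),
    ErrorTuple(cat=2, anat=1, concept=42, ref_seg_idx=1, cand_seg_idx=NULL_IDX),
]

target = build_target_sequence(labeled_errors)
pairs = list(teacher_forcing_pairs(target))
k_total = len(target) - 1  # the EOS tuple is not counted as an error

print(k_total, len(pairs))  # 2 3
```

Each `(prefix, next tuple)` pair corresponds to one decoder step at which the per-step head losses and pointer losses would be applied; how those losses are weighted is not specified here.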
## Limitations

- **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total / τ)` (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
- **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate without a reference.
- **English only.** Trained on English chest CT reports from CT-RATE.
- **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
- **24-error hard cap.** Reports with more than 24 errors are clipped (rare; max observed in gold = 17).
- **Single-radiologist gold.** Inter-rater calibration is in progress.

## Citations

If you use chest2err, please cite ReXVal (basis for the taxonomy and endpoint), CT-RATE (source of chest CT reports), and this model:

```bibtex
@misc{rexval2023,
  title     = {{ReXVal}: Radiologist-Verified Evaluation of Automated Radiology Report Metrics},
  author    = {Yu, F. and Endo, M. and Krishnan, R. and others},
  year      = {2023},
  publisher = {PhysioNet},
  url       = {https://physionet.org/content/rexval-dataset/1.0.0/}
}

@misc{hamamci2024ctrate,
  title         = {A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities},
  author        = {Hamamci, Ibrahim Ethem and Er, Sezgin and Almas, Furkan and others},
  year          = {2024},
  eprint        = {2403.17834},
  archivePrefix = {arXiv},
  url           = {https://huggingface.co/datasets/ibrahimhamamci/CT-RATE}
}

@misc{chest2err2026,
  title  = {chest2err: Sentence-grounded Error Score for Chest CT Reports},
  author = {chest2vec contributors},
  year   = {2026},
  url    = {https://huggingface.co/chest2vec/chest2err}
}
```

## Related

- **Backbone:** [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b), the chest2vec encoder this model is built on
- **Eval benchmark:** [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench), the radiologist-labeled 400-pair gold set
- **CXR analogue (taxonomy basis):** [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/), Radiologist-Verified Evaluation, chest X-ray (n=200)
- **Source of reference reports:** [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE), a corpus of chest CT volumes and radiology reports

## License

CC-BY-NC-4.0. Released for research use.