| --- |
| license: cc-by-nc-4.0 |
| language: |
| - en |
| library_name: pytorch |
| tags: |
| - radiology |
| - chest-ct |
| - report-evaluation |
| - score |
| - medical |
| - rexval |
| datasets: |
| - chest2vec/chest2error-bench |
| base_model: chest2vec/chest2vec_0.6b |
| pipeline_tag: text-classification |
| --- |
| |
| # chest2err: Sentence-grounded Error Score for Chest CT Reports |
|
|
| **chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means no errors were detected in the candidate report; 0.72 means one error; 0.37 means three errors; below 0.20 means five or more errors. |
|
|
| The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations. |
|
|
| Built on the [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) backbone with LoRA fine-tuning and a 4-layer Transformer decoder. **All backbone and decoder weights are bundled in this repository**, so no further downloads are required at inference time. |
|
|
| Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience). |
|
|
| ## The chest2err-score |
|
|
| ``` |
| chest2err_score = exp(−K_total / τ) # τ = 3.0 (default) |
| ``` |
|
|
| where `K_total` is the total number of error tuples emitted by the decoder and `τ` is a display temperature (`score_temperature` in `chest2err_config.json`). |
|
|
| | chest2err-score | K_total | interpretation | |
| |---:|---:|---| |
| | **1.00** | 0 | perfect β no errors detected | |
| | 0.72 | 1 | one error | |
| | 0.51 | 2 | two errors | |
| | 0.37 | 3 | substantial errors | |
| | 0.19 | β₯ 5 | severely degraded | |
| |
| Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].** |
| |
| The temperature `τ` only rescales the displayed number for human readability, so that a single error no longer collapses the score. Set `τ=1.0` to recover the original `exp(−K_total)` scale (1 → 0.37, 2 → 0.14). Because `exp(−K_total/τ)` is a strictly decreasing function of `K_total` for any `τ>0`, the score is **rank-equivalent to `−K_total`**, so all Kendall τ_b benchmarks transfer unchanged from the count form regardless of `τ`. |
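
The mapping from error count to score is small enough to reproduce by hand. The sketch below is not the repository's implementation: the function name `score_from_count` is hypothetical, and only the formula and the default `τ = 3.0` are taken from this card.

```python
import math


def score_from_count(k_total: int, tau: float = 3.0) -> float:
    """Map an error count to a score in (0, 1] via exp(-K_total / tau)."""
    if k_total < 0:
        raise ValueError("k_total must be non-negative")
    if tau <= 0:
        raise ValueError("tau must be positive")
    return math.exp(-k_total / tau)


# Reproduce the table rows for the default temperature.
table = {k: round(score_from_count(k), 2) for k in (0, 1, 2, 3, 5)}

# Any tau > 0 yields the same ordering of reports by error count.
counts = [4, 0, 2, 7, 1]
order_default = sorted(counts, key=lambda k: -score_from_count(k, tau=3.0))
order_raw = sorted(counts, key=lambda k: -score_from_count(k, tau=1.0))
```

Both orderings come out identical, which is the rank-equivalence property stated above: changing `τ` changes the displayed value but never which of two reports scores higher.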
| |
| ## Headline metrics |
| |
| Evaluated on the 400-pair `chest2error-bench` gold set: |
| |
| | metric | value | |
| |---|---| |
| Kendall τ_b vs total errors | +0.665 |
| **Kendall τ_b vs Critical errors** (radiologist labels) | **+0.763** |
| Kendall τ_b vs severity-weighted errors (radiologist labels) | +0.734 |
| | **Pairwise within-anchor accuracy** | **0.958** (n=1020) | |
| | Critical-error AUROC | 0.963 | |
| | MAE of K_total | 1.12 | |
| **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
|
|
| The τ_b numbers against Critical / severity-weighted errors use the **radiologist's** severity labels in the gold set (the model itself does not output severity in v0.1; see Limitations). They demonstrate that the predicted `K_total` correlates strongly with the human Critical-error count even without an explicit severity head. |
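
For readers who want to check a rank correlation of this kind on their own data, here is a dependency-free sketch of Kendall τ_b with tie correction. It is not the evaluation script behind the table above; the function name and the toy numbers are made up for illustration.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length sequences, with tie correction."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + ties_y_only
    untied_in_y = concordant + discordant + ties_x_only
    denom = math.sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data (illustrative only): predicted K_total vs. a gold error count.
k_pred = [0, 1, 1, 3, 5, 2]
gold = [0, 1, 2, 2, 4, 1]

tau_from_counts = kendall_tau_b(k_pred, gold)
# Correlating the negated score gives the same value for any temperature.
neg_scores = [-math.exp(-k / 3.0) for k in k_pred]
tau_from_scores = kendall_tau_b(neg_scores, gold)
```

The two values agree because the score is a strictly decreasing function of `K_total`, so every pair of reports is ordered the same way by `K_total` and by the negated score.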
|
|
| For comparison on the same benchmark (Kendall τ_b): BLEU = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err exceeds each of these metrics on chest CT by **≥ 0.23 τ_b**. |
| |
| ### CXR/CT generalization |
| |
| corpus | τ_b vs Critical |
| |---|---| |
| | ReXVal (CXR, n=200) | +0.682 | |
| | Chest CT (this benchmark, n=400) | **+0.763** | |
|
|
| Most prior metrics lose 0.4–0.7 τ_b crossing from CXR to CT. chest2err is the only metric that *gains* on CT, because it was trained on CT. |
| |
| ## Architecture |
| |
| | component | spec | |
| |---|---| |
| Backbone | [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) (596 M params, bf16), fully merged into this repo |
| chest2err LoRA | rank 32, α 64, dropout 0.05; merged into the backbone weights shipped here |
| | Decoder | 4-layer Transformer, 8 heads, FFN 2048 | |
| | Max decode steps | 24 (hard cap; suffices for max-K=17 observed in radiologist gold) | |
| | Output tuple | `(cat 1-5, anat 0-8, concept, ref_seg_idx, cand_seg_idx)` | |
| | Pooling | mean-pool tokens within each sentence; prepend learnable NULL_REF and NULL_CAND vectors per side | |
| |
The decoder **cross-attends** over the concatenated reference + candidate sentence-pool memory `M`. At each step it predicts one tuple; a tuple with `cat = 0` is the EOS token. The error count is `len(seq) − 1`, where `seq` is the decoded sequence including the final EOS step.
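
The stopping rule can be illustrated with a short greedy loop. This is a sketch under stated assumptions, not the code in `chest2err_modeling.py`: `step_fn` is a hypothetical stand-in for one decoder forward pass, and the cap is assumed to count decoder steps.

```python
from typing import Callable, Dict, List

EOS_CAT = 0            # cat == 0 ends the sequence (from this card)
MAX_DECODE_STEPS = 24  # hard cap on decoder steps (from this card)

ErrorTuple = Dict[str, int]


def greedy_decode(step_fn: Callable[[List[ErrorTuple]], ErrorTuple],
                  max_steps: int = MAX_DECODE_STEPS) -> List[ErrorTuple]:
    """Collect error tuples until EOS is emitted or the step cap is reached."""
    errors: List[ErrorTuple] = []
    for _ in range(max_steps):
        tup = step_fn(errors)  # hypothetical: one decoder step given the prefix
        if tup["cat"] == EOS_CAT:
            break
        errors.append(tup)
    return errors


# Toy stand-in for the decoder: emits two errors, then EOS.
script = [
    {"cat": 1, "anat": 0, "concept": 12, "ref_seg_idx": 0, "cand_seg_idx": 0},
    {"cat": 2, "anat": 1, "concept": 40, "ref_seg_idx": 1, "cand_seg_idx": -1},
    {"cat": 0, "anat": 0, "concept": 0, "ref_seg_idx": -1, "cand_seg_idx": -1},
]
errors = greedy_decode(lambda prefix: script[len(prefix)])
k_total = len(errors)  # same as len(seq) - 1 when seq includes the EOS step
```

Because the EOS step is dropped before counting, `len(errors)` here corresponds to `len(seq) − 1` in the description above.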
| |
| Mean-pooling sentences before the decoder makes the encoder **paraphrase-robust** (inherits chest2vec's contrastive properties) and the decoder **permutation-invariant** with respect to sentence order. |
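
The construction of the memory `M` can be sketched without any tensor library. The helper names and the toy 2-dimensional vectors below are hypothetical; the real model pools backbone token embeddings and uses learned NULL vectors.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_vecs: Sequence[Vector]) -> Vector:
    """Average the token vectors of one sentence into a single vector."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]


def build_side_memory(sentences: Sequence[Sequence[Vector]],
                      null_vec: Vector) -> List[Vector]:
    """Prepend this side's NULL vector, then add one pooled vector per sentence."""
    return [list(null_vec)] + [mean_pool(tokens) for tokens in sentences]


# Toy token embeddings: two reference sentences, one candidate sentence.
ref_tokens = [[[1.0, 0.0], [3.0, 2.0]], [[0.0, 4.0]]]
cand_tokens = [[[2.0, 2.0], [4.0, 6.0]]]
null_ref, null_cand = [0.0, 0.0], [0.5, 0.5]  # placeholders for learned vectors

# Memory M: reference side followed by candidate side.
memory = (build_side_memory(ref_tokens, null_ref)
          + build_side_memory(cand_tokens, null_cand))
```

Under this layout, which is an assumption about indexing, a pointer value of `-1` would select the NULL slot of its side and a sentence index `i` would select slot `i + 1`.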
| |
| ## Files |
| |
| | file | size | purpose | |
| |---|---|---| |
| | `model.safetensors` | ~1.1 GB | merged backbone weights (chest2vec_0.6b + chest2err LoRA, fused) | |
| | `config.json` | <1 KB | backbone architecture config | |
| | `decoder.safetensors` | ~207 MB | decoder + null embeddings + heads | |
| | `chest2err_modeling.py` | 14 KB | decoder architecture (the `CADAD` class) | |
| | `chest2err.py` | 6 KB | self-contained loader (`chest2err_score`, `chest2err_detail`) | |
| | `chest2err_config.json` | <1 KB | chest2err model meta-config | |
| | `tokenizer.json`, `vocab.json`, etc. | ~14 MB | tokenizer files | |
|
|
| Total: ~1.36 GB. Everything required to run chest2err is in this repository. |
|
|
| ## Quick start |
|
|
| ```python |
| from chest2err import chest2err_score, chest2err_detail |
| |
| ref = "[Lungs] No pulmonary nodules. [Pleura] No effusion." |
| cand = "[Lungs] Several pulmonary nodules in the left upper lobe." |
| |
| score = chest2err_score(ref, cand) |
| # 0.37 → substantial errors (K_total = 3, τ = 3.0) |
| |
| detail = chest2err_detail(ref, cand) |
| # detail["score"] → chest2err-score in (0, 1] |
| # detail["K_total"] → integer total error count |
| # detail["tuples"] → list of {cat, anat, ref_seg_idx, cand_seg_idx, …} |
| # detail["category_counts"] → per-category breakdown |
| # detail["anatomy_counts"] → per-anatomy breakdown |
| ``` |
|
|
| The loader picks up the bundled weights automatically; no extra setup beyond `pip install transformers torch peft safetensors` is needed. |
|
|
| ## Output schema |
|
|
| The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total / τ)` with `τ = 3.0` as above). The score is backed by a sequence of structured error tuples: |
|
|
| ```python |
| { |
| "cat": int, # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison) |
| "anat": int, # 0..8 (Lungs & Airways, Pleura, ... Others) |
| "concept": int, # leaf concept id (clinical finding vocabulary) |
| "ref_seg_idx": int, # -1 = NULL_REF, otherwise sentence index in reference report |
| "cand_seg_idx": int, # -1 = NULL_CAND, otherwise sentence index in candidate report |
| } |
| ``` |
|
|
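| `cat == 0` is the EOS marker; the model stops when it emits it. `K_total` is the number of tuples emitted before EOS (equivalently `len(seq) − 1` when the decoded sequence `seq` includes the EOS step), and `chest2err_score = exp(−K_total / τ)` with `τ = 3.0`. |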
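
As an illustration of how the tuple schema turns into a score and a breakdown, the following sketch post-processes a list of tuples. The function `summarize` and its output keys are hypothetical and are not part of `chest2err.py`; the category id-to-name order is assumed to follow the listing in the schema comment.

```python
import math
from collections import Counter
from typing import Dict, List

# Assumed id order, following the schema comment on this card.
CATEGORY_NAMES = {
    1: "false_prediction",
    2: "omission",
    3: "location",
    4: "severity",
    5: "comparison",
}


def summarize(tuples: List[Dict[str, int]], tau: float = 3.0) -> Dict[str, object]:
    """Count errors, compute the score, and tally categories and NULL pointers."""
    errors = [t for t in tuples if t["cat"] != 0]  # drop the EOS step if present
    k_total = len(errors)
    return {
        "K_total": k_total,
        "score": math.exp(-k_total / tau),
        "category_counts": dict(Counter(CATEGORY_NAMES[t["cat"]] for t in errors)),
        "null_ref_count": sum(t["ref_seg_idx"] == -1 for t in errors),
        "null_cand_count": sum(t["cand_seg_idx"] == -1 for t in errors),
    }


# Toy decoded sequence: three errors followed by EOS.
tuples = [
    {"cat": 1, "anat": 0, "concept": 12, "ref_seg_idx": 0, "cand_seg_idx": 0},
    {"cat": 1, "anat": 0, "concept": 12, "ref_seg_idx": -1, "cand_seg_idx": 0},
    {"cat": 2, "anat": 1, "concept": 40, "ref_seg_idx": 1, "cand_seg_idx": -1},
    {"cat": 0, "anat": 0, "concept": 0, "ref_seg_idx": -1, "cand_seg_idx": -1},
]
summary = summarize(tuples)
```

Filtering on `cat != 0` makes the helper work whether or not the caller's list still contains the EOS step.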
|
|
| ## Training data |
|
|
| Trained on `chest2vec/chest2err-train` (in preparation): **53,881 (reference, candidate, labeled_errors)** triples spanning 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (radiologist gold). |
| |
| ### Variant generation (LLM-injected errors) |
| |
| Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus. For each reference report we prompted **GPT-4o-mini** to produce four candidate variants that **deliberately insert a controlled number of errors** drawn from the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) 6-category error taxonomy. The LLM was instructed to also output, for every inserted error, a structured label: |
| |
| - **error category** (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison) |
| - **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others) |
| - **target finding concept** (leaf finding from the chest CT vocabulary) |
|
|
| Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model is supervised to *reproduce* this structured error trace given only the (reference, candidate) input. |
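
One way to picture such a record is as a small typed structure. The class names and example values below are hypothetical, and the mapping from the 6-category training labels to the 5-category output (both comparison categories collapsing to id 5) is an assumption inferred from the output schema, not a documented rule.

```python
from dataclasses import dataclass
from typing import List

REXVAL_6 = {
    1: "false_prediction",
    2: "omission",
    3: "location",
    4: "severity",
    5: "spurious_comparison",
    6: "omitted_comparison",
}


def merge_category(cat6: int) -> int:
    """Assumed merge: categories 5 and 6 both map to the single 'comparison' id 5."""
    if cat6 not in REXVAL_6:
        raise ValueError(f"unknown category id: {cat6}")
    return min(cat6, 5)


@dataclass(frozen=True)
class ErrorLabel:
    category: int  # 1..6, ReXVal taxonomy
    anatomy: int   # 0..8, anatomy section index
    concept: int   # leaf finding id (vocabulary not shown on this card)


@dataclass(frozen=True)
class TrainingExample:
    reference: str
    candidate: str
    errors: List[ErrorLabel]

    @property
    def k_total(self) -> int:
        """Number of injected errors, used as the count target."""
        return len(self.errors)


example = TrainingExample(
    reference="[Lungs] No pulmonary nodules. [Pleura] No effusion.",
    candidate="[Lungs] Several pulmonary nodules in the left upper lobe.",
    errors=[ErrorLabel(category=1, anatomy=0, concept=12),
            ErrorLabel(category=6, anatomy=1, concept=40)],
)
merged = [merge_category(e.category) for e in example.errors]
```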
|
|
| ### Training objective |
|
|
| Supervised teacher-forced training on the LLM-labeled error sequences: |
|
|
| - **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step |
| - **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to) |
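
The two loss families above can be combined per decoder step as a sum of cross-entropies. The sketch below is a simplified illustration, not the training code: equal head weights, the head names, and the convention that pointer targets index memory slots are all assumptions, and any masking of heads on the EOS step is omitted. Under teacher forcing, the logits for each step would be computed from the gold prefix.

```python
import math
from typing import Dict, List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-softmax probability of the target class (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


# Three token heads plus two pointer heads (names are hypothetical).
HEADS = ("cat", "anat", "concept", "ref_ptr", "cand_ptr")


def sequence_loss(step_logits: List[Dict[str, Sequence[float]]],
                  step_targets: List[Dict[str, int]]) -> float:
    """Mean over decoder steps of the summed per-head cross-entropies."""
    total = 0.0
    for logits, targets in zip(step_logits, step_targets):
        total += sum(cross_entropy(logits[h], targets[h]) for h in HEADS)
    return total / len(step_targets)


# One toy step with uniform logits: each head contributes log(num_classes).
step_logits = [{
    "cat": [0.0] * 6,
    "anat": [0.0] * 9,
    "concept": [0.0] * 4,
    "ref_ptr": [0.0] * 3,
    "cand_ptr": [0.0] * 2,
}]
step_targets = [{"cat": 1, "anat": 0, "concept": 2, "ref_ptr": 1, "cand_ptr": 0}]
loss = sequence_loss(step_logits, step_targets)
```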
|
|
| Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive adapter and the chest2err LoRA are merged into the bundled weights here. |
| |
| ### Why this works |
| |
| - GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us **noiseless K** at training time. |
| - The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment** with τ_b vs Critical = +0.763. |
| - Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable**: every emitted error tuple cites its source sentences. |
|
|
| ## Limitations |
|
|
| - **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total / τ)` (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap. |
| - **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate without a reference. |
| - **English only.** Trained on English chest CT reports from CT-RATE. |
| - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated. |
| - **24-error hard cap.** Reports with > 24 errors are clipped (rare; max observed in gold = 17). |
| - **Single-radiologist gold.** Inter-rater calibration is in progress. |
|
|
| ## Citations |
|
|
| If you use chest2err, please cite ReXVal (basis for the taxonomy and endpoint), CT-RATE (source of chest CT reports), and this model: |
|
|
| ```bibtex |
| @misc{rexval2023, |
| title = {Radiology Report Expert Evaluation ({ReXVal}) Dataset}, |
| author = {Yu, F. and Endo, M. and Krishnan, R. and others}, |
| year = {2023}, |
| publisher = {PhysioNet}, |
| url = {https://physionet.org/content/rexval-dataset/1.0.0/} |
| } |
| |
| @misc{hamamci2024ctrate, |
| title = {A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities}, |
| author = {Hamamci, Ibrahim Ethem and Er, Sezgin and Almas, Furkan and others}, |
| year = {2024}, |
| eprint = {2403.17834}, |
| archivePrefix = {arXiv}, |
| url = {https://huggingface.co/datasets/ibrahimhamamci/CT-RATE} |
| } |
| |
| @misc{chest2err2026, |
| title = {chest2err: Sentence-grounded Error Score for Chest CT Reports}, |
| author = {chest2vec contributors}, |
| year = {2026}, |
| url = {https://huggingface.co/chest2vec/chest2err} |
| } |
| ``` |
|
|
| ## Related |
|
|
| - **Backbone:** [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b): the chest2vec encoder this model is built on |
| - **Eval benchmark:** [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench): radiologist-labeled 400-pair gold set |
| - **CXR analogue (taxonomy basis):** [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/): Radiology Report Expert Evaluation dataset, chest X-ray (n=200) |
| - **Source of reference reports:** [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE): chest CT volumes + radiology reports corpus |
|
|
| ## License |
|
|
| CC-BY-NC-4.0. Released for research use. |
|
|