	---
	license: cc-by-nc-4.0
	language:
	- en
	library_name: pytorch
	tags:
	- radiology
	- chest-ct
	- report-evaluation
	- score
	- medical
	- rexval
	datasets:
	- chest2vec/chest2error-bench
	base_model: chest2vec/chest2vec_0.6b
	pipeline_tag: text-classification
	---

	# chest2err — Sentence-grounded Error Score for Chest CT Reports

	chest2err is a sentence-grounded autoregressive evaluator that, given a (reference, candidate) chest CT report pair, outputs a single chest2err-score ∈ (0, 1], where higher is better. The score is interpretable at the default temperature: 1.0 means no errors were detected in the candidate report; 0.72 means one error; 0.19 or lower means five or more errors.

	The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the specific reference sentence and candidate sentence that triggered it, so the score comes with built-in explanations.

	Built on the [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) backbone with LoRA fine-tuning + a 4-layer Transformer decoder. All backbone and decoder weights are bundled in this repository — no further downloads are required at inference time.

	Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).

	## The chest2err-score

	```
	chest2err_score = exp(−K_total / τ) # τ = 3.0 (default)
	```

	where `K_total` is the total number of error tuples emitted by the decoder and `τ` is a display temperature (`score_temperature` in `chest2err_config.json`).

	| chest2err-score | K_total | interpretation |
	|---:|---:|---|
	| 1.00 | 0 | perfect: no errors detected |
	| 0.72 | 1 | one error |
	| 0.51 | 2 | two errors |
	| 0.37 | 3 | substantial errors |
	| ≤ 0.19 | ≥ 5 | severely degraded |

	Higher = better. Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].

	The temperature `τ` only rescales the displayed number for human readability — a single error no longer collapses the score. Set `τ=1.0` to recover the original `exp(−K_total)` scale (1 → 0.37, 2 → 0.14). Because `exp(−K_total/τ)` is a strictly monotone function of `K_total` for any `τ>0`, the score is rank-equivalent to `−K_total`, so all Kendall τ_b benchmarks transfer unchanged from the count form regardless of `τ`.
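As a quick check of the numbers above, here is a minimal sketch of the count-to-score mapping. The function name `chest2err_score_from_count` is hypothetical and is not part of the bundled loader; only the formula `exp(−K_total / τ)` and the default `τ = 3.0` come from this card.

```python
import math

DEFAULT_TAU = 3.0  # default display temperature quoted in this card


def chest2err_score_from_count(k_total: int, tau: float = DEFAULT_TAU) -> float:
    """Map an error count to exp(-k_total / tau). Hypothetical helper name."""
    if k_total < 0:
        raise ValueError("k_total must be non-negative")
    if tau <= 0:
        raise ValueError("tau must be positive")
    return math.exp(-k_total / tau)


def rank_by_score(counts, tau):
    """Indices of `counts` ordered from best (highest score) to worst."""
    return sorted(range(len(counts)),
                  key=lambda i: -chest2err_score_from_count(counts[i], tau))


counts = [0, 1, 2, 3, 5]
default_scale = [round(chest2err_score_from_count(k), 2) for k in counts]
original_scale = [round(chest2err_score_from_count(k, tau=1.0), 2) for k in counts]

print(default_scale)   # scores at tau = 3.0
print(original_scale)  # scores at tau = 1.0

# The ordering of reports does not depend on tau.
toy_counts = [3, 0, 5, 1]
print(rank_by_score(toy_counts, 3.0) == rank_by_score(toy_counts, 1.0))
```

Because only the ordering matters for rank correlations, changing `τ` changes the displayed values but not which report ranks above which.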

	## Headline metrics

	Evaluated on the 400-pair `chest2error-bench` gold set:

	| metric | value |
	|---|---|
	| Kendall τ_b vs total errors | +0.665 |
	| Kendall τ_b vs Critical errors (radiologist labels) | +0.763 |
	| Kendall τ_b vs severity-weighted errors (radiologist labels) | +0.734 |
	| Pairwise within-anchor accuracy | 0.958 (n=1020) |
	| Critical-error AUROC | 0.963 |
	| MAE of K_total | 1.12 |
	| chest2err-score on GT-S ↔ GT-U equivalence pairs | 1.00 ± 0.00 (perfect content-equivalence recognition) |

	The τ_b numbers against Critical / severity-weighted errors use the radiologist's severity labels in the gold set (the model itself does not output severity in v0.1; see Limitations). They demonstrate that the predicted `K_total` correlates strongly with the human Critical-error count even without an explicit severity head.
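For readers who want to reproduce this kind of correlation on their own predictions, the sketch below implements Kendall τ_b in plain Python. The data are toy values invented for illustration, not benchmark numbers, and correlating the predicted count (or the negated score) against the human count is an assumed sign convention.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b with tie correction: (C - D) / sqrt((n0 - n1) * (n0 - n2))."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied in x (n1)
        if dy == 0:
            ties_y += 1  # pair tied in y (n2)
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n0 = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n0 - ties_x) * (n0 - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy example only: predicted K_total vs. a human error count.
predicted_k = [0, 1, 1, 3, 5, 2]
human_count = [0, 1, 0, 2, 4, 2]

tau_from_count = kendall_tau_b(predicted_k, human_count)
# The score decreases as K grows, so negate it to keep the same sign.
neg_scores = [-math.exp(-k / 3.0) for k in predicted_k]
tau_from_score = kendall_tau_b(neg_scores, human_count)

print(tau_from_count, tau_from_score)
```

The two values agree because the score is a strictly decreasing function of the count, so every pairwise comparison keeps its direction after negation.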

	For comparison on the same benchmark, prior metrics reach τ_b of +0.235 (BLEU), +0.254 (BERTScore), +0.232 (RadGraph), +0.239 (RadCliQ), +0.047 (GREEN), and +0.530 (CRIMSON-GPT with gpt-5.2). chest2err beats every prior radiology evaluation metric on chest CT by ≥ +0.23 τ_b; the smallest margin is over CRIMSON-GPT (+0.763 vs +0.530).

	### CXR/CT generalization

	| corpus | τ_b vs Critical |
	|---|---|
	| ReXVal (CXR, n=200) | +0.682 |
	| Chest CT (this benchmark, n=400) | +0.763 |

	Most prior metrics lose 0.4–0.7 τ_b crossing from CXR to CT. chest2err is the only metric that gains on CT — because it was trained on CT.

	## Architecture

	| component | spec |
	|---|---|
	| Backbone | [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) (596 M params, bf16), fully merged into this repo |
	| chest2err LoRA | rank 32, α 64, dropout 0.05; merged into the backbone weights shipped here |
	| Decoder | 4-layer Transformer, 8 heads, FFN 2048 |
	| Max decode steps | 24 (hard cap; suffices for max-K=17 observed in radiologist gold) |
	| Output tuple | `(cat 1-5, anat 0-8, concept, ref_seg_idx, cand_seg_idx)` |
	| Pooling | mean-pool tokens within each sentence; prepend learnable NULL_REF and NULL_CAND vectors per side |

	The decoder cross-attends over the concatenated reference + candidate sentence-pool memory `M`. At each step it predicts one tuple, with `cat = 0` reserved as the EOS token. The error count is `len(seq) − 1`, where `seq` is the decoded sequence including the final EOS step.
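The stopping rule can be sketched as a small loop. This is an illustration under stated assumptions, not the bundled decoder: `greedy_decode` and `scripted_step` are hypothetical names, the step function stands in for the real model call, and the `concept` ids are placeholders.

```python
from typing import Callable, Dict, List

EOS_CAT = 0            # `cat = 0` marks end of sequence
MAX_DECODE_STEPS = 24  # hard cap from the architecture table

ErrorTuple = Dict[str, int]


def greedy_decode(step_fn: Callable[[List[ErrorTuple]], ErrorTuple],
                  max_steps: int = MAX_DECODE_STEPS) -> List[ErrorTuple]:
    """Call step_fn on the tuples emitted so far until EOS or the cap.

    Only error tuples are returned (the EOS step is dropped), so the
    error count is simply the length of the returned list.
    """
    emitted: List[ErrorTuple] = []
    for _ in range(max_steps):
        nxt = step_fn(emitted)
        if nxt["cat"] == EOS_CAT:
            break
        emitted.append(nxt)
    return emitted


def scripted_step(prefix: List[ErrorTuple]) -> ErrorTuple:
    """Stand-in for the model: emits two errors, then EOS."""
    script = [
        {"cat": 1, "anat": 0, "concept": 0, "ref_seg_idx": 0, "cand_seg_idx": 0},
        {"cat": 2, "anat": 1, "concept": 0, "ref_seg_idx": 1, "cand_seg_idx": -1},
    ]
    return script[len(prefix)] if len(prefix) < len(script) else {"cat": EOS_CAT}


errors = greedy_decode(scripted_step)
print(len(errors))  # number of errors before EOS
```

A step function that never emits EOS is cut off at the cap, which is the clipping behavior described under Limitations.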

	Mean-pooling sentences before the decoder makes the encoder paraphrase-robust (inherits chest2vec's contrastive properties) and the decoder permutation-invariant with respect to sentence order.
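To make the memory layout concrete, here is a plain-Python sketch of sentence pooling and NULL-vector prepending. The row arithmetic (pointer index plus one on each side, candidate rows offset by the reference side) is an assumption about how `-1` pointers map onto `M`, and all function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_vectors: Sequence[Vector]) -> Vector:
    """Average the token embeddings of one sentence, dimension by dimension."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]


def build_memory(ref_sentences, cand_sentences, null_ref, null_cand):
    """Return [NULL_REF, ref sentences..., NULL_CAND, cand sentences...]."""
    ref_side = [null_ref] + [mean_pool(s) for s in ref_sentences]
    cand_side = [null_cand] + [mean_pool(s) for s in cand_sentences]
    return ref_side + cand_side


def ref_row(ref_seg_idx: int) -> int:
    """Assumed mapping: -1 (NULL_REF) is row 0, sentence i is row i + 1."""
    return ref_seg_idx + 1


def cand_row(cand_seg_idx: int, n_ref_sentences: int) -> int:
    """Assumed mapping: candidate rows start after the reference side."""
    return (n_ref_sentences + 1) + (cand_seg_idx + 1)


# Toy 2-dimensional token embeddings: two reference sentences, one candidate.
ref_sentences = [[[1.0, 3.0], [3.0, 5.0]], [[0.0, 2.0]]]
cand_sentences = [[[2.0, 2.0], [4.0, 6.0]]]
memory = build_memory(ref_sentences, cand_sentences,
                      null_ref=[0.0, 0.0], null_cand=[0.0, 0.0])

print(len(memory))                  # rows in M
print(memory[cand_row(0, 2)])       # pooled first candidate sentence
```

In the real model the NULL vectors are learned parameters; zeros are used here only to keep the example short.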

	## Files

	| file | size | purpose |
	|---|---|---|
	| `model.safetensors` | ~1.1 GB | merged backbone weights (chest2vec_0.6b + chest2err LoRA, fused) |
	| `config.json` | <1 KB | backbone architecture config |
	| `decoder.safetensors` | ~207 MB | decoder + null embeddings + heads |
	| `chest2err_modeling.py` | 14 KB | decoder architecture (the `CADAD` class) |
	| `chest2err.py` | 6 KB | self-contained loader (`chest2err_score`, `chest2err_detail`) |
	| `chest2err_config.json` | <1 KB | chest2err model meta-config |
	| `tokenizer.json`, `vocab.json`, etc. | ~14 MB | tokenizer files |

	Total: ~1.36 GB. Everything required to run chest2err is in this repository.

	## Quick start

	```python
	from chest2err import chest2err_score, chest2err_detail

	ref = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
	cand = "[Lungs] Several pulmonary nodules in the left upper lobe."

	score = chest2err_score(ref, cand)
	# 0.37 — substantial errors (K_total = 3, τ = 3.0)

	detail = chest2err_detail(ref, cand)
	# detail["score"] — chest2err-score in (0, 1]
	# detail["K_total"] — integer total error count
	# detail["tuples"] — list of {cat, anat, ref_seg_idx, cand_seg_idx, …}
	# detail["category_counts"] — per-category breakdown
	# detail["anatomy_counts"] — per-anatomy breakdown
	```

	The loader picks up the bundled weights automatically; no extra setup beyond `pip install transformers torch peft safetensors` is needed.

	## Output schema

	The primary output is the chest2err-score ∈ (0, 1] (computed from `exp(−K_total / τ)` with `τ = 3.0` as above). The score is backed by a sequence of structured error tuples:

	```python
	{
	    "cat": int,           # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
	    "anat": int,          # 0..8 (Lungs & Airways, Pleura, ... Others)
	    "concept": int,       # leaf concept id (clinical finding vocabulary)
	    "ref_seg_idx": int,   # -1 = NULL_REF, otherwise sentence index in reference report
	    "cand_seg_idx": int,  # -1 = NULL_CAND, otherwise sentence index in candidate report
	}
	```

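	`cat == 0` is the EOS marker; the model stops when it emits it. `K_total` counts the tuples emitted before EOS (equivalently `len(seq) − 1` for the decoded sequence `seq` including the EOS step), and `chest2err_score = exp(−K_total / τ)` with `τ = 3.0`.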
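The sketch below shows one way a list of error tuples could be rolled up into the score and per-category breakdowns. It assumes the list holds only error tuples with the EOS step already dropped, that anatomy ids follow the order in which the sections are listed in this card, and that breakdowns are keyed by name; `summarize` is a hypothetical helper, and the real `chest2err_detail` may differ.

```python
import math
from collections import Counter

# Category names from the schema comment above.
CATEGORY_NAMES = {1: "false_prediction", 2: "omission", 3: "location",
                  4: "severity", 5: "comparison"}
# Assumed id order: the order in which this card lists the anatomy sections.
ANATOMY_NAMES = ["Lungs & Airways", "Pleura", "Mediastinum & Hila",
                 "Cardiovascular", "Chest Wall", "Bones / Spine",
                 "Upper Abdomen", "Lower Neck", "Others"]


def summarize(error_tuples, tau=3.0):
    """Aggregate error tuples (EOS excluded) into a score and breakdowns."""
    k_total = len(error_tuples)
    return {
        "score": math.exp(-k_total / tau),
        "K_total": k_total,
        "category_counts": dict(Counter(CATEGORY_NAMES[t["cat"]] for t in error_tuples)),
        "anatomy_counts": dict(Counter(ANATOMY_NAMES[t["anat"]] for t in error_tuples)),
    }


# Toy tuples for illustration; concept ids are placeholders.
toy_tuples = [
    {"cat": 1, "anat": 0, "concept": 0, "ref_seg_idx": 0, "cand_seg_idx": 0},
    {"cat": 1, "anat": 0, "concept": 0, "ref_seg_idx": -1, "cand_seg_idx": 0},
    {"cat": 2, "anat": 1, "concept": 0, "ref_seg_idx": 1, "cand_seg_idx": -1},
]
summary = summarize(toy_tuples)
print(summary["K_total"], round(summary["score"], 2))
print(summary["category_counts"])
```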

	## Training data

	Trained on `chest2vec/chest2err-train` (in preparation): 53,881 (reference, candidate, labeled_errors) triples spanning 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (radiologist gold).

	### Variant generation (LLM-injected errors)

	Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus. For each reference report we prompted GPT-4o-mini to produce four candidate variants that deliberately insert a controlled number of errors drawn from the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) 6-category error taxonomy. The LLM was instructed to also output, for every inserted error, a structured label:

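	- error category (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison; the two comparison categories correspond to the single `comparison` class in the model's 5-category output)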
	- anatomy section (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
	- target finding concept (leaf finding from the chest CT vocabulary)

	Each training example is therefore a (reference, candidate, [per-error (category, anatomy, concept) triples]) record. The model is supervised to reproduce this structured error trace given only the (reference, candidate) input.
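A training record of this shape might be represented as follows. This is a sketch only: the class and field names are hypothetical, the example labels are invented for illustration, and the 6-to-5 mapping assumes that both comparison categories collapse onto the single `comparison` class of the output schema.

```python
from dataclasses import dataclass
from typing import List

REXVAL_6 = {1: "false_prediction", 2: "omission", 3: "location",
            4: "severity", 5: "spurious_comparison", 6: "omitted_comparison"}
# Assumed merge: both comparison categories map to output class 5.
MERGE_6_TO_5 = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 5}


@dataclass
class LabeledError:
    category: int  # 1..6, ReXVal taxonomy
    anatomy: str   # anatomy section name
    concept: str   # leaf finding (illustrative string, not a real vocabulary id)


@dataclass
class TrainingExample:
    reference: str
    candidate: str
    errors: List[LabeledError]

    @property
    def k_total(self) -> int:
        """Number of injected errors, used as the count target."""
        return len(self.errors)

    def merged_categories(self) -> List[int]:
        """Category ids in the 5-class output space."""
        return [MERGE_6_TO_5[e.category] for e in self.errors]


example = TrainingExample(
    reference="[Lungs] No pulmonary nodules. [Pleura] No effusion.",
    candidate="[Lungs] Several pulmonary nodules in the left upper lobe.",
    errors=[
        LabeledError(1, "Lungs & Airways", "pulmonary nodule"),
        LabeledError(6, "Others", "comparison to prior study"),
    ],
)
print(example.k_total, example.merged_categories())
```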

	### Training objective

	Supervised teacher-forced training on the LLM-labeled error sequences:

	- Per-step token losses on `(category, anatomy, concept)` heads at each decoder step
	- Pointer losses on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
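The two loss families listed above can be combined as in the following plain-Python sketch. Equal weighting of the five heads and averaging over steps are assumptions, since this card does not state loss weights; pointer targets are taken to be class indices (for example pointer index plus one, so that NULL is class 0), which is also an assumption.

```python
import math
from typing import Dict, List, Sequence

HEADS = ("cat", "anat", "concept", "ref_seg_idx", "cand_seg_idx")


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-softmax of the target class, computed stably."""
    if not 0 <= target < len(logits):
        raise ValueError("target must be a valid class index")
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def sequence_loss(step_logits: List[Dict[str, Sequence[float]]],
                  step_targets: List[Dict[str, int]]) -> float:
    """Teacher-forced loss: sum the five head losses per step, average over steps."""
    if not step_targets or len(step_logits) != len(step_targets):
        raise ValueError("need one target dict per decoder step")
    total = 0.0
    for logits, targets in zip(step_logits, step_targets):
        total += sum(cross_entropy(logits[h], targets[h]) for h in HEADS)
    return total / len(step_targets)


# One decoder step with uniform logits: each head contributes log(num_classes).
uniform_step = {"cat": [0.0] * 6, "anat": [0.0] * 9, "concept": [0.0] * 4,
                "ref_seg_idx": [0.0] * 3, "cand_seg_idx": [0.0] * 2}
targets = {"cat": 1, "anat": 0, "concept": 2, "ref_seg_idx": 0, "cand_seg_idx": 1}
uniform_loss = sequence_loss([uniform_step], [targets])
print(uniform_loss)
```

In a PyTorch training loop the same structure would typically use the framework's built-in cross-entropy on batched tensors; the list-based version here only illustrates how the per-head terms add up.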

	Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive adapter and the chest2err LoRA are merged into the bundled weights here.

	### Why this works

	- GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us noiseless K at training time.
	- The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to human-labeled errors at deployment with τ_b vs Critical = +0.763.
	- Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model interpretable — every emitted error tuple cites its source sentences.

	## Limitations

	- No severity output in v0.1. The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total / τ)` (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
	- Reference dependence. chest2err is a paired metric. It cannot evaluate a candidate against no reference.
	- English only. Trained on English chest CT reports from CT-RATE.
	- Chest CT only. Cross-domain performance (e.g. abdominal CT) is not validated.
	- 24-error hard cap. Reports with > 24 errors are clipped (rare; max observed in gold = 17).
	- Single-radiologist gold. Inter-rater calibration is in progress.

	## Citations

	If you use chest2err, please cite ReXVal (basis for the taxonomy and endpoint), CT-RATE (source of chest CT reports), and this model:

	```bibtex
	@misc{rexval2023,
	title = {{ReXVal}: Radiologist-Verified Evaluation of Automated Radiology Report Metrics},
	author = {Yu, F. and Endo, M. and Krishnan, R. and others},
	year = {2023},
	publisher = {PhysioNet},
	url = {https://physionet.org/content/rexval-dataset/1.0.0/}
	}

	@misc{hamamci2024ctrate,
	title = {A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities},
	author = {Hamamci, Ibrahim Ethem and Er, Sezgin and Almas, Furkan and others},
	year = {2024},
	eprint = {2403.17834},
	archivePrefix = {arXiv},
	url = {https://huggingface.co/datasets/ibrahimhamamci/CT-RATE}
	}

	@misc{chest2err2026,
	title = {chest2err: Sentence-grounded Error Score for Chest CT Reports},
	author = {chest2vec contributors},
	year = {2026},
	url = {https://huggingface.co/chest2vec/chest2err}
	}
	```

	## Related

	- Backbone: [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) — the chest2vec encoder this model is built on
	- Eval benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) — radiologist-labeled 400-pair gold set
	- CXR analogue (taxonomy basis): [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) — Radiologist-Verified Evaluation, chest X-ray (n=200)
	- Source of reference reports: [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) — chest CT volumes + radiology reports corpus

	## License

	CC-BY-NC-4.0. Released for research use.