chest2vec/chest2err

@@ -17,14 +17,38 @@ base_model: Qwen/Qwen3-Embedding-0.6B
 pipeline_tag: text-classification
 ---
-# chest2err — Sentence-grounded Error Decoder for Chest CT Reports
-**chest2err** is a sentence-grounded autoregressive decoder that, given a **(reference, candidate)** chest CT report pair, emits a sequence of structured error tuples. Each tuple specifies an error's `(category, anatomy, severity)` and points back at the **specific reference sentence and candidate sentence** that triggered it. The total error count `K` is the length of the emitted sequence.
-Built on top of the [chest2vec](https://huggingface.co/chest2vec) backbone (Qwen3-Embedding-0.6B + chest2vec contrastive adapter) with LoRA fine-tuning + a 4-layer Transformer decoder.
 Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).
 ## Headline metrics
 Evaluated on the 400-pair `chest2error-bench` gold set:
@@ -36,7 +60,8 @@ Evaluated on the 400-pair `chest2error-bench` gold set:
 | Kendall τ_b vs severity-weighted | +0.734 |
 | **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
 | Critical-error AUROC | 0.963 |
-| MAE vs gold total K | 1.12 |
 For comparison on the same benchmark (Kendall τ_b): BLEU = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err (+0.734) beats every metric listed here by **≥ +0.20 τ_b**; the closest is CRIMSON-GPT at +0.530.
@@ -81,40 +106,36 @@ Mean-pooling sentences before the decoder makes the encoder **paraphrase-robust*
 ## Quick start
-Inference requires the cera_eval package (in-tree at [chest2vec_error/src/cera_eval/](https://github.com/...)). A standalone HF-Hub-loadable wrapper is on the roadmap; in the meantime:
 ```python
-import torch
-from huggingface_hub import hf_hub_download
-from safetensors.torch import load_file
-from chest2err_modeling import CADAD  # downloaded from this repo
-# Plus the backbone loader from chest2vec:
-#   pip install transformers peft safetensors
-#   load Qwen/Qwen3-Embedding-0.6B + chest2vec adapter as in chest2vec repo
-# Load weights
-ckpt_path = hf_hub_download("chest2vec/chest2err", "model.safetensors")
-state = load_file(ckpt_path)
-# Wire into your backbone + decoder construction:
-model = CADAD(backbone=chest2vec_backbone, hidden=1024,
-              n_cat=5, n_anat=9, n_concepts=concept_vocab_size,
-              decoder_layers=4, decoder_heads=8, decoder_ff=2048,
-              max_decode_steps=24)
-model.load_state_dict(state, strict=False)
-model.eval()
-# At inference, encode (ref, cand), build sentence segment masks,
-# then call model.generate(...) which returns a list of tuples.
-# K = len(tuples) - 1 (EOS).
 ```
-A complete inference example (with sentence segmentation + tokenization) lives in [chest2vec_error/src/cera_eval/scorer.py](https://github.com/...).
 ## Output schema
-Each generated tuple is:
 ```python
 {
@@ -127,7 +148,7 @@ Each generated tuple is:
 }
 ```
-`cat == 0` is the EOS marker; the model stops when it emits it.
 ## Training data

 pipeline_tag: text-classification
 ---
+# chest2err — Sentence-grounded Error Score for Chest CT Reports
+**chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.00 means no errors were detected in the candidate report, 0.37 corresponds to one Critical error, and values below 0.01 indicate a severely degraded report.
+The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy, severity)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.
+Built on the [chest2vec](https://huggingface.co/chest2vec) backbone (Qwen3-Embedding-0.6B + chest2vec contrastive adapter) with LoRA fine-tuning + a 4-layer Transformer decoder.
 Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).
+## The chest2err-score
+```
+chest2err_score = exp(−K_w)
+K_w             = K_critical + 0.25 × K_minor
+```
+where `K_critical` and `K_minor` are the counts of Critical and Minor errors emitted by the decoder.
+| chest2err-score | K_w | interpretation |
+|---:|---:|---|
+| **1.00** | 0 | perfect — no errors |
+| 0.78 | 0.25 | one Minor error |
+| 0.37 | 1 | one Critical error |
+| 0.14 | 2 | two Critical (or 1 Critical + 4 Minor) |
+| 0.05 | 3 | substantial errors |
+| < 0.01 | ≥ 5 | severely degraded |
+Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
+The score is rank-equivalent to `−K_w`, so all Kendall τ_b benchmarks transfer unchanged from the count form.
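The formula and table above can be checked with a few lines of Python. This is a minimal sketch that works only from the two error counts; the helper names `weighted_error_count` and `score_from_counts` are illustrative and are not part of the released package.

```python
import math

# Severity weights taken from the formula above: K_w = K_critical + 0.25 * K_minor
CRITICAL_WEIGHT = 1.0
MINOR_WEIGHT = 0.25


def weighted_error_count(k_critical: int, k_minor: int) -> float:
    """Return K_w for the given Critical and Minor error counts."""
    if k_critical < 0 or k_minor < 0:
        raise ValueError("error counts must be non-negative")
    return CRITICAL_WEIGHT * k_critical + MINOR_WEIGHT * k_minor


def score_from_counts(k_critical: int, k_minor: int) -> float:
    """Return exp(-K_w), a value in (0, 1] where higher is better."""
    return math.exp(-weighted_error_count(k_critical, k_minor))


# (K_critical, K_minor) combinations matching the rows of the table
table_rows = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 4), (3, 0), (5, 0)]
for k_crit, k_min in table_rows:
    k_w = weighted_error_count(k_crit, k_min)
    print(f"K_critical={k_crit} K_minor={k_min} K_w={k_w:.2f} "
          f"score={score_from_counts(k_crit, k_min):.2f}")

# exp is strictly monotone, so sorting by score (descending) gives the same
# order as sorting by K_w (ascending); this is why tau_b is unchanged.
by_score = sorted(table_rows, key=lambda r: -score_from_counts(*r))
by_k_w = sorted(table_rows, key=lambda r: weighted_error_count(*r))
```

Because the mapping from `K_w` to the score is strictly decreasing, any rank-based statistic computed on `−K_w` carries over to the score without change.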
 ## Headline metrics
 Evaluated on the 400-pair `chest2error-bench` gold set:
 | Kendall τ_b vs severity-weighted | +0.734 |
 | **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
 | Critical-error AUROC | 0.963 |
+| MAE of K_total | 1.12 |
+| **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
 For comparison on the same benchmark (Kendall τ_b): BLEU = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err (+0.734) beats every metric listed here by **≥ +0.20 τ_b**; the closest is CRIMSON-GPT at +0.530.
 ## Quick start
+```python
+from chest2err import chest2err_score   # in-tree convenience wrapper
+ref  = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
+cand = "[Lungs] Several pulmonary nodules in the left upper lobe."
+score = chest2err_score(ref, cand)
+# exp(−1.25) ≈ 0.29 when the decoder emits 1 Critical false_prediction + 1 Minor omission (K_w = 1.25)
+```
+For the structured tuple output (which sentence triggered which error, plus the underlying K):
 ```python
+from chest2err import chest2err_detail
+detail = chest2err_detail(ref, cand)
+# detail.score           — chest2err-score in (0, 1]
+# detail.K_total         — integer total error count
+# detail.K_critical      — Critical error count
+# detail.K_minor         — Minor error count
+# detail.tuples          — list of {cat, anat, severity, ref_seg_idx, cand_seg_idx}
+# detail.category_counts — per-category breakdown
+# detail.anatomy_counts  — per-anatomy breakdown
 ```
+A self-contained HF `from_pretrained` loader is on the roadmap. Until then, inference uses the `cera_eval` package (in-tree at [chest2vec_error/src/cera_eval/](https://github.com/...)).
 ## Output schema
+The primary output is the **chest2err-score ∈ (0, 1]**, computed as `exp(−K_w)` (defined above). The score is backed by a sequence of structured error tuples; each generated tuple is:
 ```python
 {
 }
 ```
+`cat == 0` is the EOS marker; the model stops when it emits it. With the EOS tuple included in the sequence, `K_total = len(tuples) − 1`. Over the non-EOS tuples, `K_critical = sum(severity == 1)`, `K_minor = sum(severity == 0)`, and `score = exp(−(K_critical + 0.25 × K_minor))`.
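The bookkeeping described above can be sketched as a small post-processing function. This is only an illustration, not the `cera_eval` implementation: it assumes tuples are plain dicts with the keys named in this card, that `severity == 1` means Critical and `severity == 0` means Minor, and that the sequence ends with an EOS tuple. The function name `summarize_tuples`, the category and anatomy ids, and the field values on the EOS tuple are all made up for the example.

```python
import math
from collections import Counter

EOS_CAT = 0  # `cat == 0` marks end of sequence, per the schema above


def summarize_tuples(tuples):
    """Derive error counts and the chest2err-score from decoded tuples."""
    errors = []
    for t in tuples:
        if t["cat"] == EOS_CAT:
            break  # ignore the EOS tuple and anything after it
        errors.append(t)

    k_critical = sum(1 for t in errors if t["severity"] == 1)
    k_minor = sum(1 for t in errors if t["severity"] == 0)
    k_w = k_critical + 0.25 * k_minor
    return {
        "K_total": len(errors),
        "K_critical": k_critical,
        "K_minor": k_minor,
        "score": math.exp(-k_w),
        "category_counts": dict(Counter(t["cat"] for t in errors)),
        "anatomy_counts": dict(Counter(t["anat"] for t in errors)),
    }


# Hypothetical decoder output: one Critical error, one Minor error, then EOS.
example = [
    {"cat": 2, "anat": 1, "severity": 1, "ref_seg_idx": 0, "cand_seg_idx": 0},
    {"cat": 3, "anat": 4, "severity": 0, "ref_seg_idx": 1, "cand_seg_idx": -1},
    {"cat": 0, "anat": 0, "severity": 0, "ref_seg_idx": -1, "cand_seg_idx": -1},
]
summary = summarize_tuples(example)
print(summary)
```

Stopping at the first EOS tuple keeps the EOS entry out of the severity counts, which matters if its `severity` field happens to be 0.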
 ## Training data