chest2vec
/

chest2err

@@ -30,24 +30,24 @@ Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datas
 ## The chest2err-score
 ```
-chest2err_score = exp(−K_w)
-K_w             = K_critical + 0.25 × K_minor
 ```
-where `K_critical` and `K_minor` are the counts of Critical and Minor errors emitted by the decoder.
-| chest2err-score | K_w | interpretation |
 |---:|---:|---|
-| **1.00** | 0 | perfect — no errors |
-| 0.78 | 0.25 | one Minor error |
-| 0.37 | 1 | one Critical error |
-| 0.14 | 2 | two Critical (or 1 Critical + 4 Minor) |
 | 0.05 | 3 | substantial errors |
 | < 0.01 | ≥ 5 | severely degraded |
 Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
-The score is rank-equivalent to `−K_w`, so all Kendall τ_b benchmarks transfer unchanged from the count form.
 ## Headline metrics
@@ -55,14 +55,16 @@ Evaluated on the 400-pair `chest2error-bench` gold set:
 | metric | value |
 |---|---|
-| **Kendall τ_b vs Critical errors** | **+0.763** |
 | Kendall τ_b vs total errors | +0.665 |
 | Kendall τ_b vs severity-weighted | +0.734 |
 | **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
 | Critical-error AUROC | 0.963 |
 | MAE of K_total | 1.12 |
 | **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
 For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.
 ### CXR/CT generalization
@@ -135,20 +137,20 @@ A self-contained HF `from_pretrained` loader is on the roadmap. Until then, infe
 ## Output schema
-The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_w)` as above). The score is backed by a sequence of structured error tuples; each generated tuple is:
 ```python
 {
     "cat":          int,  # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
     "anat":         int,  # 0..8 (Lungs & Airways, Pleura, ... Others)
     "concept":      int,  # leaf concept id (clinical finding vocabulary)
-    "severity":     int,  # 0 = Minor, 1 = Critical
     "ref_seg_idx":  int,  # -1 = NULL_REF, otherwise sentence index in reference report
     "cand_seg_idx": int,  # -1 = NULL_CAND, otherwise sentence index in candidate report
 }
 ```
-`cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`. Then `K_critical = sum(severity == 1)`, `K_minor = sum(severity == 0)`, and `score = exp(−(K_critical + 0.25 × K_minor))`.
 ## Training data
@@ -170,7 +172,8 @@ Supervised teacher-forced training on the LLM-labeled error sequences:
 - **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
 - **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
-- **Severity loss** on the `severity` head (Critical / Minor — added on the radiologist-labeled validation subset)
 Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the chest2vec contrastive adapter; both adapters compose at inference).
@@ -182,6 +185,7 @@ Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the
 ## Limitations
 - **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference (use `chest2vec/candidate_only` for that case).
 - **English only.** Trained on English chest CT reports from CT-RATE.
 - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.

 ## The chest2err-score
 ```
+chest2err_score = exp(−K_total)
 ```
+where `K_total` is the total number of error tuples emitted by the decoder.
+| chest2err-score | K_total | interpretation |
 |---:|---:|---|
+| **1.00** | 0 | perfect — no errors detected |
+| 0.37 | 1 | one error |
+| 0.14 | 2 | two errors |
 | 0.05 | 3 | substantial errors |
 | < 0.01 | ≥ 5 | severely degraded |
 Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
+The score is rank-equivalent to `−K_total`, so all Kendall τ_b benchmarks transfer unchanged from the count form.
+> **Note on severity weighting.** The decoder also emits a `severity ∈ {Minor, Critical}` field per error tuple. However, the LLM-generated training corpus does **not** include severity labels — only the 200-variant radiologist-labeled validation slice does — so the severity head is **not currently reliably trained**. Until a severity-labeled training set is released, the canonical chest2err-score uses **`K_total` directly** (every emitted error weighted equally). A severity-weighted variant of the form `K_w = K_critical + 0.25 × K_minor` will become the recommended formulation once the severity head is properly fine-tuned.
 ## Headline metrics
 | metric | value |
 |---|---|
 | Kendall τ_b vs total errors | +0.665 |
+| **Kendall τ_b vs Critical errors** | **+0.763** |
 | Kendall τ_b vs severity-weighted | +0.734 |
 | **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
 | Critical-error AUROC | 0.963 |
 | MAE of K_total | 1.12 |
 | **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
+The Critical and severity-weighted τ_b numbers are computed using the **radiologist's severity labels** in the gold set (not the model's severity output). They show that the predicted K_total correlates strongly with the human Critical-error count even without explicit severity supervision — once a severity-labeled training corpus is added, these numbers should improve further.
 For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.
 ### CXR/CT generalization
 ## Output schema
+The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total)` as above). The score is backed by a sequence of structured error tuples; each generated tuple is:
 ```python
 {
     "cat":          int,  # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
     "anat":         int,  # 0..8 (Lungs & Airways, Pleura, ... Others)
     "concept":      int,  # leaf concept id (clinical finding vocabulary)
+    "severity":     int,  # 0 = Minor, 1 = Critical (not reliably trained in v0.1 — see severity-weighting note above)
     "ref_seg_idx":  int,  # -1 = NULL_REF, otherwise sentence index in reference report
     "cand_seg_idx": int,  # -1 = NULL_CAND, otherwise sentence index in candidate report
 }
 ```
+`cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`, and `chest2err_score = exp(−K_total)`.
 ## Training data
 - **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
 - **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
+Note: a `severity` head exists in the architecture but is **not reliably trained in v0.1** — GPT-4o-mini's variant labels don't include Critical/Minor severity, and the 200-row radiologist subset is too small a signal on its own. Severity output is therefore not part of the canonical chest2err-score in this release. Adding a severity-labeled training set is the headline item on the roadmap.
 Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the chest2vec contrastive adapter; both adapters compose at inference).
 ## Limitations
+- **Severity output not reliable in v0.1.** The decoder emits a Critical / Minor severity per error tuple, but its training signal is too thin (GPT-4o-mini's variant labels don't include severity). Use the canonical `chest2err_score = exp(−K_total)` and ignore the severity field until a severity-labeled training set is released.
 - **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference (use `chest2vec/candidate_only` for that case).
 - **English only.** Trained on English chest CT reports from CT-RATE.
 - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.