Soften chest2err-score: add display temperature tau=3.0 (default)

exp(-K_total/tau) with tau=3.0; one error -> 0.72 instead of 0.37. Rank-equivalent, so Kendall tau_b benchmarks unchanged. tau=1.0 recovers original scale.
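
The numbers quoted in the description can be reproduced with a minimal sketch of the formula. `soft_score` is a hypothetical helper name used here for illustration only; it is not part of `chest2err.py`.

```python
import math


def soft_score(k_total: int, tau: float = 3.0) -> float:
    """Hypothetical helper: exp(-K_total / tau), the formula from this commit."""
    return math.exp(-k_total / tau)


# Compare the softened scale (tau = 3.0) with the original scale (tau = 1.0).
for k in range(0, 6):
    print(k, round(soft_score(k, 3.0), 2), round(soft_score(k, 1.0), 2))
```

Each printed row gives the error count followed by the score at `tau = 3.0` and at `tau = 1.0`, so the two scales can be read side by side.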

Files changed (3)

README.md +12 -12
chest2err.py +5 -1
chest2err_config.json +1 -0

README.md CHANGED

@@ -18,7 +18,7 @@ pipeline_tag: text-classification
 # chest2err — Sentence-grounded Error Score for Chest CT Reports
-**chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means the candidate report is perfect; 0.37 means one error; below 0.05 means substantial errors.
 The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.
@@ -29,22 +29,22 @@ Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datas
 ## The chest2err-score
 ```
-chest2err_score = exp(−K_total)
 ```
-where `K_total` is the total number of error tuples emitted by the decoder.
 | chest2err-score | K_total | interpretation |
 |---:|---:|---|
 | **1.00** | 0 | perfect — no errors detected |
-| 0.37 | 1 | one error |
-| 0.14 | 2 | two errors |
-| 0.05 | 3 | substantial errors |
-| < 0.01 | ≥ 5 | severely degraded |
 Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
-The score is rank-equivalent to `−K_total`, so all Kendall τ_b benchmarks transfer unchanged from the count form.
 ## Headline metrics
@@ -111,7 +111,7 @@ ref  = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
 cand = "[Lungs] Several pulmonary nodules in the left upper lobe."
 score = chest2err_score(ref, cand)
-# 0.05 — substantial errors
 detail = chest2err_detail(ref, cand)
 # detail["score"]           — chest2err-score in (0, 1]
@@ -125,7 +125,7 @@ The loader picks up the bundled weights automatically; no extra setup beyond `pi
 ## Output schema
-The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total)` as above). The score is backed by a sequence of structured error tuples:
 ```python
 {
@@ -137,7 +137,7 @@ The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−
 }
 ```
-`cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`, and `chest2err_score = exp(−K_total)`.
 ## Training data
@@ -170,7 +170,7 @@ Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive
 ## Limitations
-- **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total)` treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
 - **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
 - **English only.** Trained on English chest CT reports from CT-RATE.
 - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.

 # chest2err — Sentence-grounded Error Score for Chest CT Reports
+**chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means the candidate report is perfect; 0.72 means one error; 0.37 or below means substantial errors (three or more).
 The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.
 ## The chest2err-score
 ```
+chest2err_score = exp(−K_total / τ)        # τ = 3.0 (default)
 ```
+where `K_total` is the total number of error tuples emitted by the decoder and `τ` is a display temperature (`score_temperature` in `chest2err_config.json`).
 | chest2err-score | K_total | interpretation |
 |---:|---:|---|
 | **1.00** | 0 | perfect — no errors detected |
+| 0.72 | 1 | one error |
+| 0.51 | 2 | two errors |
+| 0.37 | 3 | substantial errors |
+| ≤ 0.19 | ≥ 5 | severely degraded |
 Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
+The temperature `τ` only rescales the displayed number for human readability, so that a single error no longer collapses the score. Set `τ=1.0` to recover the original `exp(−K_total)` scale (`K_total = 1` → 0.37, `K_total = 2` → 0.14). Because `exp(−K_total/τ)` is a strictly decreasing function of `K_total` for any `τ>0`, the score is **rank-equivalent to `−K_total`**, so all Kendall τ_b benchmarks transfer unchanged from the count form regardless of `τ`.
 ## Headline metrics
 cand = "[Lungs] Several pulmonary nodules in the left upper lobe."
 score = chest2err_score(ref, cand)
+# 0.37 — substantial errors (K_total = 3, τ = 3.0)
 detail = chest2err_detail(ref, cand)
 # detail["score"]           — chest2err-score in (0, 1]
 ## Output schema
+The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total / τ)` with `τ = 3.0` as above). The score is backed by a sequence of structured error tuples:
 ```python
 {
 }
 ```
+`cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`, and `chest2err_score = exp(−K_total / τ)` with `τ = 3.0`.
 ## Training data
 ## Limitations
+- **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total / τ)` (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
 - **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
 - **English only.** Trained on English chest CT reports from CT-RATE.
 - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
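
The rank-equivalence claim in the README can be checked directly. Kendall τ_b depends only on the signs of pairwise differences (ties included), so it is enough to confirm that `exp(−K_total/τ)` orders every pair of reports exactly as `−K_total` does. The following is a minimal sketch; the helper names and the example counts are made up for illustration and do not come from the benchmark.

```python
import math
from itertools import combinations


def sign(x: float) -> int:
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def same_pairwise_order(counts, tau):
    """True if exp(-K/tau) orders every pair exactly as -K does (ties included)."""
    scores = [math.exp(-k / tau) for k in counts]
    return all(
        sign(scores[i] - scores[j]) == sign(counts[j] - counts[i])
        for i, j in combinations(range(len(counts)), 2)
    )


# Made-up error counts, including ties, purely for illustration.
example_counts = [0, 3, 1, 1, 5, 2, 0, 4]
for tau in (0.5, 1.0, 3.0, 10.0):
    print(tau, same_pairwise_order(example_counts, tau))
```

Since every pair keeps its sign, the concordant, discordant, and tied pair counts that enter τ_b are identical for any positive temperature.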

chest2err.py CHANGED

@@ -50,6 +50,10 @@ class Chest2Err:
         self.cfg = cfg
         self.device = device
         self.max_length = cfg["max_length"]
         # Concept vocab (size determines decoder output head dim)
         with open(PACKAGE_DIR / "concept2id.json") as f:
@@ -114,7 +118,7 @@ class Chest2Err:
             )
         seq = seqs[0]
         K_total = len(seq)
-        score = math.exp(-K_total)
         cat_counts = [0] * self.cfg["n_cat"]
         anat_counts = [0] * self.cfg["n_anat"]

         self.cfg = cfg
         self.device = device
         self.max_length = cfg["max_length"]
+        # Display temperature τ for the score exp(-K_total/τ). τ=3.0 is the
+        # default gentle setting (one error → 0.72); τ=1.0 reproduces the
+        # original exp(-K_total). Rank-equivalent, so it never affects Kendall τ_b.
+        self.score_temperature = float(cfg.get("score_temperature", 3.0))
         # Concept vocab (size determines decoder output head dim)
         with open(PACKAGE_DIR / "concept2id.json") as f:
             )
         seq = seqs[0]
         K_total = len(seq)
+        score = math.exp(-K_total / self.score_temperature)
         cat_counts = [0] * self.cfg["n_cat"]
         anat_counts = [0] * self.cfg["n_anat"]
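
The rank-equivalence argument holds only for `τ > 0`. A `score_temperature` of 0 would raise `ZeroDivisionError` in the division above, and a negative value would make the score grow with the error count. One possible guard is sketched below; `read_score_temperature` is a hypothetical helper and is not part of this commit.

```python
import math


def read_score_temperature(cfg: dict, default: float = 3.0) -> float:
    """Hypothetical guard: read score_temperature and reject non-positive values."""
    tau = float(cfg.get("score_temperature", default))
    if not math.isfinite(tau) or tau <= 0.0:
        raise ValueError(
            f"score_temperature must be a positive finite number, got {tau!r}"
        )
    return tau


print(read_score_temperature({}))                          # falls back to the default
print(read_score_temperature({"score_temperature": 1.0}))  # original exp(-K_total) scale
```

Failing early at load time keeps a misconfigured temperature from silently producing scores outside (0, 1].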

chest2err_config.json CHANGED

@@ -11,5 +11,6 @@
   "decoder_ff": 2048,
   "decoder_dropout": 0.1,
   "max_decode_steps": 24,
   "input_template": "[REF] {reference_report}\n\n[PRED] {candidate_report}"
 }

   "decoder_ff": 2048,
   "decoder_dropout": 0.1,
   "max_decode_steps": 24,
+  "score_temperature": 3.0,
   "input_template": "[REF] {reference_report}\n\n[PRED] {candidate_report}"
 }
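
Because only the exponent changes, a score recorded at one temperature can be converted to another without re-running the model: `exp(−K/τ_new) = exp(−K/τ_old) ** (τ_old/τ_new)`, and the count is recovered by `K_total = −τ · ln(score)`. The sketch below illustrates this under the assumption that the unrounded score is available; `rescale` and `recover_k_total` are hypothetical helpers, and the JSON fragment contains only the key added in this commit.

```python
import json
import math

# Only the key added in this commit; the real config file has more fields.
cfg = json.loads('{"score_temperature": 3.0}')
tau_new = float(cfg["score_temperature"])


def rescale(score: float, tau_old: float, tau_new: float) -> float:
    """Hypothetical helper: convert a score from temperature tau_old to tau_new."""
    return score ** (tau_old / tau_new)


def recover_k_total(score: float, tau: float) -> int:
    """Hypothetical helper: invert exp(-K_total / tau) back to the error count."""
    return round(-tau * math.log(score))


old_score = math.exp(-1)  # one error on the original tau = 1.0 scale
new_score = rescale(old_score, 1.0, tau_new)
print(round(new_score, 2), recover_k_total(new_score, tau_new))
```

Scores that were rounded for display may not invert exactly, so the conversion is best applied to the raw value in `detail["score"]`.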