lukeingawesome committed on
Commit 06de0a9 · verified · 1 Parent(s): 23c824a

Soften chest2err-score: add display temperature tau=3.0 (default)

exp(-K_total/tau) with tau=3.0; one error -> 0.72 instead of 0.37. Rank-equivalent, so Kendall tau_b benchmarks unchanged. tau=1.0 recovers original scale.
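
The numbers quoted in this message can be reproduced with a minimal sketch. `score_from_count` is a hypothetical helper written for illustration, not a function from the repository, and the `tau > 0` check is an added assumption. The second half shows why the change is rank-equivalent: sorting a set of error counts by score gives the same order under either temperature.

```python
import math


def score_from_count(k_total: int, tau: float = 3.0) -> float:
    """Return exp(-k_total / tau); tau is the display temperature."""
    if tau <= 0:
        # Assumption: the commit itself does not validate tau.
        raise ValueError("tau must be positive")
    return math.exp(-k_total / tau)


counts = [0, 1, 2, 3, 5]
soft = [round(score_from_count(k), 2) for k in counts]               # tau = 3.0
original = [round(score_from_count(k, tau=1.0), 2) for k in counts]  # tau = 1.0

print(soft)      # [1.0, 0.72, 0.51, 0.37, 0.19]
print(original)  # [1.0, 0.37, 0.14, 0.05, 0.01]

# Rank equivalence: ordering reports by score does not depend on tau.
ks = [3, 0, 5, 1, 2]
order_soft = sorted(range(len(ks)), key=lambda i: score_from_count(ks[i], 3.0))
order_original = sorted(range(len(ks)), key=lambda i: score_from_count(ks[i], 1.0))
print(order_soft == order_original)  # True
```

Because only the ordering of pairs enters Kendall tau_b, any strictly decreasing function of `K_total` leaves it unchanged.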

Files changed (3)
  1. README.md +12 -12
  2. chest2err.py +5 -1
  3. chest2err_config.json +1 -0
README.md CHANGED
@@ -18,7 +18,7 @@ pipeline_tag: text-classification
18
 
19
  # chest2err - Sentence-grounded Error Score for Chest CT Reports
20
 
21
- **chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means the candidate report is perfect; 0.37 means one error; below 0.05 means substantial errors.
22
 
23
  The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.
24
 
@@ -29,22 +29,22 @@ Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datas
29
  ## The chest2err-score
30
 
31
  ```
32
- chest2err_score = exp(−K_total)
33
  ```
34
 
35
- where `K_total` is the total number of error tuples emitted by the decoder.
36
 
37
  | chest2err-score | K_total | interpretation |
38
  |---:|---:|---|
39
  | **1.00** | 0 | perfect - no errors detected |
40
- | 0.37 | 1 | one error |
41
- | 0.14 | 2 | two errors |
42
- | 0.05 | 3 | substantial errors |
43
- | < 0.01 | ≥ 5 | severely degraded |
44
 
45
  Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
46
 
47
- The score is rank-equivalent to `−K_total`, so all Kendall τ_b benchmarks transfer unchanged from the count form.
48
 
49
  ## Headline metrics
50
 
@@ -111,7 +111,7 @@ ref = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
111
  cand = "[Lungs] Several pulmonary nodules in the left upper lobe."
112
 
113
  score = chest2err_score(ref, cand)
114
- # 0.05 - substantial errors
115
 
116
  detail = chest2err_detail(ref, cand)
117
  # detail["score"] - chest2err-score in (0, 1]
@@ -125,7 +125,7 @@ The loader picks up the bundled weights automatically; no extra setup beyond `pi
125
 
126
  ## Output schema
127
 
128
- The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total)` as above). The score is backed by a sequence of structured error tuples:
129
 
130
  ```python
131
  {
@@ -137,7 +137,7 @@ The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−
137
  }
138
  ```
139
 
140
- `cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`, and `chest2err_score = exp(−K_total)`.
141
 
142
  ## Training data
143
 
@@ -170,7 +170,7 @@ Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive
170
 
171
  ## Limitations
172
 
173
- - **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total)` treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
174
  - **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
175
  - **English only.** Trained on English chest CT reports from CT-RATE.
176
  - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
 
18
 
19
  # chest2err - Sentence-grounded Error Score for Chest CT Reports
20
 
21
+ **chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means the candidate report is perfect; 0.72 means one error; below 0.20 means substantial errors.
22
 
23
  The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.
24
 
 
29
  ## The chest2err-score
30
 
31
  ```
32
+ chest2err_score = exp(−K_total / τ) # τ = 3.0 (default)
33
  ```
34
 
35
+ where `K_total` is the total number of error tuples emitted by the decoder and `τ` is a display temperature (`score_temperature` in `chest2err_config.json`).
36
 
37
  | chest2err-score | K_total | interpretation |
38
  |---:|---:|---|
39
  | **1.00** | 0 | perfect - no errors detected |
40
+ | 0.72 | 1 | one error |
41
+ | 0.51 | 2 | two errors |
42
+ | 0.37 | 3 | substantial errors |
43
+ | ≤ 0.19 | ≥ 5 | severely degraded |
44
 
45
  Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
46
 
47
+ The temperature `τ` only rescales the displayed number for human readability: a single error no longer collapses the score. Set `τ=1.0` to recover the original `exp(−K_total)` scale (1 → 0.37, 2 → 0.14). Because `exp(−K_total/τ)` is a strictly monotone function of `K_total` for any `τ>0`, the score is **rank-equivalent to `−K_total`**, so all Kendall τ_b benchmarks transfer unchanged from the count form regardless of `τ`.
48
 
49
  ## Headline metrics
50
 
 
111
  cand = "[Lungs] Several pulmonary nodules in the left upper lobe."
112
 
113
  score = chest2err_score(ref, cand)
114
+ # 0.37 - substantial errors (K_total = 3, τ = 3.0)
115
 
116
  detail = chest2err_detail(ref, cand)
117
  # detail["score"] - chest2err-score in (0, 1]
 
125
 
126
  ## Output schema
127
 
128
+ The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total / τ)` with `τ = 3.0` as above). The score is backed by a sequence of structured error tuples:
129
 
130
  ```python
131
  {
 
137
  }
138
  ```
139
 
140
+ `cat == 0` is the EOS marker; the model stops when it emits it. `K_total = len(tuples) − 1`, and `chest2err_score = exp(−K_total / τ)` with `τ = 3.0`.
141
 
142
  ## Training data
143
 
 
170
 
171
  ## Limitations
172
 
173
+ - **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total / τ)` (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
174
  - **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
175
  - **English only.** Trained on English chest CT reports from CT-RATE.
176
  - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
chest2err.py CHANGED
@@ -50,6 +50,10 @@ class Chest2Err:
50
  self.cfg = cfg
51
  self.device = device
52
  self.max_length = cfg["max_length"]
 
 
 
 
53
 
54
  # Concept vocab (size determines decoder output head dim)
55
  with open(PACKAGE_DIR / "concept2id.json") as f:
@@ -114,7 +118,7 @@ class Chest2Err:
114
  )
115
  seq = seqs[0]
116
  K_total = len(seq)
117
- score = math.exp(-K_total)
118
 
119
  cat_counts = [0] * self.cfg["n_cat"]
120
  anat_counts = [0] * self.cfg["n_anat"]
 
50
  self.cfg = cfg
51
  self.device = device
52
  self.max_length = cfg["max_length"]
53
+ # Display temperature τ for the score exp(-K_total/τ). τ=3.0 is the
54
+ # default gentle setting (one error → 0.72); τ=1.0 reproduces the
55
+ # original exp(-K_total). Rank-equivalent, so τ never affects τ_b.
56
+ self.score_temperature = float(cfg.get("score_temperature", 3.0))
57
 
58
  # Concept vocab (size determines decoder output head dim)
59
  with open(PACKAGE_DIR / "concept2id.json") as f:
 
118
  )
119
  seq = seqs[0]
120
  K_total = len(seq)
121
+ score = math.exp(-K_total / self.score_temperature)
122
 
123
  cat_counts = [0] * self.cfg["n_cat"]
124
  anat_counts = [0] * self.cfg["n_anat"]
chest2err_config.json CHANGED
@@ -11,5 +11,6 @@
11
  "decoder_ff": 2048,
12
  "decoder_dropout": 0.1,
13
  "max_decode_steps": 24,
 
14
  "input_template": "[REF] {reference_report}\n\n[PRED] {candidate_report}"
15
  }
 
11
  "decoder_ff": 2048,
12
  "decoder_dropout": 0.1,
13
  "max_decode_steps": 24,
14
+ "score_temperature": 3.0,
15
  "input_template": "[REF] {reference_report}\n\n[PRED] {candidate_report}"
16
  }
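
The `cfg.get("score_temperature", 3.0)` fallback in `chest2err.py` means a config file without the new key is scored on the softened scale. A minimal sketch of that lookup follows. `read_score_temperature` and both config strings are hypothetical, and the positivity check is an added assumption that the commit does not include.

```python
import json

DEFAULT_TAU = 3.0  # default introduced by this commit


def read_score_temperature(config_text: str) -> float:
    """Read score_temperature from a JSON config, falling back to the default."""
    cfg = json.loads(config_text)
    tau = float(cfg.get("score_temperature", DEFAULT_TAU))
    if tau <= 0:
        # Assumption: guard added for illustration; exp(-K/tau) needs tau > 0.
        raise ValueError("score_temperature must be > 0")
    return tau


# Hypothetical configs: one predating the commit, one pinning the old scale.
legacy_cfg = '{"max_decode_steps": 24}'
pinned_cfg = '{"max_decode_steps": 24, "score_temperature": 1.0}'

print(read_score_temperature(legacy_cfg))  # 3.0
print(read_score_temperature(pinned_cfg))  # 1.0
```

Pinning `"score_temperature": 1.0` is therefore the way to keep reported scores comparable with numbers produced before this commit.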