lukeingawesome committed on
Commit
b7103ad
·
verified ·
1 Parent(s): 8a9746d

Reframe output as chest2err-score (exp(-K_w)) — GREEN-style single-number quality signal in (0,1]

Files changed (1)
  1. README.md +53 -32
README.md CHANGED
@@ -17,14 +17,38 @@ base_model: Qwen/Qwen3-Embedding-0.6B
17
  pipeline_tag: text-classification
18
  ---
19
 
20
- # chest2err — Sentence-grounded Error Decoder for Chest CT Reports
21
 
22
- **chest2err** is a sentence-grounded autoregressive decoder that, given a **(reference, candidate)** chest CT report pair, emits a sequence of structured error tuples. Each tuple specifies an error's `(category, anatomy, severity)` and points back at the **specific reference sentence and candidate sentence** that triggered it. The total error count `K` is the length of the emitted sequence.
23
 
24
- Built on top of the [chest2vec](https://huggingface.co/chest2vec) backbone (Qwen3-Embedding-0.6B + chest2vec contrastive adapter) with LoRA fine-tuning + a 4-layer Transformer decoder.
 
 
25
 
26
  Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).
27
 
28
  ## Headline metrics
29
 
30
  Evaluated on the 400-pair `chest2error-bench` gold set:
@@ -36,7 +60,8 @@ Evaluated on the 400-pair `chest2error-bench` gold set:
36
  | Kendall τ_b vs severity-weighted | +0.734 |
37
  | **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
38
  | Critical-error AUROC | 0.963 |
39
- | MAE vs gold total K | 1.12 |
 
40
 
41
  For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.
42
 
@@ -81,40 +106,36 @@ Mean-pooling sentences before the decoder makes the encoder **paraphrase-robust*
81
 
82
  ## Quick start
83
 
84
- Inference requires the cera_eval package (in-tree at [chest2vec_error/src/cera_eval/](https://github.com/...)). A standalone HF-Hub-loadable wrapper is on the roadmap; in the meantime:
85
 
86
  ```python
87
- import torch
88
- from huggingface_hub import hf_hub_download
89
- from safetensors.torch import load_file
90
-
91
- from chest2err_modeling import CADAD # downloaded from this repo
92
- # Plus the backbone loader from chest2vec:
93
- # pip install transformers peft safetensors
94
- # load Qwen/Qwen3-Embedding-0.6B + chest2vec adapter as in chest2vec repo
95
-
96
- # Load weights
97
- ckpt_path = hf_hub_download("chest2vec/chest2err", "model.safetensors")
98
- state = load_file(ckpt_path)
99
-
100
- # Wire into your backbone + decoder construction:
101
- model = CADAD(backbone=chest2vec_backbone, hidden=1024,
102
- n_cat=5, n_anat=9, n_concepts=concept_vocab_size,
103
- decoder_layers=4, decoder_heads=8, decoder_ff=2048,
104
- max_decode_steps=24)
105
- model.load_state_dict(state, strict=False)
106
- model.eval()
107
-
108
- # At inference, encode (ref, cand), build sentence segment masks,
109
- # then call model.generate(...) which returns a list of tuples.
110
- # K = len(tuples) - 1 (EOS).
111
  ```
112
 
113
- A complete inference example (with sentence segmentation + tokenization) lives in [chest2vec_error/src/cera_eval/scorer.py](https://github.com/...).
114
 
115
  ## Output schema
116
 
117
- Each generated tuple is:
118
 
119
  ```python
120
  {
@@ -127,7 +148,7 @@ Each generated tuple is:
127
  }
128
  ```
129
 
130
- `cat == 0` is the EOS marker; the model stops when it emits it.
131
 
132
  ## Training data
133
 
 
17
  pipeline_tag: text-classification
18
  ---
19
 
20
+ # chest2err — Sentence-grounded Error Score for Chest CT Reports
21
 
22
+ **chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score in (0, 1]**, where higher is better. The score is interpretable: 1.0 means the decoder found no errors in the candidate report; 0.37 corresponds to one Critical error; values of about 0.05 and below indicate substantial errors.
23
 
24
+ The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy, severity)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.
25
+
26
+ Built on the [chest2vec](https://huggingface.co/chest2vec) backbone (Qwen3-Embedding-0.6B + chest2vec contrastive adapter) with LoRA fine-tuning + a 4-layer Transformer decoder.
27
 
28
  Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).
29
 
30
+ ## The chest2err-score
31
+
32
+ ```
33
+ chest2err_score = exp(−K_w)
34
+ K_w = K_critical + 0.25 × K_minor
35
+ ```
36
+
37
+ where `K_critical` and `K_minor` are the counts of Critical and Minor errors emitted by the decoder.
38
+
39
+ | chest2err-score | K_w | interpretation |
40
+ |---:|---:|---|
41
+ | **1.00** | 0 | perfect — no errors |
42
+ | 0.78 | 0.25 | one Minor error |
43
+ | 0.37 | 1 | one Critical error |
44
+ | 0.14 | 2 | two Critical (or 1 Critical + 4 Minor) |
45
+ | 0.05 | 3 | substantial errors |
46
+ | < 0.01 | ≥ 5 | severely degraded |
47
+
48
+ Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].**
49
+
50
+ The score is rank-equivalent to `−K_w`, so all Kendall τ_b benchmarks transfer unchanged from the count form.
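As a quick check of the formula and the interpretation table, the following minimal sketch maps error counts to the score. The helper name `chest2err_score_from_counts` and the constant `MINOR_WEIGHT` are hypothetical and exist only for this illustration; they are not part of the released package.

```python
import math

# Weight of one Minor error relative to one Critical error (from K_w above).
MINOR_WEIGHT = 0.25


def chest2err_score_from_counts(k_critical: int, k_minor: int) -> float:
    """Hypothetical helper: compute exp(-K_w) from Critical and Minor counts."""
    if k_critical < 0 or k_minor < 0:
        raise ValueError("error counts must be non-negative")
    k_w = k_critical + MINOR_WEIGHT * k_minor
    return math.exp(-k_w)


# Print the (K_critical, K_minor) combinations listed in the table.
for k_crit, k_min in [(0, 0), (0, 1), (1, 0), (2, 0), (1, 4), (3, 0), (5, 0)]:
    score = chest2err_score_from_counts(k_crit, k_min)
    print(f"K_critical={k_crit} K_minor={k_min} score={score:.2f}")
```

Because `exp` is strictly monotonic, sorting candidates by this score gives the same order as sorting by `-K_w`, which is why rank-based metrics carry over from the count form.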
51
+
52
  ## Headline metrics
53
 
54
  Evaluated on the 400-pair `chest2error-bench` gold set:
 
60
  | Kendall τ_b vs severity-weighted | +0.734 |
61
  | **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
62
  | Critical-error AUROC | 0.963 |
63
+ | MAE of K_total | 1.12 |
64
+ | **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
65
 
66
  For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.
67
 
 
106
 
107
  ## Quick start
108
 
109
+ ```python
110
+ from chest2err import chest2err_score # in-tree convenience wrapper
111
+
112
+ ref = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
113
+ cand = "[Lungs] Several pulmonary nodules in the left upper lobe."
114
+
115
+ score = chest2err_score(ref, cand)
116
+ # ≈ 0.29 if the decoder emits 1 false_prediction Critical + 1 omission Minor (K_w = 1.25)
117
+ ```
118
+
119
+ For the structured tuple output (which sentence triggered which error, plus the underlying K):
120
 
121
  ```python
122
+ from chest2err import chest2err_detail
123
+
124
+ detail = chest2err_detail(ref, cand)
125
+ # detail.score — chest2err-score in (0, 1]
126
+ # detail.K_total — integer total error count
127
+ # detail.K_critical — Critical error count
128
+ # detail.K_minor — Minor error count
129
+ # detail.tuples — list of {cat, anat, severity, ref_seg_idx, cand_seg_idx}
130
+ # detail.category_counts — per-category breakdown
131
+ # detail.anatomy_counts — per-anatomy breakdown
132
  ```
133
 
134
+ A self-contained HF `from_pretrained` loader is on the roadmap. Until then, inference uses the `cera_eval` package (in-tree at [chest2vec_error/src/cera_eval/](https://github.com/...)).
135
 
136
  ## Output schema
137
 
138
+ The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_w)` as above). The score is backed by a sequence of structured error tuples; each generated tuple is:
139
 
140
  ```python
141
  {
 
148
  }
149
  ```
150
 
151
+ `cat == 0` is the EOS marker; the model stops when it emits it, so `K_total = len(tuples) − 1`. Counting over the non-EOS tuples only, `K_critical` is the number with `severity == 1`, `K_minor` is the number with `severity == 0`, and `score = exp(−(K_critical + 0.25 × K_minor))`.
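The bookkeeping above can be sketched in a few lines. This is an illustration under stated assumptions, not the package's implementation: the function name `summarize_tuples` is hypothetical, tuples are assumed to be plain dicts with the keys listed in the schema, and the example values (category and anatomy ids, segment indices) are placeholders chosen only to exercise the arithmetic.

```python
import math

EOS_CAT = 0    # cat == 0 marks the end of the sequence
CRITICAL = 1   # severity encoding as stated in the text
MINOR = 0


def summarize_tuples(tuples):
    """Hypothetical helper: derive counts and score from decoded tuples."""
    # Drop the EOS marker so it is not counted as a Minor error.
    errors = [t for t in tuples if t["cat"] != EOS_CAT]
    k_critical = sum(1 for t in errors if t["severity"] == CRITICAL)
    k_minor = sum(1 for t in errors if t["severity"] == MINOR)
    return {
        "K_total": len(errors),
        "K_critical": k_critical,
        "K_minor": k_minor,
        "score": math.exp(-(k_critical + 0.25 * k_minor)),
    }


# Placeholder sequence: one Critical error, one Minor error, then EOS.
example = [
    {"cat": 2, "anat": 1, "severity": 1, "ref_seg_idx": 0, "cand_seg_idx": 0},
    {"cat": 1, "anat": 4, "severity": 0, "ref_seg_idx": 1, "cand_seg_idx": 0},
    {"cat": 0, "anat": 0, "severity": 0, "ref_seg_idx": 0, "cand_seg_idx": 0},
]
print(summarize_tuples(example))
```

Filtering on `cat` before counting matters because an EOS tuple whose severity field happens to be 0 would otherwise inflate `K_minor`.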
152
 
153
  ## Training data
154