Add training-procedure section: GPT-4o-mini variant generation with anatomy + category labels
README.md
CHANGED
|
@@ -152,9 +152,33 @@ The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−
|
|
| 152 |
|
| 153 |
## Training data
|
| 154 |
|
| 155 |
-
Trained on `chest2vec/chest2err-train` (in preparation): 53,881 (reference, candidate)
|
| 156 |
|
| 157 |
-
| 158 |
|
| 159 |
## Limitations
|
| 160 |
|
|
|
|
| 152 |
|
| 153 |
## Training data
|
| 154 |
|
| 155 |
+
Trained on `chest2vec/chest2err-train` (in preparation): **53,881 (reference, candidate, labeled_errors)** triples spanning four candidate styles (V1-V4) plus a V5 high-error supplement. Validation uses the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (radiologist gold).
|
| 156 |
|
| 157 |
+
### Variant generation (LLM-injected errors)
|
| 158 |
+
|
| 159 |
+
Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus. For each reference report we prompted **GPT-4o-mini** to produce four candidate variants that **deliberately insert a controlled number of errors** drawn from the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) 6-category error taxonomy. The LLM was instructed to also output, for every inserted error, a structured label:
|
| 160 |
+
|
| 161 |
+
- **error category** (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison)
|
| 162 |
+
- **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
|
| 163 |
+
- **target finding concept** (leaf finding from the chest CT vocabulary)
|
| 164 |
+
|
| 165 |
+
Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model is trained to *reproduce* this structured error trace given only the (reference, candidate) input.
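
To make the record layout concrete, here is a minimal sketch of how one training example could be represented in Python. The class and field names (`ErrorLabel`, `TrainingExample`, `errors`) are hypothetical and not taken from the released dataset schema, the mapping of category numbers 1-6 to names assumes the order listed above, and the report text is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Category names in the order listed above; the 1-6 numbering is assumed
# to follow this order.
CATEGORIES = [
    "false_prediction", "omission", "location",
    "severity", "spurious_comparison", "omitted_comparison",
]
ANATOMY_SECTIONS = [
    "Lungs & Airways", "Pleura", "Mediastinum & Hila", "Cardiovascular",
    "Chest Wall", "Bones / Spine", "Upper Abdomen", "Lower Neck", "Others",
]


@dataclass
class ErrorLabel:
    """One inserted error: (category, anatomy, concept). Hypothetical schema."""
    category: int  # 1-6
    anatomy: str   # one of ANATOMY_SECTIONS
    concept: str   # leaf finding; kept as a free string in this sketch

    def __post_init__(self) -> None:
        if not 1 <= self.category <= len(CATEGORIES):
            raise ValueError(f"category must be 1-6, got {self.category}")
        if self.anatomy not in ANATOMY_SECTIONS:
            raise ValueError(f"unknown anatomy section: {self.anatomy!r}")

    @property
    def category_name(self) -> str:
        return CATEGORIES[self.category - 1]


@dataclass
class TrainingExample:
    """A (reference, candidate, per-error labels) record."""
    reference: str
    candidate: str
    errors: List[ErrorLabel] = field(default_factory=list)

    @property
    def num_errors(self) -> int:
        return len(self.errors)


# Invented example: the candidate drops a finding present in the reference.
example = TrainingExample(
    reference="Small right pleural effusion. No pneumothorax.",
    candidate="No pneumothorax.",
    errors=[ErrorLabel(category=2, anatomy="Pleura", concept="pleural effusion")],
)
```

Validating the category and anatomy values at construction time is a design choice of this sketch; it catches malformed LLM output before it reaches training.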
|
| 166 |
+
|
| 167 |
+
### Training objective
|
| 168 |
+
|
| 169 |
+
Supervised teacher-forced training on the LLM-labeled error sequences:
|
| 170 |
+
|
| 171 |
+
- **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
|
| 172 |
+
- **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
|
| 173 |
+
- **Severity loss** on the `severity` head (Critical / Minor; added on the radiologist-labeled validation subset)
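
The three loss terms above can be sketched as a sum of per-head cross-entropies at each decoder step. This is only an illustration under stated assumptions: the use of plain cross-entropy for every head, the equal default weights, and the function names (`cross_entropy`, `step_loss`, `sequence_loss`) are not confirmed by the training code. Heads without a target, such as `severity` on examples that lack a severity label, are skipped.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

TOKEN_HEADS = ("category", "anatomy", "concept")
POINTER_HEADS = ("ref_seg_idx", "cand_seg_idx")
ALL_HEADS = TOKEN_HEADS + POINTER_HEADS + ("severity",)


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Numerically stable softmax cross-entropy for a single target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def step_loss(
    logits: Dict[str, List[float]],
    targets: Dict[str, Optional[int]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of head losses for one decoder step (one error tuple)."""
    weights = weights or {}  # assumption: every head defaults to weight 1.0
    total = 0.0
    for head in ALL_HEADS:
        target = targets.get(head)
        if target is None:  # no supervision for this head on this example
            continue
        total += weights.get(head, 1.0) * cross_entropy(logits[head], target)
    return total


def sequence_loss(
    steps: Sequence[Tuple[Dict[str, List[float]], Dict[str, Optional[int]]]],
) -> float:
    """Mean step loss over a teacher-forced error sequence."""
    if not steps:
        return 0.0
    return sum(step_loss(lg, tg) for lg, tg in steps) / len(steps)
```
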
|
| 174 |
+
|
| 175 |
+
Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the chest2vec contrastive adapter; both adapters compose at inference).
|
| 176 |
+
|
| 177 |
+
### Why this works
|
| 178 |
+
|
| 179 |
+
- GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us a **noiseless error count K** at training time. Generation cost was modest (one batch of 4 variants per reference report).
|
| 180 |
+
- The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment** with τ_b vs Critical = +0.763.
|
| 181 |
+
- Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable**: every emitted error tuple cites its source sentences.
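
For readers who want to reproduce a rank correlation like the τ_b figure quoted above, the following is a minimal pure-Python sketch. It assumes τ_b denotes Kendall's tau-b (the tie-corrected variant), and the function name `kendall_tau_b` is illustrative; the actual evaluation script may compute the statistic differently.

```python
import math
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b via an O(n^2) pair count, with tie correction."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue          # tied in both: excluded from every count
            elif dx == 0:
                ties_x += 1       # tied only in x
            elif dy == 0:
                ties_y += 1       # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")
```

Tau-b is a natural choice when one of the two variables is a small integer count with many ties, since the denominator discounts tied pairs instead of treating them as disagreements.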
|
| 182 |
|
| 183 |
## Limitations
|
| 184 |
|