lukeingawesome committed on
Commit 33781fc (verified) · 1 Parent(s): b7103ad

Add training-procedure section: GPT-4o-mini variant generation with anatomy + category labels

Files changed (1)
  1. README.md +26 -2
README.md CHANGED
@@ -152,9 +152,33 @@ The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−
152
 
153
  ## Training data
154
 
155
- Trained on `chest2vec/chest2err-train` (in preparation): 53,881 (reference, candidate) pairs across 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (audited radiologist gold).
156
 
157
- The reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus; candidate variants and seeded errors were generated by an LLM following the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) error taxonomy.
158
 
159
  ## Limitations
160
 
 
152
 
153
  ## Training data
154
 
155
+ Trained on `chest2vec/chest2err-train` (in preparation): **53,881 (reference, candidate, labeled_errors)** triples spanning 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (radiologist gold).
156
 
157
+ ### Variant generation (LLM-injected errors)
158
+
159
+ Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus. For each reference report we prompted **GPT-4o-mini** to produce four candidate variants that **deliberately insert a controlled number of errors** drawn from the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) 6-category error taxonomy. The LLM was instructed to also output, for every inserted error, a structured label:
160
+
161
+ - **error category** (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison)
162
+ - **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
163
+ - **target finding concept** (leaf finding from the chest CT vocabulary)
164
+
165
+ Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model is trained to *reproduce* this structured error trace given only the (reference, candidate) input.
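As a rough illustration of the record layout described above, one training example could be modeled as follows. This is a minimal sketch, not the released schema: the class and field names (`ErrorLabel`, `TrainingExample`, `num_errors`) are hypothetical, and mapping category numbers 1 to 6 onto the names in the order they are listed is an assumption.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class ErrorCategory(IntEnum):
    # Assumed numbering: the order in which the README lists the categories.
    FALSE_PREDICTION = 1
    OMISSION = 2
    LOCATION = 3
    SEVERITY = 4
    SPURIOUS_COMPARISON = 5
    OMITTED_COMPARISON = 6


# The nine anatomy sections named in the README.
ANATOMY_SECTIONS = (
    "Lungs & Airways", "Pleura", "Mediastinum & Hila", "Cardiovascular",
    "Chest Wall", "Bones / Spine", "Upper Abdomen", "Lower Neck", "Others",
)


@dataclass(frozen=True)
class ErrorLabel:
    """One (category, anatomy, concept) triple for a single injected error."""
    category: ErrorCategory
    anatomy: str
    concept: str  # leaf finding from the chest CT vocabulary

    def __post_init__(self):
        # Reject anatomy sections outside the fixed list.
        if self.anatomy not in ANATOMY_SECTIONS:
            raise ValueError(f"unknown anatomy section: {self.anatomy!r}")


@dataclass
class TrainingExample:
    """A (reference, candidate, labeled_errors) record (hypothetical layout)."""
    reference: str
    candidate: str
    errors: List[ErrorLabel] = field(default_factory=list)

    @property
    def num_errors(self) -> int:
        # Number of injected errors in this candidate.
        return len(self.errors)


# Made-up example text and concept, for illustration only.
example = TrainingExample(
    reference="No pleural effusion.",
    candidate="Small right pleural effusion.",
    errors=[ErrorLabel(ErrorCategory.FALSE_PREDICTION, "Pleura", "pleural effusion")],
)
```

Keeping the per-error labels as a list makes the error count fall out of the record itself rather than needing a separate field.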
166
+
167
+ ### Training objective
168
+
169
+ Supervised teacher-forced training on the LLM-labeled error sequences:
170
+
171
+ - **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
172
+ - **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
173
+ - **Severity loss** on the `severity` head (Critical / Minor — added on the radiologist-labeled validation subset)
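The three loss groups above can be sketched as a sum of per-head cross-entropies at each decoder step. This is only an illustration under stated assumptions: the README does not give the loss weights, the reduction over steps, or how pointers are parameterized, so equal weights, a mean over steps, and a softmax over sentence indices are all assumed here, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def step_loss(
    head_logits: Dict[str, Sequence[float]],
    targets: Dict[str, int],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of cross-entropies over the heads that have a target.

    Heads without a target at this step (for example `severity` when no
    severity label exists) are simply skipped.
    """
    weights = weights or {}
    return sum(
        weights.get(name, 1.0) * cross_entropy(head_logits[name], target)
        for name, target in targets.items()
    )


def sequence_loss(steps: List[Tuple[Dict[str, Sequence[float]], Dict[str, int]]]) -> float:
    """Mean step loss over a teacher-forced error sequence (assumed reduction)."""
    return sum(step_loss(logits, targets) for logits, targets in steps) / len(steps)


# One decoder step with uniform logits; head sizes here are illustrative.
logits = {
    "category": [0.0] * 6,      # token head
    "anatomy": [0.0] * 9,       # token head
    "ref_seg_idx": [0.0] * 4,   # pointer over reference sentences
    "cand_seg_idx": [0.0] * 4,  # pointer over candidate sentences
}
targets = {"category": 1, "anatomy": 0, "ref_seg_idx": 2, "cand_seg_idx": 3}
loss = sequence_loss([(logits, targets)])
```

With uniform logits each head contributes the log of its class count, which gives a quick sanity check on the implementation.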
174
+
175
+ Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the chest2vec contrastive adapter; both adapters compose at inference).
176
+
177
+ ### Why this works
178
+
179
+ - GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us a **noiseless error count K** per candidate at training time. Generation cost was modest (one batch of 4 variants per reference report).
180
+ - The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment**, with Kendall's τ_b vs Critical = +0.763.
181
+ - Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable**: every emitted error tuple cites its source sentences.
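For readers unfamiliar with the τ_b figure quoted above, Kendall's τ_b is a rank correlation that corrects for ties, which matters when the gold labels are small integer counts. The sketch below is a plain O(n²) reference implementation, not the evaluation code used for the benchmark; the function name is hypothetical, and the README does not state the sign convention or exactly which two quantities were paired.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)).

    C and D count concordant and discordant pairs, n0 is the total number
    of pairs, and n1 / n2 count the pairs tied in x / y respectively.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n0 = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n0 - tied_x) * (n0 - tied_y))
    if denom == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denom


# Made-up values, for illustration only.
predicted = [1, 2, 3, 4]
gold = [0, 0, 1, 1]
tau = kendall_tau_b(predicted, gold)
```

If SciPy is available, `scipy.stats.kendalltau` computes the same tau-b variant by default and is the better choice for large samples.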
182
 
183
  ## Limitations
184