chest2vec
/

chest2err

@@ -152,9 +152,33 @@ The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−
 ## Training data
-Trained on `chest2vec/chest2err-train` (in preparation): 53,881 (reference, candidate) pairs across 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (audited radiologist gold).
-The reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus; candidate variants and seeded errors were generated by an LLM following the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) error taxonomy.
 ## Limitations

 ## Training data
+Trained on `chest2vec/chest2err-train` (in preparation): **53,881 (reference, candidate, labeled_errors)** triples spanning 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (radiologist gold).
+### Variant generation (LLM-injected errors)
+Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) chest CT corpus. For each reference report we prompted **GPT-4o-mini** to produce four candidate variants that **deliberately insert a controlled number of errors** drawn from the [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) 6-category error taxonomy. The LLM was also instructed to output a structured label for every inserted error:
+- **error category** (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison)
+- **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
+- **target finding concept** (leaf finding from the chest CT vocabulary)
+Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model is trained to *reproduce* this structured error trace given only the (reference, candidate) input.
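As a rough illustration of the record layout described above, here is a minimal Python sketch. The class and field names (`ErrorLabel`, `TrainingRecord`, `labeled_errors`) and the example report text are hypothetical, and the mapping of category indices 1-6 onto the names in listed order is an assumption; the actual `chest2vec/chest2err-train` schema may differ.

```python
from dataclasses import dataclass, field
from typing import List

# Category names as listed in the card; mapping index 1..6 onto this
# order is an assumption made for illustration.
CATEGORIES = [
    "false_prediction", "omission", "location",
    "severity", "spurious_comparison", "omitted_comparison",
]

# Anatomy sections as listed in the card.
ANATOMY_SECTIONS = [
    "Lungs & Airways", "Pleura", "Mediastinum & Hila", "Cardiovascular",
    "Chest Wall", "Bones / Spine", "Upper Abdomen", "Lower Neck", "Others",
]


@dataclass(frozen=True)
class ErrorLabel:
    """One injected error: (category, anatomy, concept). Hypothetical schema."""
    category: int   # 1..6
    anatomy: str    # one of ANATOMY_SECTIONS
    concept: str    # leaf finding from the chest CT vocabulary

    def __post_init__(self):
        if not 1 <= self.category <= len(CATEGORIES):
            raise ValueError(f"category must be 1..{len(CATEGORIES)}")
        if self.anatomy not in ANATOMY_SECTIONS:
            raise ValueError(f"unknown anatomy section: {self.anatomy!r}")

    @property
    def category_name(self) -> str:
        return CATEGORIES[self.category - 1]


@dataclass
class TrainingRecord:
    """A (reference, candidate, labeled_errors) triple. Hypothetical schema."""
    reference: str
    candidate: str
    labeled_errors: List[ErrorLabel] = field(default_factory=list)

    @property
    def k(self) -> int:
        """Number of injected errors in the candidate."""
        return len(self.labeled_errors)


# Toy example with made-up report text.
record = TrainingRecord(
    reference="No pleural effusion.",
    candidate="Small right pleural effusion.",
    labeled_errors=[ErrorLabel(1, "Pleura", "pleural effusion")],
)
```

Validating the category and anatomy fields at construction time is one way to catch malformed LLM output before it reaches training; whether the authors filter this way is not stated.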
+### Training objective
+Supervised teacher-forced training on the LLM-labeled error sequences:
+- **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
+- **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
+- **Severity loss** on the `severity` head (Critical / Minor — added on the radiologist-labeled validation subset)
+Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the chest2vec contrastive adapter; both adapters compose at inference).
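The loss terms listed above can be sketched as a sum of cross-entropies over the heads at each decoder step. This is a minimal pure-Python illustration, not the actual training code: the equal default weights, the dictionary layout, and the averaging over steps are assumptions, and the `severity` term is simply skipped when a step has no severity target.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-softmax of the target class (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def step_loss(
    logits: Dict[str, Sequence[float]],
    targets: Dict[str, int],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-head cross-entropies for one decoder step.

    Heads named in the card: category, anatomy, concept (token heads),
    ref_seg_idx, cand_seg_idx (pointer heads), severity (optional).
    Only heads present in `targets` contribute, so steps without a
    severity label contribute no severity loss. Weights default to 1.0
    (an assumption).
    """
    weights = weights or {}
    return sum(
        weights.get(head, 1.0) * cross_entropy(logits[head], target)
        for head, target in targets.items()
    )


Step = Tuple[Dict[str, Sequence[float]], Dict[str, int]]


def sequence_loss(steps: List[Step]) -> float:
    """Mean step loss over one error sequence.

    Under teacher forcing, the logits at step t are assumed to have been
    computed with the gold labels of steps < t as decoder input.
    """
    if not steps:
        return 0.0
    return sum(step_loss(lg, tg) for lg, tg in steps) / len(steps)
```

With uniform logits the loss for a head reduces to the log of its class count, which makes a convenient sanity check before training.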
+### Why this works
+- GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us a **noiseless error count K** at training time. Generation cost was modest (one batch of 4 variants per reference report).
+- The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment**, with Kendall's τ_b vs Critical = +0.763.
+- Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable** — every emitted error tuple cites its source sentences.
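For readers unfamiliar with the τ_b statistic quoted above, here is a small pure-Python sketch of Kendall's τ_b, the tie-corrected rank correlation. Which two quantities the authors paired to obtain +0.763 is not spelled out here, so the function simply takes two equal-length lists; the function name is illustrative.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C and D count concordant and discordant pairs; Tx and Ty count
    pairs tied only in x and only in y. Pairs tied in both are ignored.
    O(n^2), fine for a few hundred items.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")
```

For larger evaluations, `scipy.stats.kendalltau` computes the same statistic (its default variant is tau-b) along with a p-value.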
 ## Limitations