| --- |
| license: cc-by-nc-sa-4.0 |
| tags: |
| - pathology |
| - computational-pathology |
| - whole-slide-image |
| - MICCAI |
| - REG2026 |
| library_name: pytorch |
| --- |
| |
| # REG2026 Interface-1 (Workflow Reasoning / Metric A) — Submission Container |
|
|
| `daipath/reg2026-submit` · self-contained Docker algorithm for the MICCAI **REG2026 (REG²)** |
| challenge, Interface-1. Given one anonymized whole-slide image it produces the |
| chain-of-thought + final pathology report that the official **Metric A** scorer expects. |
|
|
| > **Honest held-out Metric A (val 1019, organ predicted — no ground truth used): 0.9037** |
| > (5-TransMIL ensemble). Single-model variant 0.8950. GT-organ oracle upper bound 0.9062. |
|
|
| --- |
|
|
| ## 0. What is in this repo |
|
|
| | File | Size | What | |
| |---|---|---| |
| | `reg2026_submit.tar` | ~2.7 GB | the complete, self-contained submission container (code + all weights + sample WSI) | |
| | `README.md` | — | this document | |
|
|
| `tar -xf reg2026_submit.tar` gives the folder `reg2026_submit/`: |
|
|
| ``` |
| reg2026_submit/ |
| ├── Dockerfile requirements.txt # pytorch:2.9.1-cuda12.6 base + numpy/Pillow/tifffile/imagecodecs/zarr/opencv/timm |
| ├── inference.py core.py # official platform contract (unchanged); MODEL_PATH=/opt/ml/model |
| ├── build_debug.sh # one-shot: build image + run on the sample WSI + validate output |
| ├── batch_infer.py dryrun.py # offline (non-docker) batch / single-slide runners |
| ├── src/interf1/model.py # predict_chain_of_thought(wsi_path) — the whole pipeline |
| ├── src/reg/ {patching,uni2,mil,organ,cot,derive_heads,report_gen}.py |
| ├── model/ # copied to /opt/ml/model in the image |
| │ ├── uni2-h.bin # UNI2-h encoder weights (~2.7 GB) |
| │ ├── mil_transmil_s0..s4.pt # 5-seed TransMIL ensemble (~23 MB each) |
| │ ├── organ_clf.npz # organ classifier (numpy: scaler + logistic-regression) |
| │ ├── routing_smart.json # CoT routing rules (2375 entries) |
| │ ├── label_space.json # answer options per head |
| │ └── canonical_questions.json # normalized -> canonical question strings |
| └── test/input/interf1/... # the official debug WSI d021e460 + inputs.json |
| ``` |
|
|
| --- |
|
|
| ## 1. Can it be run directly? (Quickstart) |
|
|
| **Requirements on the target machine:** Docker daemon access + an NVIDIA GPU + |
| `nvidia-container-toolkit`; ~10 GB disk for the image; ≥12 GB GPU RAM. |
|
|
| ```bash |
| # download (private repo -> needs your HF token) |
| hf download daipath/reg2026-submit reg2026_submit.tar --repo-type model --local-dir . --token <HF_TOKEN> |
| tar -xf reg2026_submit.tar && cd reg2026_submit |
| |
| # build + debug on the sample slide in one shot |
| bash build_debug.sh |
| # -> builds image `reg2026_algorithm`, runs it on test/input/interf1, prints a validated summary |
| ``` |
|
|
| Manual run on any case folder (must contain `images/whole-slide-image/<uid>.tiff`): |
|
|
| ```bash |
| docker run --rm --gpus all --platform=linux/amd64 \ |
| -v "$PWD/test/input/interf1:/input:ro" -v "$PWD/test/output:/output" reg2026_algorithm |
| cat test/output/chain-of-thought.json |
| ``` |
|
|
| Expected on the debug slide: ~43-step CoT, `organ=Breast`, exactly one terminal step, a |
| final-report step reading `Breast, core needle biopsy; 1. Invasive carcinoma NST grade I ...`. |
|
|
| **No-Docker smoke test** (e.g. on the training server, conda env with torch+timm+zarr): |
| `REG_MODEL_PATH=./model python dryrun.py <some>.tiff`, or batch a folder with |
| `batch_infer.py --indir <dir> --outdir <dir> --ckpts model/mil_transmil_s0.pt,...`. |
|
|
| --- |
|
|
| ## 2. Model / pipeline |
|
|
| `predict_chain_of_thought(wsi_path)` in `src/interf1/model.py`: |
|
|
| ``` |
| WSI .tiff (single-level tiled JPEG, 20x) |
| └─ patch_wsi_path : tissue segmentation (HSV-saturation Otsu) + 256-px grid @ tissue≥0.25, |
| │ read straight from the tiled tiff via tifffile+zarr (memory-bounded; a |
| │ multi-GB level-0 array would OOM under a full imread). Cap to ≤8192 |
| │ patches (RandomState(0)) before reading pixels. |
| └─ UNI2-h : 1536-d feature per patch (timm vit_giant_patch14_224, SwiGLU, fp16) |
| └─ organ classifier: mean|max|std pool (4608-d) -> StandardScaler -> logistic regression |
| │ -> coarse organ (7-way). [val acc 0.986] |
| └─ MIL ensemble : 5× TransMIL multi-head, FiLM-conditioned on the predicted organ, |
| │ softmax-averaged -> 85 head answers |
| └─ derive heads : rule-based (Gleason->grade group, Nottingham->overall grade, ...) |
| └─ report_gen : rule-based structured pathology report from the answered heads |
| └─ assemble_edges : per-organ CoT routing (routing_smart.json) -> canonical Q/A/next steps, |
| final-report node carries the report, last next_question = "" |
| ``` |
|
|
| ### The organ-head fix (important) |
|
|
| The MIL was trained with the **ground-truth organ fed via FiLM**, so its `"What is the organ?"` |
| head learned to *echo* the FiLM input rather than read the tissue. Queried at inference with a |
| placeholder organ it returns the same class for every slide (measured: 1019/1019 "breast", |
| acc 0.215). Our offline 0.90+ numbers had silently relied on the GT organ for both FiLM and |
| routing. The container therefore predicts the organ with a **separate logistic-regression |
| classifier** on pooled UNI2 features (no FiLM, val acc **0.986**) and feeds that prediction to |
| FiLM + routing + the organ answer. Net effect on Metric A vs the GT-organ oracle: only **−0.0025**. |
|
|
| ### Metric A (official scorer, semantic_backend=lexical) |
| |
| Metric A = `0.05·BPV + 0.30·Edge-F1 + 0.25·MESS + 0.40·FinalReport`. |
| |
| | config (cap 8192) | organ | Metric A | BPV / Edge-F1 / MESS / Report | |
| |---|---|---|---| |
| | 5-TransMIL ensemble | GT (oracle) | 0.9062 | 0.779 / 0.980 / 0.930 / 0.852 | |
| | **5-TransMIL ensemble** (default) | **predicted** | **0.9037** | 0.773 / 0.977 / 0.926 / 0.851 | |
| | single TransMIL | predicted | 0.8950 | 0.753 / 0.974 / 0.920 / 0.838 | |
| |
| Measured on the held-out 10% val split (1019 slides, all with GT CoT + report). On the |
| real **Test Phase 1** set (350 slides, no GT) the pipeline runs cleanly: 350/350 outputs valid, organ |
| distribution sensible (87 cervix ≈ the 70 uterus slides + remainder spread evenly), and reports |
| organ-appropriate. |
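As a quick sanity check, the weighted sum can be recomputed from the component columns of the table above. This is only an arithmetic sketch, not the official scorer: the function and dictionary names are made up for illustration, and because the table rounds each component to three decimals, the recomputed totals agree with the reported ones only to about three decimals.

```python
# Weights copied from the Metric A formula above; they sum to 1.0.
WEIGHTS = {"bpv": 0.05, "edge_f1": 0.30, "mess": 0.25, "final_report": 0.40}


def metric_a(bpv, edge_f1, mess, final_report):
    """Weighted sum of the four component scores (each expected in [0, 1])."""
    parts = {"bpv": bpv, "edge_f1": edge_f1, "mess": mess, "final_report": final_report}
    return sum(WEIGHTS[name] * value for name, value in parts.items())


# Component values (BPV, Edge-F1, MESS, Report) as listed in the table rows.
rows = {
    "ensemble, GT organ": (0.779, 0.980, 0.930, 0.852),
    "ensemble, predicted organ": (0.773, 0.977, 0.926, 0.851),
    "single model, predicted organ": (0.753, 0.974, 0.920, 0.838),
}

for name, components in rows.items():
    print(f"{name}: {metric_a(*components):.4f}")
```

Since the report term carries 40% of the weight, a 0.01 change in FinalReport moves Metric A by 0.004, while the same change in BPV moves it by only 0.0005.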
| |
| --- |
| |
| ## 3. How it was trained |
| |
| **Data.** REG2026 `train_CoT.json` = 11220 annotated cases, each with a full chain-of-thought |
| (median 16 steps); 99.9% of them also have a final pathology report. Split (`split3.json`): |
| train 8176 / val 1019 / test 1019. Organs are 7 coarse classes |
| (breast, colon, stomach, prostate, bladder, lung, cervix). |
|
|
| **1) Patching** (`wsi_patching.py`). 256-px tiles at 20x; tissue mask via HSV-saturation |
| Otsu + morphology + min-area; keep tiles with tissue fraction ≥ 0.25. Training read tiles |
| lazily with OpenSlide/pyvips `read_region` (memory-bounded). The container reproduces the |
| identical patch set via tifffile+zarr (verified: same coordinates and pixels). |
|
|
| **2) Feature extraction** (`feature_extract.py`). Each patch → **UNI2-h** (timm |
| `vit_giant_patch14_224`, SwiGLUPacked, SiLU, 8 register tokens, embed-dim 1536), pipeline |
| `/255 → resize 224 bilinear → ImageNet norm → fp16`. Per slide the patches are capped to 8192 |
| with `np.random.RandomState(0).choice` (`pack_cap.py`), stored as `capped_8192_uni2.h5` |
| (per-slide [N≤8192, 1536]) plus `pooled_8192_uni2.npz` (mean|max|std → 4608-d). |
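The capping and pooling steps can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the contents of `pack_cap.py`: the helper names are hypothetical, and sampling without replacement followed by sorting the indices is a guess at how `RandomState(0).choice` is used.

```python
import numpy as np


def cap_indices(n_patches, cap=8192, seed=0):
    """Pick at most `cap` patch indices with a fixed-seed RandomState."""
    if n_patches <= cap:
        return np.arange(n_patches)
    rng = np.random.RandomState(seed)
    # Assumption: sample without replacement, then restore the original order.
    return np.sort(rng.choice(n_patches, size=cap, replace=False))


def pool_features(feats):
    """Concatenate per-dimension mean, max and std: [N, D] -> [3 * D]."""
    return np.concatenate([feats.mean(axis=0), feats.max(axis=0), feats.std(axis=0)])


# Random stand-in for the UNI2-h features of one slide with 10000 patches.
feats = np.random.RandomState(1).rand(10000, 1536).astype(np.float32)
keep = cap_indices(len(feats))
pooled = pool_features(feats[keep])
print(keep.shape, pooled.shape)  # (8192,) (4608,)
```

Fixing the seed makes the selected subset a deterministic function of the patch count, which is what lets an inference container reproduce the subset used at training time.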
|
|
| **3) Multi-head MIL** (`train_mil.py`, `MultiHeadMIL`). `proj` Linear(1536→512)+ReLU+Dropout(0.25) |
| → aggregator → `emb` → FiLM organ conditioning → 85 per-question linear heads. |
| - Aggregator: **TransMIL** (transformer pooling with internal patch cap `agg_cap=8192`). |
| ABMIL (gated attention) / CLAM / MambaMIL also implemented. |
| - **FiLM**: `Embedding(7 organs, 512)` → γ, β → `emb·(1+γ)+β`. Conditioned on the GT organ |
| during training (helps the 84 non-organ heads share an organ-specific vocabulary). |
| - Loss: focal cross-entropy with per-class weights, summed **only over the heads applicable to |
| that slide** (the questions its CoT actually asked). CLAM adds an instance-clustering term with weight 0.3. |
| - Optim: AdamW `lr=1e-4`, `weight_decay=1e-4`, CosineAnnealingLR, 40 epochs, early-stop |
| patience 8, best checkpoint by weighted val accuracy. |
| - **5 seeds** (s0–s4) → softmax-averaged ensemble at inference. |
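The softmax-averaged ensemble in the last bullet amounts to averaging per-model class probabilities for each head and taking the argmax. A minimal pure-Python sketch for a single head follows; the function names are hypothetical and the logits are invented.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one head's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits):
    """Average softmax probabilities over models; return (argmax index, mean probs)."""
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    mean = [sum(p[k] for p in probs) / n_models for k in range(n_classes)]
    best = max(range(n_classes), key=mean.__getitem__)
    return best, mean


# Three hypothetical models voting on one head with three answer options.
idx, mean_probs = ensemble_predict([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [3.0, 0.0, 0.0]])
print(idx, [round(p, 3) for p in mean_probs])
```

Averaging probabilities rather than raw logits keeps one overconfident seed from dominating, since each model contributes a distribution that sums to one.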
|
|
| **4) Organ classifier** (`organ.py` / `organ_clf.npz`). The deployed organ predictor is a |
| multinomial logistic regression on the 4608-d pooled features (StandardScaler + LR), trained |
| on train (val held out → **0.986** val accuracy); the shipped artifact is refit on train+val. |
| Exported as pure numpy (`mean, scale, coef, intercept, classes`) — no sklearn at inference. |
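Inference with the exported arrays needs only a standardization, a matrix product and a softmax. The sketch below assumes the array names listed above and a scikit-learn-style `coef` of shape `[n_classes, n_features]`; the function name and the toy values are illustrative, and the actual code in `organ.py` may differ.

```python
import numpy as np


def predict_organ(pooled, clf):
    """StandardScaler followed by multinomial logistic regression, NumPy only."""
    z = (pooled - clf["mean"]) / clf["scale"]      # standardize the pooled features
    logits = clf["coef"] @ z + clf["intercept"]    # one logit per organ class
    probs = np.exp(logits - logits.max())          # stable softmax
    probs /= probs.sum()
    return str(clf["classes"][int(probs.argmax())]), probs


# Toy stand-in for organ_clf.npz (the real one takes 4608-d input, 7 classes).
clf = {
    "mean": np.zeros(4),
    "scale": np.ones(4),
    "coef": np.eye(3, 4),
    "intercept": np.zeros(3),
    "classes": np.array(["breast", "colon", "lung"]),
}
organ, probs = predict_organ(np.array([0.1, 2.0, 0.3, 0.0]), clf)
print(organ)  # colon
```

With the shipped artifact, `clf` would presumably come from `np.load` on `organ_clf.npz`, which returns a dict-like object keyed by array name.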
|
|
| **5) Rule-based stages** (no learning). |
| - `derive_heads.py`: derives 6 heads from predicted answers (Gleason pattern → grade group, |
| Nottingham sub-scores → overall grade, differentiation, number-of-diagnoses, etc.). |
| - `report_gen.py`: assembles a structured pathology report from the answered heads. |
| - `assemble_edges` + `routing_smart.json`: per-organ deterministic CoT graph (2375 rules = |
| base keys + on-demand context keys + ambiguous keys); pure-rule path validity is 1.0. |
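Two of the derivations named above follow widely used clinical conventions, sketched here for orientation. The function names are hypothetical, and the exact answer strings and edge-case handling in `derive_heads.py` are not shown in this document, so treat this as the standard ISUP and Nottingham mappings rather than the repo's code.

```python
def gleason_grade_group(primary, secondary):
    """ISUP grade group from primary and secondary Gleason patterns (each 3-5)."""
    score = primary + secondary
    if score <= 6:
        return 1
    if score == 7:
        # 3+4 and 4+3 share a score but fall in different grade groups.
        return 2 if primary == 3 else 3
    if score == 8:
        return 4
    return 5  # scores 9 and 10


def nottingham_grade(tubule, pleomorphism, mitosis):
    """Overall Nottingham grade from the three sub-scores (each 1-3)."""
    total = tubule + pleomorphism + mitosis
    if total <= 5:
        return 1
    if total <= 7:
        return 2
    return 3  # totals 8 and 9


print(gleason_grade_group(3, 4), gleason_grade_group(4, 3))  # 2 3
print(nottingham_grade(1, 2, 1), nottingham_grade(3, 3, 2))  # 1 3
```

Deriving these heads by rule instead of predicting them independently keeps them consistent with the sub-scores they depend on.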
|
|
| --- |
|
|
| ## 4. Variants & notes |
|
|
| - **Default = 5-TransMIL ensemble** (`MIL_CKPTS` in `src/interf1/model.py`). For a lighter, |
| faster single-model build set `MIL_CKPTS = ["mil_transmil_s0.pt"]` (Metric A 0.8950). |
| - Large slides: the biggest Test-Phase-1 slide decompresses to ~104 GB; patching is fully |
| memory-bounded and multi-threaded (peak RSS ~2.3 GB on a 12.9 GB slide), so the container |
| should not OOM even on the largest slides. |
| - The platform passes one slide per run at `/input/images/whole-slide-image/<uid>.tiff` and |
| expects `/output/chain-of-thought.json` (a bare JSON array of {question, answer, |
| next_question}, last next_question = "", canonical question strings, real newlines). |
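The output contract in the last bullet can be checked locally before submitting. The validator below is a minimal sketch, not the platform's own check: the function name is hypothetical, requiring exactly those three keys with string values is an assumption, and the bracketed question strings in the example are placeholders rather than real canonical questions.

```python
import json

REQUIRED_KEYS = {"question", "answer", "next_question"}


def validate_cot(steps):
    """Return a list of problems; an empty list means the structure looks valid."""
    if not isinstance(steps, list) or not steps:
        return ["top level must be a non-empty JSON array"]
    problems = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict) or set(step) != REQUIRED_KEYS:
            problems.append(f"step {i}: expected exactly the keys {sorted(REQUIRED_KEYS)}")
        elif not all(isinstance(v, str) for v in step.values()):
            problems.append(f"step {i}: all values must be strings")
    # Exactly one terminal step (empty next_question), and it must be the last one.
    terminal = [i for i, s in enumerate(steps)
                if isinstance(s, dict) and s.get("next_question") == ""]
    if terminal != [len(steps) - 1]:
        problems.append("exactly one step may have an empty next_question, and it must be last")
    return problems


example = [
    {"question": "What is the organ?", "answer": "Breast",
     "next_question": "<next canonical question>"},
    {"question": "<final-report question>", "answer": "Breast, core needle biopsy; ...",
     "next_question": ""},
]
print(validate_cot(json.loads(json.dumps(example))))  # []
```

Running it on the parsed contents of `test/output/chain-of-thought.json` after the Docker run should give an empty list if the file follows the format described above.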
|
|
| ## 5. Submitting to grand-challenge |
|
|
| Test Phase 1 ground truth is not public, so Metric A on those 350 slides is obtained only by |
| submitting this container; the held-out-val 0.9037 above is the realistic estimate. To submit, |
| build the image, then `bash do_save.sh` to produce the upload archive (or push the image per |
| the challenge instructions). |
|
|
| --- |
| License of underlying data: CC-BY-NC-SA (REG2026). UNI2-h weights are governed by their own |
| MahmoodLab license. |
|
|