---
license: mit
pipeline_tag: other
tags:
- 3d-scene-completion
- diffusion
- point-cloud
- lidar
- depth-anything
- semantickitti
- scaffold-dominant
library_name: pytorch
---
# ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion
Single-step, scaffold-dominant diffusion for 3D scene completion on SemanticKITTI. Released alongside two paper submissions: an SMC 2026 short paper on the multi-token Gaussian VAE and ICP-refined ground truth (v2 GT), and an RA-L 2026 journal paper on scaffold dominance and a 45-minute mixed-scaffold deployment fine-tune. The pipeline runs at 209 ms/frame (4.78 FPS) on a single RTX 4090 — 143× faster than LiDiff and 25× faster than ScoreLiDAR — and the fine-tuned teacher beats both by 69–72 % in squared Chamfer distance under a paired, matched-protocol scaffold-free comparison across 6 fine-tune seeds.
> Note on the previous version of this card. The earlier README framed this work around "modality-agnostic encoder" and "teacher-vs-student distillation," with student CDs reported on a different metric protocol. Those numbers and that framing are superseded by the two paper drafts that this card now mirrors. See [What changed vs. the old card](#what-changed-vs-the-old-card) at the bottom.
---
## Two-paper context
### SMC 2026 (submitted Apr 20, 2026) — *PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain*
The companion short paper covers:
- **Multi-token Gaussian VAE** (7.1 M params): residual PointNet encoder + 32-query cross-attention pooler producing 32 × 1024-d Gaussian latent tokens; transformer decoder with 5 cross-attention blocks reconstructs 8,000 scene points. Squared CD **0.120 ± 0.026 m²** at **1.6 ms / frame**, no codebook collapse (vs ~16 m² for the VQ-VAE alternative).
- **Anchor-based ICP ground-truth refinement (v2 GT)**: per-scan ICP against a w=5 temporal window with displacement-gated acceptance, SOR/ROR cleanup, 0.05 m voxelisation. Yields ~50× denser GT over 23,201 frames.
- **Single-step x₀ scaffold-free diffusion teacher** evaluated under a matched-protocol scaffold-free comparison vs LiDiff and ScoreLiDAR on the same 50 frames / same v2 GT / same bbox ±1 m crop.
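Two pieces of the v2 GT pipeline are easy to illustrate in isolation: the displacement gate that decides whether an ICP alignment is accepted, and the 0.05 m voxelisation. This is a minimal NumPy sketch; the gate threshold `max_disp` is a hypothetical placeholder (the exact value is specified by Algorithm 1 in the SMC draft), and the real pipeline also applies SOR/ROR cleanup.

```python
import numpy as np

def displacement_gate(T: np.ndarray, max_disp: float = 0.5) -> bool:
    """Accept an ICP alignment only if its translation magnitude is small.
    T is a 4x4 homogeneous transform; max_disp (m) is a placeholder value."""
    return float(np.linalg.norm(T[:3, 3])) <= max_disp

def voxel_downsample(points: np.ndarray, voxel: float = 0.05) -> np.ndarray:
    """0.05 m voxelisation: keep one representative point per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]
```

Each accepted, gated scan in the w=5 window is merged and then voxelised, which is how the accumulated GT ends up ~50× denser without duplicate-point bloat.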
### RA-L 2026 (submitted May 2026) — *ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion*
The journal paper extends the analysis with:
- **Scaffold-dominance ablations** (six controlled runs): zeroing the 108 M PTv3 encoder costs only +58 % CD²; cutting input density from 20 K to 2 K points costs only +2 %; but replacing the GT scaffold with random noise or a regular voxel grid degrades CD² by ≥4 orders of magnitude (>50,000× and ~43,000× worse, respectively).
- **Random-init PTv3 control**: removing the 108 M parameters of self-supervised pretraining does not open a LiDAR↔DA2 gap (CD² 0.0279 vs 0.0279 m²). The frozen encoder is therefore not the load-bearing component — the GT coordinate scaffold is.
- **45-minute mixed-scaffold fine-tune**: 3-epoch fine-tune (lr 5e-5, 30 % LIDAR-cropped scaffold mix, t ∈ [50, 400], encoder frozen) that transfers the same denoiser to a deployment-realistic scaffold (per-frame LiDAR sweep, ego-bbox crop, **no GT access**) at squared CD **0.968 ± 0.194 m²** across 6 fine-tune seeds on a paired 50-frame protocol — **69–72 % below LiDiff / ScoreLiDAR**, 13× over the pre-FT baseline.
- **Cross-sequence in-distribution evaluation** on seqs 00 / 05 / 08 (range 6.7 % in CD²), DDIM multi-step sweep (single-step at t=200 is optimal), 500-frame scaffold-quality jitter sweep, iterative self-scaffolding drift study, and an N-seed ensemble that confirms the LiDAR↔DA2 gap stays within sampling noise (max |ΔCD²| ≤ 7×10⁻⁴, paired Wilcoxon p > 0.75) at every N ∈ {1, 2, 4, 8}.
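The fine-tune recipe above (lr 5e-5, 3 epochs, 30 % LiDAR-scaffold mix, t ∈ [50, 400], encoder frozen) can be sketched as a single training step. This is an illustrative skeleton under the standard ε-prediction objective, not the repo's `finetune_mixed_scaffold.py`; the function and argument names are hypothetical.

```python
import math
import random
import torch

def finetune_step(denoiser, optimizer, gt_points, lidar_points, alpha_bar):
    """One mixed-scaffold fine-tune step (encoder assumed frozen elsewhere)."""
    # 30% of steps use the deployment-style LiDAR-cropped scaffold,
    # 70% keep the GT-coordinate scaffold from pre-training.
    scaffold = lidar_points if random.random() < 0.30 else gt_points
    t = random.randint(50, 400)                 # restricted timestep range
    abar = alpha_bar[t]
    eps = torch.randn_like(scaffold)
    x_t = math.sqrt(abar) * scaffold + math.sqrt(1 - abar) * eps
    loss = torch.nn.functional.mse_loss(denoiser(x_t, t), eps)  # eps-prediction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With an `Adam`-style optimizer at lr 5e-5 and the small 8.9 M denoiser, 3 epochs of this loop fit in the quoted 45 minutes on one RTX 4090.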
---
## Headline results
All numbers are squared symmetric Chamfer distance in m², matching the PVD / LiDiff convention. Linear CD in m is reported where it enables direct comparison.
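For concreteness, the metric under the PVD / LiDiff convention is the sum of the two directional means of squared nearest-neighbour distances. A minimal NumPy/SciPy sketch:

```python
import numpy as np
from scipy.spatial import cKDTree

def squared_chamfer(a: np.ndarray, b: np.ndarray) -> float:
    """Squared symmetric Chamfer distance (PVD/LiDiff convention), in m²:
    mean squared NN distance a->b plus mean squared NN distance b->a."""
    d_ab, _ = cKDTree(b).query(a)  # NN distance from each point of a into b
    d_ba, _ = cKDTree(a).query(b)
    return float(np.mean(d_ab**2) + np.mean(d_ba**2))
```

Linear CD replaces the squared distances with absolute distances, which is why the two columns in Table A are not simple squares of one another.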
### A) GT-scaffold full-validation, seq 08 (4,071 frames, 20 K subsample)
| Configuration | CD lin (m) | CD² (m²) | F@0.2 ↑ | H₉₅ (m) ↓ | Latency |
|--- |--- |--- |--- |--- |--- |
| Ours, **v1 GT** (standard accumulation)| 0.496 ± 0.05 | 0.396 ± 0.090 | 0.104 | 1.113 | 209 ms / 4.78 FPS |
| **Ours, v2 GT (ICP-refined)** | **0.138 ± 0.015** | **0.024 ± 0.005** | **0.844** | **0.285** | 209 ms / 4.78 FPS |
**Data-quality dominance**: 16.5× CD² gain from GT refinement alone, no model change.
### B) Matched-protocol scaffold-free comparison, v2 GT, paired stride-80 50 frames
| Method | Variant | CD² (m²) ↓ |
|--- |--- |--- |
| LiDiff (Nunes et al., CVPR 2024) | 50-step DDPM, diff | 3.41 ± 2.55 |
| LiDiff | + refine head | 3.50 ± 2.62 |
| ScoreLiDAR (Zhang et al., ICCV 2025) | 8-step, diff | 3.19 ± 2.59 |
| ScoreLiDAR | + refine head | 3.15 ± 2.60 |
| Ours (teacher, GT-scaffold OOD here) | single-step x₀ | 12.58 ± 8.14 |
| **Ours (teacher-FT, mixed scaffold, 6 seeds)** | **single-step x₀ (kdtree match)** | **0.968 ± 0.194** |
**69–72 % below LiDiff / ScoreLiDAR** (6-seed across-seed mean ± std, epoch-2 selection; per-seed kdtree CD² 0.73 / 1.03 / 0.75 / 1.12 / 0.97 / 1.21). Paired Wilcoxon p < 1e-12 (seed 42 reference). End-to-end latency is unchanged at 209 ms/frame: 143× faster than LiDiff (30 s/frame, 50 steps), 25× faster than ScoreLiDAR (5.37 s/frame, 8 steps). Non-diffusion LiNeXt (167 ms) is the only comparable-latency baseline.
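The 69–72 % figure follows directly from the table and can be checked in a few lines:

```python
ours = 0.968  # fine-tuned teacher, 6-seed mean CD² (m²)
baselines = {
    "LiDiff (50-step)": 3.41,
    "LiDiff + refine": 3.50,
    "ScoreLiDAR (8-step)": 3.19,
    "ScoreLiDAR + refine": 3.15,
}
for name, cd2 in baselines.items():
    print(f"{name}: {100 * (1 - ours / cd2):.1f}% lower CD²")
# improvements span roughly 69.3% (ScoreLiDAR + refine) to 72.3% (LiDiff + refine)
```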
### C) Cross-sequence in-distribution evaluation
| Sequence | n frames | CD² (m²) | F@0.2 | H₉₅ (m) |
|--- |--- |--- |--- |--- |
| Seq 00 | 500 | 0.024 ± 0.004 | 0.844 | 0.284 |
| Seq 05 | 500 | 0.026 ± 0.004 | 0.825 | 0.293 |
| Seq 08 | 4,071 | 0.024 ± 0.005 | 0.844 | 0.285 |
CD² range across the three sequences is 6.7 % of the smallest value — well below per-frame std.
### D) Modality-agnostic full-val (4,071 frames, v2 GT)
LiDAR vs DA2 (Depth Anything V2 monocular pseudo-LiDAR):
- CD lin: 0.1381 vs 0.1381 m
- CD²: 0.0241 vs 0.0241 m²
- JSD: 0.0337 vs 0.0337
- F@0.2: 0.8439 vs 0.8435
- H₉₅: 0.2854 vs 0.2863 m
- |ΔCD²| < 4×10⁻⁴ m²; Wilcoxon p > 0.75. Result also holds with a **randomly initialised** PTv3 encoder.
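The p > 0.75 figure comes from a paired, per-frame Wilcoxon signed-rank test on the two modalities' CD² arrays. A minimal sketch with synthetic stand-in data (in practice the arrays are loaded from the per-frame `all_metrics.json` archives for the LiDAR and DA2 runs):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Stand-in per-frame CD² arrays mimicking a near-identical paired modality run.
cd2_lidar = np.abs(rng.normal(0.0241, 0.005, size=4071))
cd2_da2 = cd2_lidar + rng.normal(0.0, 1e-4, size=4071)

stat, p = wilcoxon(cd2_lidar, cd2_da2)  # paired, two-sided signed-rank test
print(f"max |dCD²| = {np.max(np.abs(cd2_lidar - cd2_da2)):.2e}, p = {p:.3f}")
```

Pairing matters here: the per-frame differences are far smaller than the per-frame std, so an unpaired test would have no power to detect (or rule out) a modality gap.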
### E) VAE reconstruction (SMC paper)
| Metric | Value |
|--- |--- |
| Squared CD | 0.120 ± 0.026 m² |
| Parameters | 7.1 M |
| Inference | 1.6 ms / frame |
| Decoded points | 8,000 |
| Latent | 32 tokens × 1024-d |
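The 32 × 1024-d Gaussian latent implies a standard token-level reparameterization. A minimal PyTorch sketch (tensor names hypothetical; in the actual model, `mu` and `logvar` come from the residual PointNet encoder and 32-query cross-attention pooler):

```python
import torch

B, T, D = 2, 32, 1024  # batch, latent tokens, token dim

# Hypothetical pooler outputs: per-token Gaussian parameters.
mu = torch.randn(B, T, D)
logvar = torch.zeros(B, T, D)

# Reparameterization: z = mu + sigma * eps keeps sampling differentiable.
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

# KL divergence per token against a unit Gaussian prior, averaged over batch.
kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
```

The continuous Gaussian latent is what avoids the codebook collapse reported for the VQ-VAE alternative (~16 m² CD²): there is no discrete codebook to collapse.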
---
## Available checkpoints
| File | Size | What it is | Where used in papers |
|--- |--- |--- |--- |
| `teacher_v2gt/best_model.pth` | 542 MB | Diffusion teacher trained with GT-coordinate scaffold on v2 GT, 30 ep | SMC + RA-L Tables I, IV, V, VI |
| `teacher_v2gt_ft/best_model.pth` *(NEW)* | 470 MB | Mixed-scaffold fine-tune (3 ep, 45 min, 30 % LIDAR-cropped scaffold mix), seed 42 reference; 6-seed evaluation in RA-L paper | RA-L Table III (the 0.968 m² row) |
| `teacher_v1gt/best_model.pth` | 542 MB | Same architecture trained on v1 (standard) GT — for the 16× refinement comparison | SMC + RA-L Table I |
| `vae_v3/best_point_vae.pth` | ~28 MB | Multi-token Gaussian VAE (7.1 M params, 32 × 1024-d latent) | SMC §III-B, Table II |
All four share the frozen Sonata / PTv3 encoder (108 M params), loaded from the upstream Sonata release; only the small denoiser / VAE heads are trained.
---
## Architecture
- **Encoder**: frozen Point Transformer V3 / Sonata (108 M params, self-supervised pretraining), `voxel_size=0.05 m`, up to 20,000 input points → 256-dim per-point conditioning features.
- **Denoiser**: 8.9 M-parameter PTv3-style grouped vector-attention network, ε-prediction under cosine schedule (1,000 timesteps).
- **Inference**: scaffold-based single-step x₀ at **t = 200** (ᾱ₂₀₀ ≈ 0.748, SNR ≈ 2.96). The denoiser refines a noised-GT scaffold rather than generating points from pure Gaussian noise — this is the load-bearing design choice.
- **Refinement**: lightweight kNN-interpolation, ~10 ms.
- **End-to-end**: 129 ms encoder + 70 ms denoiser + 10 ms refinement = **209 ms / frame** on RTX 4090.
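The single-step x₀ inference above reduces to one algebraic step: noise the scaffold to level t, predict ε, and invert the forward process. A minimal sketch, taking ᾱ₂₀₀ ≈ 0.748 directly from this card (the real denoiser also takes the 256-d per-point conditioning features; the signature below is simplified):

```python
import math
import torch

ALPHA_BAR_T200 = 0.748  # cosine schedule, t = 200 (from this card)

def single_step_x0(scaffold: torch.Tensor, denoiser,
                   abar: float = ALPHA_BAR_T200, t: int = 200) -> torch.Tensor:
    """Noise the scaffold to level t, then recover x0 in one step from the
    eps-prediction: x0 = (x_t - sqrt(1 - abar) * eps_hat) / sqrt(abar)."""
    eps = torch.randn_like(scaffold)
    x_t = math.sqrt(abar) * scaffold + math.sqrt(1 - abar) * eps
    eps_hat = denoiser(x_t, t)  # simplified signature
    return (x_t - math.sqrt(1 - abar) * eps_hat) / math.sqrt(abar)
```

At ᾱ = 0.748 the signal-to-noise ratio ᾱ/(1 − ᾱ) ≈ 2.97, matching the SNR ≈ 2.96 quoted above: enough noise for the denoiser to correct the scaffold, not so much that the scaffold geometry is destroyed.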
---
## Quick-start usage
The reference inference and evaluation entry points are in the GitHub repo
(<https://github.com/A-C-Simon/sonata_ws/tree/main/sonata-workspace>):
| Use case | Script |
|---|---|
| In-distribution (GT-scaffold) full-val on seq 08 | `evaluate_ral_metrics.py --config teacher_v2gt_lidar_v2 --num_frames 4071` |
| In-distribution + custom sequence | `evaluate_ral_metrics_v2.py --config ... --sequence 00 --num_frames 500` |
| Paired matched-protocol scaffold-free vs LiDiff/ScoreLiDAR | `run_scaffoldfree_fair_str80.py` (pre-FT) / `run_scaffoldfree_fair_finetuned.py` (FT) |
| Mixed-scaffold deployment fine-tune (45 min, RTX 4090) | `finetune_mixed_scaffold.py` |
| Per-frame paired Wilcoxon eval | `eval_wilcoxon.py` |
Loading: each `.pth` is a standard PyTorch state-dict checkpoint of the diffusion teacher (frozen Sonata/PTv3 encoder + 8.9 M-parameter denoiser). See the `build_teacher_model` helper in `evaluate_ral_metrics_v2.py` for the construction.
For the original (pre-fine-tune) checkpoint use `teacher_v2gt/best_model.pth` and a noised-GT scaffold; this gives the 0.024 m² CD² in-distribution number but is out-of-distribution on per-frame LiDAR-only scaffolds (RA-L §IV-D / Table IV pre-FT rows). For deployment-style scaffolds (per-frame LiDAR sweep, ego-bbox crop, no GT) use the fine-tuned checkpoint `teacher_v2gt_ft/best_model.pth`, which delivers 0.727 m² CD² on the same matched protocol (seed 42 reference; 0.968 ± 0.194 m² across all 6 fine-tune seeds).
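Since each `.pth` is a plain state dict, the loading pattern can be sketched with a tiny stand-in module. `TinyDenoiser` below is purely illustrative; the real construction is `build_teacher_model` in `evaluate_ral_metrics_v2.py`.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Illustrative stand-in, not the 8.9 M-parameter denoiser."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(256, 3)  # 256-d conditioning -> 3-d coordinates

    def forward(self, feats):
        return self.head(feats)

model = TinyDenoiser()
torch.save(model.state_dict(), "tiny_ckpt.pth")  # same format as best_model.pth

restored = TinyDenoiser()
restored.load_state_dict(torch.load("tiny_ckpt.pth", map_location="cpu"))
restored.eval()
```

The same `load_state_dict` call applies to the real checkpoints once the matching architecture has been built.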
---
## Citation
Both papers are under review; please cite the appropriate venue.
```bibtex
@inproceedings{agbasiere2026pointdiffusion,
title = {PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain},
author = {Agbasiere, Chidera and Sannikov, Mikhail and Ogunwoye, Faith and
Shaikhiev, Erik and Kozinov, Alex and Mikhalchuk, Ilya and
Zhura, Iana and Tsetserukou, Dzmitry},
booktitle = {Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics (SMC)},
year = {2026},
note = {Submitted}
}
@article{zhura2026scaffdiff,
title = {ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion},
author = {Zhura, Iana and Sannikov, Mikhail and Agbasiere, Chidera and
Shaikhiev, Erik and Ogunwoye, Faith and Kozinov, Alex and
Mikhalchuk, Ilya and Tsetserukou, Dzmitry},
journal = {IEEE Robotics and Automation Letters (under review)},
year = {2026}
}
```
---
## Reproducibility
- **Code**: <https://github.com/A-C-Simon/sonata_ws> (upstream) and <https://github.com/msannikov03/sonata_ws> (development fork). All training and evaluation scripts, the 8-metric per-frame JSON archives backing every table, the 45-minute FT script, and the Mitsuba renderer for Fig. 1 are in the repo.
- **Compute**: a single NVIDIA RTX 4090 (24 GB) suffices for both training (~10 h, 30 epochs) and inference (4.78 FPS at 20 K / 20 K).
- **Data**: SemanticKITTI sequences 00–07, 09–10 (train) and 08 (val). The v2 ICP-refined GT (23,201 frames) is built by Algorithm 1 in the SMC paper; reconstruction script in `sonata-workspace/gt_refinement/`.
- **Eval scripts**:
- `evaluate_ral_metrics.py` — Protocol B (8 metrics, 20 K subsample, full val)
- `fair_scaffold_free_eval.py` — paired stride-80 50-frame matched-protocol comparison vs LiDiff / ScoreLiDAR
- `ablation_runner.py` — six scaffold-dominance ablations
- `evaluation_ddim.py` — 16-config DDIM sweep
- **Per-frame JSONs**:
- `eval_v1gt_fullval/teacher_v2gt_lidar_v1/all_metrics.json`
- `eval_v2gt_fullval/teacher_v2gt_lidar_v2/all_metrics.json`
- `eval_da2_fullval/teacher_v2gt_da2_v2/all_metrics.json`
- `evaluation_ddim_results.json`, `ablation_results.json`
---
## What changed vs. the old card
The previous version of this README is superseded. A summary of removed / updated claims:
| Old claim | Status | Replaced by |
|--- |--- |--- |
| "Modality-agnostic encoder" headline framing | Deprecated | Scaffold-dominant framing — the GT coordinate scaffold, not the encoder, is load-bearing (RA-L §V) |
| Teacher-vs-Student distillation tables (CD 0.039 / 0.040 m²) | Deprecated | RA-L random-init PTv3 control (Table III) shows there is no student / teacher gap to bridge |
| "45–1250× faster than published diffusion methods" / "24 ms / frame, 42 FPS" | Corrected | End-to-end 209 ms / frame (4.78 FPS), 143× LiDiff / 25× ScoreLiDAR (RA-L Table V) |
| Headline CD² 0.039 ± 0.009 m² (Protocol A, ~50 frames, 10 K subsample) | Superseded | Headline is now CD² 0.024 ± 0.005 m² (Protocol B, 4,071 frames, 20 K subsample) |
| Student-CD numbers / centering-bug erratum | Resolved | The student model is no longer part of the story — the SMC and RA-L drafts contain no student row |
| "Citation: Under review." | Updated | Two distinct submissions (SMC + RA-L), BibTeX above |
The two qualitative figures (`teacher_v2gt_sample1.png`, `teacher_v2gt_sample2.png`, `vae_v3_sample.png`) remain valid and accompany this release.
License: MIT (unchanged).