---
license: mit
pipeline_tag: other
tags:
- 3d-scene-completion
- diffusion
- point-cloud
- lidar
- depth-anything
- semantickitti
- scaffold-dominant
library_name: pytorch
---

# ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion

Single-step, scaffold-dominant diffusion for 3D scene completion on SemanticKITTI. Released alongside two paper submissions: an SMC 2026 short paper on the multi-token Gaussian VAE and ICP-refined ground truth (v2 GT), and an RA-L 2026 journal paper on scaffold dominance and a 45-minute mixed-scaffold deployment fine-tune. The pipeline runs at 209 ms/frame (4.78 FPS) on a single RTX 4090 — 143× faster than LiDiff and 25× faster than ScoreLiDAR — and the fine-tuned teacher beats both by 69–72 % in squared Chamfer distance in a paired, matched-protocol, scaffold-free comparison across 6 fine-tune seeds.

> **Note on the previous version of this card.** The earlier README framed this work around a "modality-agnostic encoder" and "teacher-vs-student distillation", with student CDs reported under a different metric protocol. Those numbers and that framing are superseded by the two paper drafts that this card now mirrors. See [What changed vs. the old card](#what-changed-vs-the-old-card) at the bottom.

---

## Two-paper context

### SMC 2026 (submitted Apr 20, 2026) — *PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain*

The companion short paper covers:

- **Multi-token Gaussian VAE** (7.1 M params): residual PointNet encoder + 32-query cross-attention pooler producing 32 × 1024-d Gaussian latent tokens; a transformer decoder with 5 cross-attention blocks reconstructs 8,000 scene points. Squared CD **0.120 ± 0.026 m²** at **1.6 ms/frame**, no codebook collapse (vs. ~16 m² for the VQ-VAE alternative).
- **Anchor-based ICP ground-truth refinement (v2 GT)**: per-scan ICP against a w=5 temporal window with displacement-gated acceptance, SOR/ROR cleanup, and 0.05 m voxelisation. Yields ~50× denser GT over 23,201 frames.
- **Single-step x₀ scaffold-free diffusion teacher** evaluated in a matched-protocol scaffold-free comparison vs. LiDiff and ScoreLiDAR on the same 50 frames, the same v2 GT, and the same bbox ±1 m crop.

### RA-L 2026 (submitted May 2026) — *ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion*

The journal paper extends the analysis with:

- **Scaffold-dominance ablations** (six controlled runs): zeroing the 108 M-parameter PTv3 encoder costs only +58 % CD²; cutting input density from 20 K to 2 K points costs only +2 %; but replacing the GT scaffold with random noise or a regular voxel grid degrades CD² by ≥4 orders of magnitude (>50,000× and ~43,000× respectively).
- **Random-init PTv3 control**: removing the 108 M parameters of self-supervised pretraining does not open a LiDAR↔DA2 gap (CD² 0.0279 vs. 0.0279 m²). The frozen encoder is therefore not the load-bearing component — the GT coordinate scaffold is.
- **45-minute mixed-scaffold fine-tune**: a 3-epoch fine-tune (lr 5e-5, 30 % LiDAR-cropped scaffold mix, t ∈ [50, 400], encoder frozen) that transfers the same denoiser to a deployment-realistic scaffold (per-frame LiDAR sweep, ego-bbox crop, **no GT access**) at squared CD **0.968 ± 0.194 m²** across 6 fine-tune seeds on a paired 50-frame protocol — **69–72 % below LiDiff / ScoreLiDAR** and 13× better than the pre-FT baseline.
- **Cross-sequence in-distribution evaluation** on seqs 00 / 05 / 08 (CD² range 6.7 %), a DDIM multi-step sweep (single-step at t=200 is optimal), a 500-frame scaffold-quality jitter sweep, an iterative self-scaffolding drift study, and an N-seed ensemble confirming that the LiDAR↔DA2 gap stays within sampling noise (max |ΔCD²| ≤ 7×10⁻⁴, paired Wilcoxon p > 0.75) at every N ∈ {1, 2, 4, 8}.

---

## Headline results

All numbers are squared symmetric Chamfer distance in m², matching the PVD / LiDiff convention. Linear CD in m is reported where it enables direct comparison.
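For reference, here is a minimal sketch of the squared symmetric Chamfer distance in this convention, assuming the sum-of-directional-means variant; the exact subsampling, cropping, and matching used in the tables are defined by the repo's eval scripts, not by this snippet:

```python
import numpy as np
from scipy.spatial import cKDTree

def squared_chamfer(pred: np.ndarray, gt: np.ndarray) -> float:
    """Squared symmetric Chamfer distance (m^2): mean squared nearest-neighbour
    distance pred -> gt, plus mean squared nearest-neighbour distance gt -> pred."""
    d_pred_to_gt, _ = cKDTree(gt).query(pred)    # NN distance in metres for each predicted point
    d_gt_to_pred, _ = cKDTree(pred).query(gt)    # NN distance in metres for each GT point
    return float(np.mean(d_pred_to_gt ** 2) + np.mean(d_gt_to_pred ** 2))
```

Identical clouds give 0; two single points 1 m apart give 1 + 1 = 2 m², which is why squared CD penalises outliers much more heavily than linear CD.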
### A) GT-scaffold full-validation, seq 08 (4,071 frames, 20 K subsample)

| Configuration | CD lin (m) | CD² (m²) | F@0.2 ↑ | H₉₅ (m) ↓ | Latency |
|---|---|---|---|---|---|
| Ours, **v1 GT** (standard accumulation) | 0.496 ± 0.05 | 0.396 ± 0.090 | 0.104 | 1.113 | 209 ms / 4.78 FPS |
| **Ours, v2 GT (ICP-refined)** | **0.138 ± 0.015** | **0.024 ± 0.005** | **0.844** | **0.285** | 209 ms / 4.78 FPS |

**Data-quality dominance**: a 16.5× CD² gain from GT refinement alone, with no model change.

### B) Matched-protocol scaffold-free comparison, v2 GT, paired stride-80 50 frames

| Method | Variant | CD² (m²) ↓ |
|---|---|---|
| LiDiff (Nunes et al., CVPR 2024) | 50-step DDPM, diff | 3.41 ± 2.55 |
| LiDiff | + refine head | 3.50 ± 2.62 |
| ScoreLiDAR (Zhang et al., ICCV 2025) | 8-step, diff | 3.19 ± 2.59 |
| ScoreLiDAR | + refine head | 3.15 ± 2.60 |
| Ours (teacher, GT-scaffold OOD here) | single-step x₀ | 12.58 ± 8.14 |
| **Ours (teacher-FT, mixed scaffold, 6 seeds)** | **single-step x₀ (kdtree match)** | **0.968 ± 0.194** |

**69–72 % below LiDiff / ScoreLiDAR** (6-seed across-seed mean ± std, epoch-2 selection; per-seed kdtree CD² 0.73 / 1.03 / 0.75 / 1.12 / 0.97 / 1.21). Paired Wilcoxon p < 1e-12 (seed 42 reference). End-to-end latency is unchanged at 209 ms/frame: 143× faster than LiDiff (30 s/frame, 50 steps) and 25× faster than ScoreLiDAR (5.37 s/frame, 8 steps). The non-diffusion LiNeXt (167 ms) is the only comparable-latency baseline.

### C) Cross-sequence in-distribution evaluation

| Sequence | n frames | CD² (m²) | F@0.2 | H₉₅ (m) |
|---|---|---|---|---|
| Seq 00 | 500 | 0.024 ± 0.004 | 0.844 | 0.284 |
| Seq 05 | 500 | 0.026 ± 0.004 | 0.825 | 0.293 |
| Seq 08 | 4,071 | 0.024 ± 0.005 | 0.844 | 0.285 |

The CD² range across the three sequences is 6.7 % of the smallest value — well below the per-frame std.
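The "single-step x₀" rows above invert the forward noising in one shot: under ε-prediction, x̂₀ = (x_t − √(1−ᾱ_t) ε̂) / √ᾱ_t. A minimal sketch of that inversion with a stand-in for the denoiser's output; the real schedule, the denoiser, and the quoted ᾱ₂₀₀ ≈ 0.748 come from the repo, not from this snippet:

```python
import numpy as np

def single_step_x0(x_t: np.ndarray, eps_hat: np.ndarray, alpha_bar_t: float) -> np.ndarray:
    """One-shot x0 estimate from a noised scaffold under epsilon-prediction:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

# Forward-noise a stand-in GT scaffold at t = 200 (abar value taken from the card).
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8000, 3))          # stand-in for GT scaffold coordinates
eps = rng.normal(size=(8000, 3))         # the true noise
abar_200 = 0.748
x_t = np.sqrt(abar_200) * x0 + np.sqrt(1.0 - abar_200) * eps

# With a perfect epsilon prediction, a single step recovers x0 exactly;
# in practice the denoiser's eps_hat only approximates eps.
x0_hat = single_step_x0(x_t, eps, abar_200)
assert np.allclose(x0_hat, x0)
```

Because the scaffold already carries most of the scene geometry, one such step at t = 200 suffices; the DDIM sweep in the RA-L paper finds no benefit from multi-step sampling.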
### D) Modality-agnostic full-val (4,071 frames, v2 GT)

LiDAR vs. DA2 (Depth Anything V2 monocular pseudo-LiDAR):

- CD lin: 0.1381 vs 0.1381 m
- CD²: 0.0241 vs 0.0241 m²
- JSD: 0.0337 vs 0.0337
- F@0.2: 0.8439 vs 0.8435
- H₉₅: 0.2854 vs 0.2863 m
- |ΔCD²| < 4×10⁻⁴ m²; Wilcoxon p > 0.75.

The result also holds with a **randomly initialised** PTv3 encoder.

### E) VAE reconstruction (SMC paper)

| Metric | Value |
|---|---|
| Squared CD | 0.120 ± 0.026 m² |
| Parameters | 7.1 M |
| Inference | 1.6 ms / frame |
| Decoded points | 8,000 |
| Latent | 32 tokens × 1024-d |

---

## Available checkpoints

| File | Size | What it is | Where used in papers |
|---|---|---|---|
| `teacher_v2gt/best_model.pth` | 542 MB | Diffusion teacher trained with a GT-coordinate scaffold on v2 GT, 30 ep | SMC + RA-L Tables I, IV, V, VI |
| `teacher_v2gt_ft/best_model.pth` *(NEW)* | 470 MB | Mixed-scaffold fine-tune (3 ep, 45 min, 30 % LiDAR-cropped scaffold mix), seed-42 reference; 6-seed evaluation in the RA-L paper | RA-L Table III (the 0.968 m² row) |
| `teacher_v1gt/best_model.pth` | 542 MB | Same architecture trained on v1 (standard) GT — for the 16× refinement comparison | SMC + RA-L Table I |
| `vae_v3/best_point_vae.pth` | ~28 MB | Multi-token Gaussian VAE (7.1 M params, 32 × 1024-d latent) | SMC §III-B, Table II |

All four share the frozen Sonata / PTv3 encoder (108 M params), loaded from the upstream Sonata release; only the small denoiser / VAE heads are trained.

---

## Architecture

- **Encoder**: frozen Point Transformer V3 / Sonata (108 M params, self-supervised pretraining), `voxel_size=0.05 m`, up to 20,000 input points → 256-dim per-point conditioning features.
- **Denoiser**: 8.9 M-parameter PTv3-style grouped vector-attention network, ε-prediction under a cosine schedule (1,000 timesteps).
- **Inference**: scaffold-based single-step x₀ at **t = 200** (ᾱ₂₀₀ ≈ 0.748, SNR ≈ 2.96).
  The denoiser refines a noised-GT scaffold rather than generating points from pure Gaussian noise — this is the load-bearing design choice.
- **Refinement**: lightweight kNN interpolation, ~10 ms.
- **End-to-end**: 129 ms encoder + 70 ms denoiser + 10 ms refinement = **209 ms / frame** on an RTX 4090.

---

## Quick-start usage

The reference inference and evaluation entry points are in the GitHub repo:

| Use case | Script |
|---|---|
| In-distribution (GT-scaffold) full-val on seq 08 | `evaluate_ral_metrics.py --config teacher_v2gt_lidar_v2 --num_frames 4071` |
| In-distribution + custom sequence | `evaluate_ral_metrics_v2.py --config ... --sequence 00 --num_frames 500` |
| Paired matched-protocol scaffold-free vs LiDiff/ScoreLiDAR | `run_scaffoldfree_fair_str80.py` (pre-FT) / `run_scaffoldfree_fair_finetuned.py` (FT) |
| Mixed-scaffold deployment fine-tune (45 min, RTX 4090) | `finetune_mixed_scaffold.py` |
| Per-frame paired Wilcoxon eval | `eval_wilcoxon.py` |

Loading: each `.pth` is a standard PyTorch state-dict checkpoint of the diffusion teacher (frozen Sonata/PTv3 encoder + 8.9 M-parameter denoiser). See the `build_teacher_model` helper in `evaluate_ral_metrics_v2.py` for the construction.

For the original (pre-fine-tune) checkpoint use `teacher_v2gt/best_model.pth` with a noised-GT scaffold; this gives the 0.024 m² CD² in-distribution number but is out-of-distribution on per-frame LiDAR-only scaffolds (RA-L §IV-D / Table IV pre-FT rows). For deployment-style scaffolds (per-frame LiDAR sweep, ego-bbox crop, no GT) use the fine-tuned checkpoint `teacher_v2gt_ft/best_model.pth`, which delivers 0.727 m² CD² on the same matched protocol (seed-42 reference; 6-seed mean 0.968 ± 0.194 m²).

---

## Citation

Both papers are under review; please cite the appropriate venue.
```bibtex
@inproceedings{agbasiere2026pointdiffusion,
  title     = {PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain},
  author    = {Agbasiere, Chidera and Sannikov, Mikhail and Ogunwoye, Faith and Shaikhiev, Erik and Kozinov, Alex and Mikhalchuk, Ilya and Zhura, Iana and Tsetserukou, Dzmitry},
  booktitle = {Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics (SMC)},
  year      = {2026},
  note      = {Submitted}
}

@article{zhura2026scaffdiff,
  title   = {ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion},
  author  = {Zhura, Iana and Sannikov, Mikhail and Agbasiere, Chidera and Shaikhiev, Erik and Ogunwoye, Faith and Kozinov, Alex and Mikhalchuk, Ilya and Tsetserukou, Dzmitry},
  journal = {IEEE Robotics and Automation Letters (under review)},
  year    = {2026}
}
```

---

## Reproducibility

- **Code**: (upstream) and (development fork). All training and evaluation scripts, the 8-metric per-frame JSON archives backing every table, the 45-minute FT script, and the Mitsuba renderer for Fig. 1 are in the repo.
- **Compute**: a single NVIDIA RTX 4090 (24 GB) suffices for both training (~10 h, 30 epochs) and inference (4.78 FPS at 20 K / 20 K).
- **Data**: SemanticKITTI sequences 00–07 and 09–10 (train), 08 (val). The v2 ICP-refined GT (23,201 frames) is built by Algorithm 1 in the SMC paper; the reconstruction script is in `sonata-workspace/gt_refinement/`.
- **Eval scripts**:
  - `evaluate_ral_metrics.py` — Protocol B (8 metrics, 20 K subsample, full val)
  - `fair_scaffold_free_eval.py` — paired stride-80 50-frame matched-protocol comparison vs LiDiff / ScoreLiDAR
  - `ablation_runner.py` — the six scaffold-dominance ablations
  - `evaluation_ddim.py` — 16-config DDIM sweep
- **Per-frame JSONs**:
  - `eval_v1gt_fullval/teacher_v2gt_lidar_v1/all_metrics.json`
  - `eval_v2gt_fullval/teacher_v2gt_lidar_v2/all_metrics.json`
  - `eval_da2_fullval/teacher_v2gt_da2_v2/all_metrics.json`
  - `evaluation_ddim_results.json`, `ablation_results.json`

---

## What changed vs. the old card

The previous version of this README is superseded. A summary of removed / updated claims:

| Old claim | Status | Replaced by |
|---|---|---|
| "Modality-agnostic encoder" headline framing | Deprecated | Scaffold-dominant framing — the GT coordinate scaffold, not the encoder, is load-bearing (RA-L §V) |
| Teacher-vs-student distillation tables (CD 0.039 / 0.040 m²) | Deprecated | The RA-L random-init PTv3 control (Table III) shows there is no student / teacher gap to bridge |
| "45–1250× faster than published diffusion methods" / "24 ms / frame, 42 FPS" | Corrected | End-to-end 209 ms / frame (4.78 FPS), 143× LiDiff / 25× ScoreLiDAR (RA-L Table V) |
| Headline CD² 0.039 ± 0.009 m² (Protocol A, ~50 frames, 10 K subsample) | Superseded | Headline is now CD² 0.024 ± 0.005 m² (Protocol B, 4,071 frames, 20 K subsample) |
| Student-CD numbers / centering-bug erratum | Resolved | The student model is no longer part of the story — the SMC and RA-L drafts contain no student row |
| "Citation: Under review." | Updated | Two distinct submissions (SMC + RA-L), BibTeX above |

The three qualitative figures (`teacher_v2gt_sample1.png`, `teacher_v2gt_sample2.png`, `vae_v3_sample.png`) remain valid and accompany this release.

License: MIT (unchanged).