---
license: mit
pipeline_tag: other
tags:
- 3d-scene-completion
- diffusion
- point-cloud
- lidar
- depth-anything
- semantickitti
- scaffold-dominant
library_name: pytorch
---
# ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion
Single-step, scaffold-dominant diffusion for 3D scene completion on SemanticKITTI. Released alongside two paper submissions: an SMC 2026 short paper on the multi-token Gaussian VAE and ICP-refined ground truth (v2 GT), and an RA-L 2026 journal paper on scaffold dominance and a 45-minute mixed-scaffold deployment fine-tune. The pipeline runs at 209 ms/frame (4.78 FPS) on a single RTX 4090 — 143× faster than LiDiff and 25× faster than ScoreLiDAR — and the fine-tuned teacher beats both by 69–72 % in squared Chamfer distance under a paired, matched-protocol scaffold-free comparison across 6 fine-tune seeds.
> Note on the previous version of this card. The earlier README framed this work around "modality-agnostic encoder" and "teacher-vs-student distillation," with student CDs reported on a different metric protocol. Those numbers and that framing are superseded by the two paper drafts that this card now mirrors. See [What changed vs. the old card](#what-changed-vs-the-old-card) at the bottom.
---
## Two-paper context
### SMC 2026 (submitted Apr 20, 2026) — *PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain*
The companion short paper covers:
- **Multi-token Gaussian VAE** (7.1 M params): residual PointNet encoder + 32-query cross-attention pooler producing 32 × 1024-d Gaussian latent tokens; transformer decoder with 5 cross-attention blocks reconstructs 8,000 scene points. Squared CD **0.120 ± 0.026 m²** at **1.6 ms / frame**, no codebook collapse (vs ~16 m² for the VQ-VAE alternative).
- **Anchor-based ICP ground-truth refinement (v2 GT)**: per-scan ICP against a w=5 temporal window with displacement-gated acceptance, SOR/ROR cleanup, 0.05 m voxelisation. Yields ~50× denser GT over 23,201 frames.
- **Single-step x₀ scaffold-free diffusion teacher** evaluated under a matched-protocol scaffold-free comparison vs LiDiff and ScoreLiDAR on the same 50 frames / same v2 GT / same bbox ±1 m crop.
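Two pieces of the v2 GT pipeline are easy to illustrate in isolation: the displacement gate that decides whether an ICP alignment is accepted, and the 0.05 m voxelisation. This is a minimal NumPy sketch; the gate threshold `max_disp` is a hypothetical placeholder (the exact value is specified by Algorithm 1 in the SMC draft), and the real pipeline also applies SOR/ROR cleanup.

```python
import numpy as np

def displacement_gate(T: np.ndarray, max_disp: float = 0.5) -> bool:
    """Accept an ICP alignment only if its translation magnitude is small.
    T is a 4x4 homogeneous transform; max_disp (m) is a placeholder value."""
    return float(np.linalg.norm(T[:3, 3])) <= max_disp

def voxel_downsample(points: np.ndarray, voxel: float = 0.05) -> np.ndarray:
    """0.05 m voxelisation: keep one representative point per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]
```

Each accepted, gated scan in the w=5 window is merged and then voxelised, which is how the accumulated GT ends up ~50× denser without duplicate-point bloat.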
### RA-L 2026 (submitted May 2026) — *ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion*
The journal paper extends the analysis with:
- **Scaffold-dominance ablations** (six controlled runs): zeroing the 108 M PTv3 encoder costs only +58 % CD²; cutting input density from 20 K to 2 K points costs only +2 %; but replacing the GT scaffold with random noise or a regular voxel grid degrades CD² by ≥4 orders of magnitude (>50,000× and ~43,000× worse, respectively).
- **Random-init PTv3 control**: removing the 108 M parameters of self-supervised pretraining does not open a LiDAR↔DA2 gap (CD² 0.0279 vs 0.0279 m²). The frozen encoder is therefore not the load-bearing component — the GT coordinate scaffold is.
- **45-minute mixed-scaffold fine-tune**: 3-epoch fine-tune (lr 5e-5, 30 % LIDAR-cropped scaffold mix, t ∈ [50, 400], encoder frozen) that transfers the same denoiser to a deployment-realistic scaffold (per-frame LiDAR sweep, ego-bbox crop, **no GT access**) at squared CD **0.968 ± 0.194 m²** across 6 fine-tune seeds on a paired 50-frame protocol — **69–72 % below LiDiff / ScoreLiDAR**, 13× over the pre-FT baseline.
- **Cross-sequence in-distribution evaluation** on seqs 00 / 05 / 08 (range 6.7 % in CD²), DDIM multi-step sweep (single-step at t=200 is optimal), 500-frame scaffold-quality jitter sweep, iterative self-scaffolding drift study, and an N-seed ensemble that confirms the LiDAR↔DA2 gap stays within sampling noise (max |ΔCD²| ≤ 7×10⁻⁴, paired Wilcoxon p > 0.75) at every N ∈ {1, 2, 4, 8}.
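The fine-tune recipe above (lr 5e-5, 3 epochs, 30 % LiDAR-scaffold mix, t ∈ [50, 400], encoder frozen) can be sketched as a single training step. This is an illustrative skeleton under the standard ε-prediction objective, not the repo's `finetune_mixed_scaffold.py`; the function and argument names are hypothetical.

```python
import math
import random
import torch

def finetune_step(denoiser, optimizer, gt_points, lidar_points, alpha_bar):
    """One mixed-scaffold fine-tune step (encoder assumed frozen elsewhere)."""
    # 30% of steps use the deployment-style LiDAR-cropped scaffold,
    # 70% keep the GT-coordinate scaffold from pre-training.
    scaffold = lidar_points if random.random() < 0.30 else gt_points
    t = random.randint(50, 400)                 # restricted timestep range
    abar = alpha_bar[t]
    eps = torch.randn_like(scaffold)
    x_t = math.sqrt(abar) * scaffold + math.sqrt(1 - abar) * eps
    loss = torch.nn.functional.mse_loss(denoiser(x_t, t), eps)  # eps-prediction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With an `Adam`-style optimizer at lr 5e-5 and the small 8.9 M denoiser, 3 epochs of this loop fit in the quoted 45 minutes on one RTX 4090.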
---
## Headline results
All numbers are squared symmetric Chamfer distance in m², matching the PVD / LiDiff convention. Linear CD in m is reported where it enables direct comparison.
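For concreteness, the metric under the PVD / LiDiff convention is the sum of the two directional means of squared nearest-neighbour distances. A minimal NumPy/SciPy sketch:

```python
import numpy as np
from scipy.spatial import cKDTree

def squared_chamfer(a: np.ndarray, b: np.ndarray) -> float:
    """Squared symmetric Chamfer distance (PVD/LiDiff convention), in m²:
    mean squared NN distance a->b plus mean squared NN distance b->a."""
    d_ab, _ = cKDTree(b).query(a)  # NN distance from each point of a into b
    d_ba, _ = cKDTree(a).query(b)
    return float(np.mean(d_ab**2) + np.mean(d_ba**2))
```

Linear CD replaces the squared distances with absolute distances, which is why the two columns in Table A are not simple squares of one another.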
### A) GT-scaffold full-validation, seq 08 (4,071 frames, 20 K subsample)
| Configuration | CD lin (m) | CD² (m²) | F@0.2 ↑ | H₉₅ (m) ↓ | Latency |
|--- |--- |--- |--- |--- |--- |
| Ours, **v1 GT** (standard accumulation)| 0.496 ± 0.05 | 0.396 ± 0.090 | 0.104 | 1.113 | 209 ms / 4.78 FPS |
| **Ours, v2 GT (ICP-refined)** | **0.138 ± 0.015** | **0.024 ± 0.005** | **0.844** | **0.285** | 209 ms / 4.78 FPS |
**Data-quality dominance**: 16.5× CD² gain from GT refinement alone, no model change.
### B) Matched-protocol scaffold-free comparison, v2 GT, paired stride-80 50 frames
| Method | Variant | CD² (m²) ↓ |
|--- |--- |--- |
| LiDiff (Nunes et al., CVPR 2024) | 50-step DDPM, diff | 3.41 ± 2.55 |
| LiDiff | + refine head | 3.50 ± 2.62 |
| ScoreLiDAR (Zhang et al., ICCV 2025) | 8-step, diff | 3.19 ± 2.59 |
| ScoreLiDAR | + refine head | 3.15 ± 2.60 |
| Ours (teacher, GT-scaffold OOD here) | single-step x₀ | 12.58 ± 8.14 |
| **Ours (teacher-FT, mixed scaffold, 6 seeds)** | **single-step x₀ (kdtree match)** | **0.968 ± 0.194** |
**69–72 % below LiDiff / ScoreLiDAR** (6-seed across-seed mean ± std, epoch-2 selection; per-seed kdtree CD² 0.73 / 1.03 / 0.75 / 1.12 / 0.97 / 1.21). Paired Wilcoxon p < 1e-12 (seed 42 reference). End-to-end latency is unchanged at 209 ms/frame: 143× faster than LiDiff (30 s/frame, 50 steps), 25× faster than ScoreLiDAR (5.37 s/frame, 8 steps). Non-diffusion LiNeXt (167 ms) is the only comparable-latency baseline.
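The 69–72 % figure follows directly from the table and can be checked in a few lines:

```python
ours = 0.968  # fine-tuned teacher, 6-seed mean CD² (m²)
baselines = {
    "LiDiff (50-step)": 3.41,
    "LiDiff + refine": 3.50,
    "ScoreLiDAR (8-step)": 3.19,
    "ScoreLiDAR + refine": 3.15,
}
for name, cd2 in baselines.items():
    print(f"{name}: {100 * (1 - ours / cd2):.1f}% lower CD²")
# improvements span roughly 69.3% (ScoreLiDAR + refine) to 72.3% (LiDiff + refine)
```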
### C) Cross-sequence in-distribution evaluation
| Sequence | n frames | CD² (m²) | F@0.2 | H₉₅ (m) |
|--- |--- |--- |--- |--- |
| Seq 00 | 500 | 0.024 ± 0.004 | 0.844 | 0.284 |
| Seq 05 | 500 | 0.026 ± 0.004 | 0.825 | 0.293 |
| Seq 08 | 4,071 | 0.024 ± 0.005 | 0.844 | 0.285 |
CD² range across the three sequences is 6.7 % of the smallest value — well below per-frame std.
### D) Modality-agnostic full-val (4,071 frames, v2 GT)
LiDAR vs DA2 (Depth Anything V2 monocular pseudo-LiDAR):
- CD lin: 0.1381 vs 0.1381 m
- CD²: 0.0241 vs 0.0241 m²
- JSD: 0.0337 vs 0.0337
- F@0.2: 0.8439 vs 0.8435
- H₉₅: 0.2854 vs 0.2863 m
- |ΔCD²| < 4×10⁻⁴ m²; Wilcoxon p > 0.75. Result also holds with a **randomly initialised** PTv3 encoder.
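The p > 0.75 figure comes from a paired, per-frame Wilcoxon signed-rank test on the two modalities' CD² arrays. A minimal sketch with synthetic stand-in data (in practice the arrays are loaded from the per-frame `all_metrics.json` archives for the LiDAR and DA2 runs):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Stand-in per-frame CD² arrays mimicking a near-identical paired modality run.
cd2_lidar = np.abs(rng.normal(0.0241, 0.005, size=4071))
cd2_da2 = cd2_lidar + rng.normal(0.0, 1e-4, size=4071)

stat, p = wilcoxon(cd2_lidar, cd2_da2)  # paired, two-sided signed-rank test
print(f"max |dCD²| = {np.max(np.abs(cd2_lidar - cd2_da2)):.2e}, p = {p:.3f}")
```

Pairing matters here: the per-frame differences are far smaller than the per-frame std, so an unpaired test would have no power to detect (or rule out) a modality gap.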
### E) VAE reconstruction (SMC paper)
| Metric | Value |
|--- |--- |
| Squared CD | 0.120 ± 0.026 m² |
| Parameters | 7.1 M |
| Inference | 1.6 ms / frame |
| Decoded points | 8,000 |
| Latent | 32 tokens × 1024-d |
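The 32 × 1024-d Gaussian latent implies a standard token-level reparameterization. A minimal PyTorch sketch (tensor names hypothetical; in the actual model, `mu` and `logvar` come from the residual PointNet encoder and 32-query cross-attention pooler):

```python
import torch

B, T, D = 2, 32, 1024  # batch, latent tokens, token dim

# Hypothetical pooler outputs: per-token Gaussian parameters.
mu = torch.randn(B, T, D)
logvar = torch.zeros(B, T, D)

# Reparameterization: z = mu + sigma * eps keeps sampling differentiable.
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

# KL divergence per token against a unit Gaussian prior, averaged over batch.
kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
```

The continuous Gaussian latent is what avoids the codebook collapse reported for the VQ-VAE alternative (~16 m² CD²): there is no discrete codebook to collapse.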
---
## Available checkpoints
| File | Size | What it is | Where used in papers |
|--- |--- |--- |--- |
| `teacher_v2gt/best_model.pth` | 542 MB | Diffusion teacher trained with GT-coordinate scaffold on v2 GT, 30 ep | SMC + RA-L Tables I, IV, V, VI |
| `teacher_v2gt_ft/best_model.pth` *(NEW)* | 470 MB | Mixed-scaffold fine-tune (3 ep, 45 min, 30 % LIDAR-cropped scaffold mix), seed 42 reference; 6-seed evaluation in RA-L paper | RA-L Table III (the 0.968 m² row) |
| `teacher_v1gt/best_model.pth` | 542 MB | Same architecture trained on v1 (standard) GT — for the 16× refinement comparison | SMC + RA-L Table I |
| `vae_v3/best_point_vae.pth` | ~28 MB | Multi-token Gaussian VAE (7.1 M params, 32 × 1024-d latent) | SMC §III-B, Table II |
All four share the frozen Sonata / PTv3 encoder (108 M params), loaded from the upstream Sonata release; only the small denoiser / VAE heads are trained.
---
## Architecture
- **Encoder**: frozen Point Transformer V3 / Sonata (108 M params, self-supervised pretraining), `voxel_size=0.05 m`, up to 20,000 input points → 256-dim per-point conditioning features.
- **Denoiser**: 8.9 M-parameter PTv3-style grouped vector-attention network, ε-prediction under cosine schedule (1,000 timesteps).
- **Inference**: scaffold-based single-step x₀ at **t = 200** (ᾱ₂₀₀ ≈ 0.748, SNR ≈ 2.96). The denoiser refines a noised-GT scaffold rather than generating points from pure Gaussian noise — this is the load-bearing design choice.
- **Refinement**: lightweight kNN-interpolation, ~10 ms.
- **End-to-end**: 129 ms encoder + 70 ms denoiser + 10 ms refinement = **209 ms / frame** on RTX 4090.
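The single-step x₀ inference above reduces to one algebraic step: noise the scaffold to level t, predict ε, and invert the forward process. A minimal sketch, taking ᾱ₂₀₀ ≈ 0.748 directly from this card (the real denoiser also takes the 256-d per-point conditioning features; the signature below is simplified):

```python
import math
import torch

ALPHA_BAR_T200 = 0.748  # cosine schedule, t = 200 (from this card)

def single_step_x0(scaffold: torch.Tensor, denoiser,
                   abar: float = ALPHA_BAR_T200, t: int = 200) -> torch.Tensor:
    """Noise the scaffold to level t, then recover x0 in one step from the
    eps-prediction: x0 = (x_t - sqrt(1 - abar) * eps_hat) / sqrt(abar)."""
    eps = torch.randn_like(scaffold)
    x_t = math.sqrt(abar) * scaffold + math.sqrt(1 - abar) * eps
    eps_hat = denoiser(x_t, t)  # simplified signature
    return (x_t - math.sqrt(1 - abar) * eps_hat) / math.sqrt(abar)
```

At ᾱ = 0.748 the signal-to-noise ratio ᾱ/(1 − ᾱ) ≈ 2.97, matching the SNR ≈ 2.96 quoted above: enough noise for the denoiser to correct the scaffold, not so much that the scaffold geometry is destroyed.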
---
## Quick-start usage
The reference inference and evaluation entry points are in the GitHub repo
(<https://github.com/A-C-Simon/sonata_ws/tree/main/sonata-workspace>):
| Use case | Script |
|---|---|
| In-distribution (GT-scaffold) full-val on seq 08 | `evaluate_ral_metrics.py --config teacher_v2gt_lidar_v2 --num_frames 4071` |
| In-distribution + custom sequence | `evaluate_ral_metrics_v2.py --config ... --sequence 00 --num_frames 500` |
| Paired matched-protocol scaffold-free vs LiDiff/ScoreLiDAR | `run_scaffoldfree_fair_str80.py` (pre-FT) / `run_scaffoldfree_fair_finetuned.py` (FT) |
| Mixed-scaffold deployment fine-tune (45 min, RTX 4090) | `finetune_mixed_scaffold.py` |
| Per-frame paired Wilcoxon eval | `eval_wilcoxon.py` |
Loading: each `.pth` is a standard PyTorch state-dict checkpoint of the diffusion teacher (frozen Sonata/PTv3 encoder + 8.9 M-parameter denoiser). See the `build_teacher_model` helper in `evaluate_ral_metrics_v2.py` for the construction.
For the original (pre-fine-tune) checkpoint use `teacher_v2gt/best_model.pth` and a noised-GT scaffold; this gives the 0.024 m² CD² in-distribution number but is out-of-distribution on per-frame LiDAR-only scaffolds (RA-L §IV-D / Table IV pre-FT rows). For deployment-style scaffolds (per-frame LiDAR sweep, ego-bbox crop, no GT) use the fine-tuned checkpoint `teacher_v2gt_ft/best_model.pth`, which delivers 0.727 m² CD² on the same matched protocol (seed 42 reference; 0.968 ± 0.194 m² across all 6 fine-tune seeds).
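Since each `.pth` is a plain state dict, the loading pattern can be sketched with a tiny stand-in module. `TinyDenoiser` below is purely illustrative; the real construction is `build_teacher_model` in `evaluate_ral_metrics_v2.py`.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Illustrative stand-in, not the 8.9 M-parameter denoiser."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(256, 3)  # 256-d conditioning -> 3-d coordinates

    def forward(self, feats):
        return self.head(feats)

model = TinyDenoiser()
torch.save(model.state_dict(), "tiny_ckpt.pth")  # same format as best_model.pth

restored = TinyDenoiser()
restored.load_state_dict(torch.load("tiny_ckpt.pth", map_location="cpu"))
restored.eval()
```

The same `load_state_dict` call applies to the real checkpoints once the matching architecture has been built.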
---
## Citation
Both papers are under review; please cite the appropriate venue.
```bibtex
@inproceedings{agbasiere2026pointdiffusion,
title = {PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain},
author = {Agbasiere, Chidera and Sannikov, Mikhail and Ogunwoye, Faith and
Shaikhiev, Erik and Kozinov, Alex and Mikhalchuk, Ilya and
Zhura, Iana and Tsetserukou, Dzmitry},
booktitle = {Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics (SMC)},
year = {2026},
note = {Submitted}
}
@article{zhura2026scaffdiff,
title = {ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion},
author = {Zhura, Iana and Sannikov, Mikhail and Agbasiere, Chidera and
Shaikhiev, Erik and Ogunwoye, Faith and Kozinov, Alex and
Mikhalchuk, Ilya and Tsetserukou, Dzmitry},
journal = {IEEE Robotics and Automation Letters (under review)},
year = {2026}
}
```
---
## Reproducibility
- **Code**: <https://github.com/A-C-Simon/sonata_ws> (upstream) and <https://github.com/msannikov03/sonata_ws> (development fork). All training and evaluation scripts, the 8-metric per-frame JSON archives backing every table, the 45-minute FT script, and the Mitsuba renderer for Fig. 1 are in the repo.
- **Compute**: a single NVIDIA RTX 4090 (24 GB) suffices for both training (~10 h, 30 epochs) and inference (4.78 FPS at 20 K / 20 K).
- **Data**: SemanticKITTI sequences 00–07, 09–10 (train) and 08 (val). The v2 ICP-refined GT (23,201 frames) is built by Algorithm 1 in the SMC paper; reconstruction script in `sonata-workspace/gt_refinement/`.
- **Eval scripts**:
- `evaluate_ral_metrics.py` — Protocol B (8 metrics, 20 K subsample, full val)
- `fair_scaffold_free_eval.py` — paired stride-80 50-frame matched-protocol comparison vs LiDiff / ScoreLiDAR
- `ablation_runner.py` — six scaffold-dominance ablations
- `evaluation_ddim.py` — 16-config DDIM sweep
- **Per-frame JSONs**:
- `eval_v1gt_fullval/teacher_v2gt_lidar_v1/all_metrics.json`
- `eval_v2gt_fullval/teacher_v2gt_lidar_v2/all_metrics.json`
- `eval_da2_fullval/teacher_v2gt_da2_v2/all_metrics.json`
- `evaluation_ddim_results.json`, `ablation_results.json`
---
## What changed vs. the old card
The previous version of this README is superseded. A summary of removed / updated claims:
| Old claim | Status | Replaced by |
|--- |--- |--- |
| "Modality-agnostic encoder" headline framing | Deprecated | Scaffold-dominant framing — the GT coordinate scaffold, not the encoder, is load-bearing (RA-L §V) |
| Teacher-vs-Student distillation tables (CD 0.039 / 0.040 m²) | Deprecated | RA-L random-init PTv3 control (Table III) shows there is no student / teacher gap to bridge |
| "45–1250× faster than published diffusion methods" / "24 ms / frame, 42 FPS" | Corrected | End-to-end 209 ms / frame (4.78 FPS), 143× LiDiff / 25× ScoreLiDAR (RA-L Table V) |
| Headline CD² 0.039 ± 0.009 m² (Protocol A, ~50 frames, 10 K subsample) | Superseded | Headline is now CD² 0.024 ± 0.005 m² (Protocol B, 4,071 frames, 20 K subsample) |
| Student-CD numbers / centering-bug erratum | Resolved | The student model is no longer part of the story — the SMC and RA-L drafts contain no student row |
| "Citation: Under review." | Updated | Two distinct submissions (SMC + RA-L), BibTeX above |
The two qualitative figures (`teacher_v2gt_sample1.png`, `teacher_v2gt_sample2.png`, `vae_v3_sample.png`) remain valid and accompany this release.
License: MIT (unchanged).