---
license: mit
pipeline_tag: other
tags:
- 3d-scene-completion
- diffusion
- point-cloud
- lidar
- depth-anything
- semantickitti
- scaffold-dominant
library_name: pytorch
---

# ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion

Single-step, scaffold-dominant diffusion for 3D scene completion on SemanticKITTI. Released alongside two paper submissions: an SMC 2026 short paper on the multi-token Gaussian VAE and ICP-refined ground truth (v2 GT), and an RA-L 2026 journal paper on scaffold dominance and a 45-minute mixed-scaffold deployment fine-tune. The pipeline runs at 209 ms/frame (4.78 FPS) on a single RTX 4090 — 143× faster than LiDiff and 25× faster than ScoreLiDAR — and the fine-tuned teacher beats both by 69–72 % in squared Chamfer distance in a paired, matched-protocol, scaffold-free comparison across 6 fine-tune seeds.

> **Note on the previous version of this card.** The earlier README framed this work around a "modality-agnostic encoder" and "teacher-vs-student distillation", with student CDs reported under a different metric protocol. Those numbers and that framing are superseded by the two paper drafts that this card now mirrors. See [What changed vs. the old card](#what-changed-vs-the-old-card) at the bottom.

---

## Two-paper context

### SMC 2026 (submitted Apr 20, 2026) — *PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain*

The companion short paper covers:

- **Multi-token Gaussian VAE** (7.1 M params): residual PointNet encoder + 32-query cross-attention pooler producing 32 × 1024-d Gaussian latent tokens; a transformer decoder with 5 cross-attention blocks reconstructs 8,000 scene points. Squared CD **0.120 ± 0.026 m²** at **1.6 ms/frame**, no codebook collapse (vs. ~16 m² for the VQ-VAE alternative).
- **Anchor-based ICP ground-truth refinement (v2 GT)**: per-scan ICP against a w=5 temporal window with displacement-gated acceptance, SOR/ROR cleanup, and 0.05 m voxelisation. Yields ~50× denser GT over 23,201 frames.
- **Single-step x₀ scaffold-free diffusion teacher** evaluated in a matched-protocol scaffold-free comparison vs. LiDiff and ScoreLiDAR on the same 50 frames, the same v2 GT, and the same bbox ±1 m crop.

### RA-L 2026 (submitted May 2026) — *ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion*

The journal paper extends the analysis with:

- **Scaffold-dominance ablations** (six controlled runs): zeroing the 108 M-parameter PTv3 encoder costs only +58 % CD²; cutting input density from 20 K to 2 K points costs only +2 %; but replacing the GT scaffold with random noise or a regular voxel grid degrades CD² by ≥4 orders of magnitude (>50,000× and ~43,000× respectively).
- **Random-init PTv3 control**: removing the 108 M parameters of self-supervised pretraining does not open a LiDAR↔DA2 gap (CD² 0.0279 vs. 0.0279 m²). The frozen encoder is therefore not the load-bearing component — the GT coordinate scaffold is.
- **45-minute mixed-scaffold fine-tune**: a 3-epoch fine-tune (lr 5e-5, 30 % LiDAR-cropped scaffold mix, t ∈ [50, 400], encoder frozen) that transfers the same denoiser to a deployment-realistic scaffold (per-frame LiDAR sweep, ego-bbox crop, **no GT access**) at squared CD **0.968 ± 0.194 m²** across 6 fine-tune seeds on a paired 50-frame protocol — **69–72 % below LiDiff / ScoreLiDAR** and 13× better than the pre-FT baseline.
- **Cross-sequence in-distribution evaluation** on seqs 00 / 05 / 08 (CD² range 6.7 %), a DDIM multi-step sweep (single-step at t=200 is optimal), a 500-frame scaffold-quality jitter sweep, an iterative self-scaffolding drift study, and an N-seed ensemble confirming that the LiDAR↔DA2 gap stays within sampling noise (max |ΔCD²| ≤ 7×10⁻⁴, paired Wilcoxon p > 0.75) at every N ∈ {1, 2, 4, 8}.

---

## Headline results

All numbers are squared symmetric Chamfer distance in m², matching the PVD / LiDiff convention. Linear CD in m is reported where it enables direct comparison.
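For reference, here is a minimal sketch of the squared symmetric Chamfer distance in this convention, assuming the sum-of-directional-means variant; the exact subsampling, cropping, and matching used in the tables are defined by the repo's eval scripts, not by this snippet:

```python
import numpy as np
from scipy.spatial import cKDTree

def squared_chamfer(pred: np.ndarray, gt: np.ndarray) -> float:
    """Squared symmetric Chamfer distance (m^2): mean squared nearest-neighbour
    distance pred -> gt, plus mean squared nearest-neighbour distance gt -> pred."""
    d_pred_to_gt, _ = cKDTree(gt).query(pred)    # NN distance in metres for each predicted point
    d_gt_to_pred, _ = cKDTree(pred).query(gt)    # NN distance in metres for each GT point
    return float(np.mean(d_pred_to_gt ** 2) + np.mean(d_gt_to_pred ** 2))
```

Identical clouds give 0; two single points 1 m apart give 1 + 1 = 2 m², which is why squared CD penalises outliers much more heavily than linear CD.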
### A) GT-scaffold full-validation, seq 08 (4,071 frames, 20 K subsample)

| Configuration | CD lin (m) | CD² (m²) | F@0.2 ↑ | H₉₅ (m) ↓ | Latency |
|---|---|---|---|---|---|
| Ours, **v1 GT** (standard accumulation) | 0.496 ± 0.05 | 0.396 ± 0.090 | 0.104 | 1.113 | 209 ms / 4.78 FPS |
| **Ours, v2 GT (ICP-refined)** | **0.138 ± 0.015** | **0.024 ± 0.005** | **0.844** | **0.285** | 209 ms / 4.78 FPS |

**Data-quality dominance**: a 16.5× CD² gain from GT refinement alone, with no model change.

### B) Matched-protocol scaffold-free comparison, v2 GT, paired stride-80 50 frames

| Method | Variant | CD² (m²) ↓ |
|---|---|---|
| LiDiff (Nunes et al., CVPR 2024) | 50-step DDPM, diff | 3.41 ± 2.55 |
| LiDiff | + refine head | 3.50 ± 2.62 |
| ScoreLiDAR (Zhang et al., ICCV 2025) | 8-step, diff | 3.19 ± 2.59 |
| ScoreLiDAR | + refine head | 3.15 ± 2.60 |
| Ours (teacher, GT-scaffold OOD here) | single-step x₀ | 12.58 ± 8.14 |
| **Ours (teacher-FT, mixed scaffold, 6 seeds)** | **single-step x₀ (kdtree match)** | **0.968 ± 0.194** |

**69–72 % below LiDiff / ScoreLiDAR** (6-seed across-seed mean ± std, epoch-2 selection; per-seed kdtree CD² 0.73 / 1.03 / 0.75 / 1.12 / 0.97 / 1.21). Paired Wilcoxon p < 1e-12 (seed 42 reference). End-to-end latency is unchanged at 209 ms/frame: 143× faster than LiDiff (30 s/frame, 50 steps) and 25× faster than ScoreLiDAR (5.37 s/frame, 8 steps). The non-diffusion LiNeXt (167 ms) is the only comparable-latency baseline.

### C) Cross-sequence in-distribution evaluation

| Sequence | n frames | CD² (m²) | F@0.2 | H₉₅ (m) |
|---|---|---|---|---|
| Seq 00 | 500 | 0.024 ± 0.004 | 0.844 | 0.284 |
| Seq 05 | 500 | 0.026 ± 0.004 | 0.825 | 0.293 |
| Seq 08 | 4,071 | 0.024 ± 0.005 | 0.844 | 0.285 |

The CD² range across the three sequences is 6.7 % of the smallest value — well below the per-frame std.
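The "single-step x₀" rows above invert the forward noising in one shot: under ε-prediction, x̂₀ = (x_t − √(1−ᾱ_t) ε̂) / √ᾱ_t. A minimal sketch of that inversion with a stand-in for the denoiser's output; the real schedule, the denoiser, and the quoted ᾱ₂₀₀ ≈ 0.748 come from the repo, not from this snippet:

```python
import numpy as np

def single_step_x0(x_t: np.ndarray, eps_hat: np.ndarray, alpha_bar_t: float) -> np.ndarray:
    """One-shot x0 estimate from a noised scaffold under epsilon-prediction:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

# Forward-noise a stand-in GT scaffold at t = 200 (abar value taken from the card).
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8000, 3))          # stand-in for GT scaffold coordinates
eps = rng.normal(size=(8000, 3))         # the true noise
abar_200 = 0.748
x_t = np.sqrt(abar_200) * x0 + np.sqrt(1.0 - abar_200) * eps

# With a perfect epsilon prediction, a single step recovers x0 exactly;
# in practice the denoiser's eps_hat only approximates eps.
x0_hat = single_step_x0(x_t, eps, abar_200)
assert np.allclose(x0_hat, x0)
```

Because the scaffold already carries most of the scene geometry, one such step at t = 200 suffices; the DDIM sweep in the RA-L paper finds no benefit from multi-step sampling.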
### D) Modality-agnostic full-val (4,071 frames, v2 GT)

LiDAR vs. DA2 (Depth Anything V2 monocular pseudo-LiDAR):

- CD lin: 0.1381 vs 0.1381 m
- CD²: 0.0241 vs 0.0241 m²
- JSD: 0.0337 vs 0.0337
- F@0.2: 0.8439 vs 0.8435
- H₉₅: 0.2854 vs 0.2863 m
- |ΔCD²| < 4×10⁻⁴ m²; Wilcoxon p > 0.75.

The result also holds with a **randomly initialised** PTv3 encoder.

### E) VAE reconstruction (SMC paper)

| Metric | Value |
|---|---|
| Squared CD | 0.120 ± 0.026 m² |
| Parameters | 7.1 M |
| Inference | 1.6 ms / frame |
| Decoded points | 8,000 |
| Latent | 32 tokens × 1024-d |

---

## Available checkpoints

| File | Size | What it is | Where used in papers |
|---|---|---|---|
| `teacher_v2gt/best_model.pth` | 542 MB | Diffusion teacher trained with a GT-coordinate scaffold on v2 GT, 30 ep | SMC + RA-L Tables I, IV, V, VI |
| `teacher_v2gt_ft/best_model.pth` *(NEW)* | 470 MB | Mixed-scaffold fine-tune (3 ep, 45 min, 30 % LiDAR-cropped scaffold mix), seed-42 reference; 6-seed evaluation in the RA-L paper | RA-L Table III (the 0.968 m² row) |
| `teacher_v1gt/best_model.pth` | 542 MB | Same architecture trained on v1 (standard) GT — for the 16× refinement comparison | SMC + RA-L Table I |
| `vae_v3/best_point_vae.pth` | ~28 MB | Multi-token Gaussian VAE (7.1 M params, 32 × 1024-d latent) | SMC §III-B, Table II |

All four share the frozen Sonata / PTv3 encoder (108 M params), loaded from the upstream Sonata release; only the small denoiser / VAE heads are trained.

---

## Architecture

- **Encoder**: frozen Point Transformer V3 / Sonata (108 M params, self-supervised pretraining), `voxel_size=0.05 m`, up to 20,000 input points → 256-dim per-point conditioning features.
- **Denoiser**: 8.9 M-parameter PTv3-style grouped vector-attention network, ε-prediction under a cosine schedule (1,000 timesteps).
- **Inference**: scaffold-based single-step x₀ at **t = 200** (ᾱ₂₀₀ ≈ 0.748, SNR ≈ 2.96).
  The denoiser refines a noised-GT scaffold rather than generating points from pure Gaussian noise — this is the load-bearing design choice.
- **Refinement**: lightweight kNN interpolation, ~10 ms.
- **End-to-end**: 129 ms encoder + 70 ms denoiser + 10 ms refinement = **209 ms / frame** on an RTX 4090.

---

## Quick-start usage

The reference inference and evaluation entry points are in the GitHub repo:

| Use case | Script |
|---|---|
| In-distribution (GT-scaffold) full-val on seq 08 | `evaluate_ral_metrics.py --config teacher_v2gt_lidar_v2 --num_frames 4071` |
| In-distribution + custom sequence | `evaluate_ral_metrics_v2.py --config ... --sequence 00 --num_frames 500` |
| Paired matched-protocol scaffold-free vs LiDiff/ScoreLiDAR | `run_scaffoldfree_fair_str80.py` (pre-FT) / `run_scaffoldfree_fair_finetuned.py` (FT) |
| Mixed-scaffold deployment fine-tune (45 min, RTX 4090) | `finetune_mixed_scaffold.py` |
| Per-frame paired Wilcoxon eval | `eval_wilcoxon.py` |

Loading: each `.pth` is a standard PyTorch state-dict checkpoint of the diffusion teacher (frozen Sonata/PTv3 encoder + 8.9 M-parameter denoiser). See the `build_teacher_model` helper in `evaluate_ral_metrics_v2.py` for the construction.

For the original (pre-fine-tune) checkpoint use `teacher_v2gt/best_model.pth` with a noised-GT scaffold; this gives the 0.024 m² CD² in-distribution number but is out-of-distribution on per-frame LiDAR-only scaffolds (RA-L §IV-D / Table IV pre-FT rows). For deployment-style scaffolds (per-frame LiDAR sweep, ego-bbox crop, no GT) use the fine-tuned checkpoint `teacher_v2gt_ft/best_model.pth`, which delivers 0.727 m² CD² on the same matched protocol (seed-42 reference; 6-seed mean 0.968 ± 0.194 m²).

---

## Citation

Both papers are under review; please cite the appropriate venue.
```bibtex
@inproceedings{agbasiere2026pointdiffusion,
  title     = {PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain},
  author    = {Agbasiere, Chidera and Sannikov, Mikhail and Ogunwoye, Faith and Shaikhiev, Erik and Kozinov, Alex and Mikhalchuk, Ilya and Zhura, Iana and Tsetserukou, Dzmitry},
  booktitle = {Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics (SMC)},
  year      = {2026},
  note      = {Submitted}
}

@article{zhura2026scaffdiff,
  title   = {ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion},
  author  = {Zhura, Iana and Sannikov, Mikhail and Agbasiere, Chidera and Shaikhiev, Erik and Ogunwoye, Faith and Kozinov, Alex and Mikhalchuk, Ilya and Tsetserukou, Dzmitry},
  journal = {IEEE Robotics and Automation Letters (under review)},
  year    = {2026}
}
```

---

## Reproducibility

- **Code**: (upstream) and (development fork). All training and evaluation scripts, the 8-metric per-frame JSON archives backing every table, the 45-minute FT script, and the Mitsuba renderer for Fig. 1 are in the repo.
- **Compute**: a single NVIDIA RTX 4090 (24 GB) suffices for both training (~10 h, 30 epochs) and inference (4.78 FPS at 20 K / 20 K).
- **Data**: SemanticKITTI sequences 00–07 and 09–10 (train), 08 (val). The v2 ICP-refined GT (23,201 frames) is built by Algorithm 1 in the SMC paper; the reconstruction script is in `sonata-workspace/gt_refinement/`.
- **Eval scripts**:
  - `evaluate_ral_metrics.py` — Protocol B (8 metrics, 20 K subsample, full val)
  - `fair_scaffold_free_eval.py` — paired stride-80 50-frame matched-protocol comparison vs LiDiff / ScoreLiDAR
  - `ablation_runner.py` — the six scaffold-dominance ablations
  - `evaluation_ddim.py` — 16-config DDIM sweep
- **Per-frame JSONs**:
  - `eval_v1gt_fullval/teacher_v2gt_lidar_v1/all_metrics.json`
  - `eval_v2gt_fullval/teacher_v2gt_lidar_v2/all_metrics.json`
  - `eval_da2_fullval/teacher_v2gt_da2_v2/all_metrics.json`
  - `evaluation_ddim_results.json`, `ablation_results.json`

---

## What changed vs. the old card

The previous version of this README is superseded. A summary of removed / updated claims:

| Old claim | Status | Replaced by |
|---|---|---|
| "Modality-agnostic encoder" headline framing | Deprecated | Scaffold-dominant framing — the GT coordinate scaffold, not the encoder, is load-bearing (RA-L §V) |
| Teacher-vs-student distillation tables (CD 0.039 / 0.040 m²) | Deprecated | The RA-L random-init PTv3 control (Table III) shows there is no student / teacher gap to bridge |
| "45–1250× faster than published diffusion methods" / "24 ms / frame, 42 FPS" | Corrected | End-to-end 209 ms / frame (4.78 FPS), 143× LiDiff / 25× ScoreLiDAR (RA-L Table V) |
| Headline CD² 0.039 ± 0.009 m² (Protocol A, ~50 frames, 10 K subsample) | Superseded | Headline is now CD² 0.024 ± 0.005 m² (Protocol B, 4,071 frames, 20 K subsample) |
| Student-CD numbers / centering-bug erratum | Resolved | The student model is no longer part of the story — the SMC and RA-L drafts contain no student row |
| "Citation: Under review." | Updated | Two distinct submissions (SMC + RA-L), BibTeX above |

The three qualitative figures (`teacher_v2gt_sample1.png`, `teacher_v2gt_sample2.png`, `vae_v3_sample.png`) remain valid and accompany this release.

License: MIT (unchanged).