---
license: mit
pipeline_tag: other
tags:
  - 3d-scene-completion
  - diffusion
  - point-cloud
  - lidar
  - depth-anything
  - semantickitti
  - scaffold-dominant
library_name: pytorch
---

# ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion

Single-step, scaffold-dominant diffusion for 3D scene completion on SemanticKITTI. Released alongside two paper submissions: an SMC 2026 short paper on the multi-token Gaussian VAE and ICP-refined ground truth (v2 GT), and an RA-L 2026 journal paper on scaffold-dominance and a 45-minute mixed-scaffold deployment fine-tune. The pipeline runs at 209 ms/frame (4.78 FPS) on a single RTX 4090 — 143× faster than LiDiff and 25× faster than ScoreLiDAR — and the fine-tuned teacher achieves 69–72 % lower squared Chamfer distance than either in a paired, matched-protocol scaffold-free comparison across 6 fine-tune seeds.

> Note on the previous version of this card. The earlier README framed this work around "modality-agnostic encoder" and "teacher-vs-student distillation," with student CDs reported on a different metric protocol. Those numbers and that framing are superseded by the two paper drafts that this card now mirrors. See [What changed vs. the old card](#what-changed-vs-the-old-card) at the bottom.

---

## Two-paper context

### SMC 2026 (submitted Apr 20, 2026) — *PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain*
The companion short paper covers:
- **Multi-token Gaussian VAE** (7.1 M params): residual PointNet encoder + 32-query cross-attention pooler producing 32 × 1024-d Gaussian latent tokens; transformer decoder with 5 cross-attention blocks reconstructs 8,000 scene points. Squared CD **0.120 ± 0.026 m²** at **1.6 ms / frame**, no codebook collapse (vs ~16 m² for the VQ-VAE alternative).
- **Anchor-based ICP ground-truth refinement (v2 GT)**: per-scan ICP against a w=5 temporal window with displacement-gated acceptance, SOR/ROR cleanup, 0.05 m voxelisation. Yields ~50× denser GT over 23,201 frames.
- **Single-step x₀ scaffold-free diffusion teacher** evaluated under a matched-protocol scaffold-free comparison vs LiDiff and ScoreLiDAR on the same 50 frames / same v2 GT / same bbox ±1 m crop.
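
The 0.05 m voxelisation step in the v2 GT pipeline can be sketched in a few lines of numpy. This is an illustrative stand-in only (the function name `voxel_downsample` and the centroid-per-voxel choice are assumptions, not the repo's implementation in `sonata-workspace/gt_refinement/`):

```python
# Illustrative 0.05 m voxel-grid downsampling (centroid per occupied voxel).
# Assumption: one representative point per voxel; the repo's code may differ.
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float = 0.05) -> np.ndarray:
    """Collapse an (N, 3) point cloud to one centroid per occupied voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    # Group points by integer voxel key and average each group.
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True,
                                   return_counts=True)
    sums = np.zeros((counts.shape[0], 3))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]

pts = np.random.default_rng(0).random((10_000, 3)) * 10.0  # synthetic 10 m cube
down = voxel_downsample(pts)
print(down.shape)  # far fewer than 10,000 rows, 3 columns
```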

### RA-L 2026 (submitted May 2026) — *ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion*
The journal paper extends the analysis with:
- **Scaffold-dominance ablations** (six controlled runs): zeroing the 108 M PTv3 encoder costs only +58 % CD²; cutting input density from 20 K to 2 K points costs only +2 %; but replacing the GT scaffold with random noise or a regular voxel grid degrades CD² by ≥4 orders of magnitude (>50,000× and ~43,000× respectively).
- **Random-init PTv3 control**: removing the 108 M parameters of self-supervised pretraining does not open a LiDAR↔DA2 gap (CD² 0.0279 vs 0.0279 m²). The frozen encoder is therefore not the load-bearing component — the GT coordinate scaffold is.
- **45-minute mixed-scaffold fine-tune**: 3-epoch fine-tune (lr 5e-5, 30 % LIDAR-cropped scaffold mix, t ∈ [50, 400], encoder frozen) that transfers the same denoiser to a deployment-realistic scaffold (per-frame LiDAR sweep, ego-bbox crop, **no GT access**) at squared CD **0.968 ± 0.194 m²** across 6 fine-tune seeds on a paired 50-frame protocol — **69–72 % below LiDiff / ScoreLiDAR**, 13× over the pre-FT baseline.
- **Cross-sequence in-distribution evaluation** on seqs 00 / 05 / 08 (range 6.7 % in CD²), DDIM multi-step sweep (single-step at t=200 is optimal), 500-frame scaffold-quality jitter sweep, iterative self-scaffolding drift study, and an N-seed ensemble that confirms the LiDAR↔DA2 gap stays within sampling noise (max |ΔCD²| ≤ 7×10⁻⁴, paired Wilcoxon p > 0.75) at every N ∈ {1, 2, 4, 8}.

---

## Headline results

All numbers are squared symmetric Chamfer distance in m², matching the PVD / LiDiff convention. Linear CD in m is reported where it enables direct comparison.
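
For reference, the squared symmetric Chamfer distance can be computed as the sum of the two directed mean squared nearest-neighbour distances. This is a sketch of the convention as described above (check the repo's eval scripts for the exact reduction used in each table):

```python
# Squared symmetric Chamfer distance, sum-of-means form (PVD / LiDiff
# convention as stated above; verify against the repo's eval scripts).
import numpy as np
from scipy.spatial import cKDTree

def chamfer_squared(pred: np.ndarray, gt: np.ndarray) -> float:
    d_pg, _ = cKDTree(gt).query(pred)    # pred -> gt NN distances (m)
    d_gp, _ = cKDTree(pred).query(gt)    # gt -> pred NN distances (m)
    return float(np.mean(d_pg ** 2) + np.mean(d_gp ** 2))

pred = np.array([[0.0, 0.0, 0.0]])
gt = np.array([[1.0, 0.0, 0.0]])
print(chamfer_squared(pred, gt))  # 1.0 + 1.0 = 2.0
```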

### A) GT-scaffold full-validation, seq 08 (4,071 frames, 20 K subsample)
| Configuration                          | CD lin (m)   | CD² (m²)            | F@0.2 ↑ | H₉₅ (m) ↓ | Latency        |
|---                                     |---           |---                  |---      |---        |---             |
| Ours, **v1 GT** (standard accumulation)| 0.496 ± 0.05 | 0.396 ± 0.090       | 0.104   | 1.113     | 209 ms / 4.78 FPS |
| **Ours, v2 GT (ICP-refined)**          | **0.138 ± 0.015** | **0.024 ± 0.005** | **0.844** | **0.285** | 209 ms / 4.78 FPS |

**Data-quality dominance**: 16.5× CD² gain from GT refinement alone, no model change.

### B) Matched-protocol scaffold-free comparison, v2 GT, paired stride-80 50 frames
| Method                                  | Variant                          | CD² (m²) ↓        |
|---                                      |---                               |---                |
| LiDiff (Nunes et al., CVPR 2024)        | 50-step DDPM, diff               | 3.41 ± 2.55       |
| LiDiff                                  | + refine head                    | 3.50 ± 2.62       |
| ScoreLiDAR (Zhang et al., ICCV 2025)    | 8-step, diff                     | 3.19 ± 2.59       |
| ScoreLiDAR                              | + refine head                    | 3.15 ± 2.60       |
| Ours (teacher, GT-scaffold OOD here)    | single-step x₀                   | 12.58 ± 8.14      |
| **Ours (teacher-FT, mixed scaffold, 6 seeds)** | **single-step x₀ (kdtree match)** | **0.968 ± 0.194** |

**69–72 % below LiDiff / ScoreLiDAR** (6-seed across-seed mean ± std, epoch-2 selection; per-seed kdtree CD² 0.73 / 1.03 / 0.75 / 1.12 / 0.97 / 1.21). Paired Wilcoxon p < 1e-12 (seed 42 reference). End-to-end latency is unchanged at 209 ms/frame: 143× faster than LiDiff (30 s/frame, 50 steps), 25× faster than ScoreLiDAR (5.37 s/frame, 8 steps). Non-diffusion LiNeXt (167 ms) is the only comparable-latency baseline.

### C) Cross-sequence in-distribution evaluation
| Sequence | n frames | CD² (m²)         | F@0.2 | H₉₅ (m) |
|---       |---       |---               |---    |---      |
| Seq 00   | 500      | 0.024 ± 0.004    | 0.844 | 0.284   |
| Seq 05   | 500      | 0.026 ± 0.004    | 0.825 | 0.293   |
| Seq 08   | 4,071    | 0.024 ± 0.005    | 0.844 | 0.285   |

CD² range across the three sequences is 6.7 % of the smallest value — well below per-frame std.

### D) Modality-agnostic full-val (4,071 frames, v2 GT)
LiDAR vs DA2 (Depth Anything V2 monocular pseudo-LiDAR):
- CD lin: 0.1381 vs 0.1381 m
- CD²: 0.0241 vs 0.0241 m²
- JSD: 0.0337 vs 0.0337
- F@0.2: 0.8439 vs 0.8435
- H₉₅: 0.2854 vs 0.2863 m
- |ΔCD²| < 4×10⁻⁴ m²; Wilcoxon p > 0.75. Result also holds with a **randomly initialised** PTv3 encoder.
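
The paired Wilcoxon test above operates on per-frame CD² values. A minimal sketch with synthetic stand-in data (the real per-frame values live in the JSON archives listed under Reproducibility):

```python
# Paired Wilcoxon signed-rank test on per-frame CD² values (sketch with
# synthetic stand-ins; real per-frame numbers are in the repo's JSONs).
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
cd2_lidar = rng.normal(0.024, 0.005, size=4071)       # stand-in per-frame CD²
cd2_da2 = cd2_lidar + rng.normal(0.0, 1e-4, size=4071)  # tiny symmetric jitter

stat, p = wilcoxon(cd2_lidar, cd2_da2)  # paired: same frames, two modalities
print(f"p = {p:.3f}")
```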

### E) VAE reconstruction (SMC paper)
| Metric | Value |
|---     |---    |
| Squared CD | 0.120 ± 0.026 m² |
| Parameters | 7.1 M |
| Inference  | 1.6 ms / frame |
| Decoded points | 8,000 |
| Latent | 32 tokens × 1024-d |
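
Sampling from the 32 × 1024-d Gaussian latent uses the standard VAE reparameterisation trick. A numpy sketch with the shapes from this card (function name and the unit-Gaussian example inputs are illustrative, not the repo's code):

```python
# Reparameterised sampling from the 32 x 1024-d Gaussian latent tokens
# (standard VAE trick; shapes from this card, names are illustrative).
import numpy as np

def sample_latent(mu: np.ndarray, logvar: np.ndarray,
                  rng: np.random.Generator) -> np.ndarray:
    """z = mu + sigma * eps, with eps ~ N(0, I) and sigma = exp(logvar / 2)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
mu = np.zeros((32, 1024))      # 32 tokens x 1024 dims
logvar = np.zeros((32, 1024))  # unit variance for this example
z = sample_latent(mu, logvar, rng)
print(z.shape)  # (32, 1024)
```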

---

## Available checkpoints

| File                                          | Size  | What it is                                                                 | Where used in papers           |
|---                                            |---    |---                                                                         |---                             |
| `teacher_v2gt/best_model.pth`                 | 542 MB | Diffusion teacher trained with GT-coordinate scaffold on v2 GT, 30 ep      | SMC + RA-L Tables I, IV, V, VI |
| `teacher_v2gt_ft/best_model.pth` *(NEW)*      | 470 MB | Mixed-scaffold fine-tune (3 ep, 45 min, 30 % LIDAR-cropped scaffold mix), seed 42 reference; 6-seed evaluation in RA-L paper | RA-L Table III (the 0.968 m² row) |
| `teacher_v1gt/best_model.pth`                 | 542 MB | Same architecture trained on v1 (standard) GT — for the 16× refinement comparison | SMC + RA-L Table I             |
| `vae_v3/best_point_vae.pth`                   | ~28 MB | Multi-token Gaussian VAE (7.1 M params, 32 × 1024-d latent)                 | SMC §III-B, Table II           |

All four share the frozen Sonata / PTv3 encoder (108 M params), loaded from the upstream Sonata release; only the small denoiser / VAE heads are trained.

---

## Architecture

- **Encoder**: frozen Point Transformer V3 / Sonata (108 M params, self-supervised pretraining), `voxel_size=0.05 m`, up to 20,000 input points → 256-dim per-point conditioning features.
- **Denoiser**: 8.9 M-parameter PTv3-style grouped vector-attention network, ε-prediction under cosine schedule (1,000 timesteps).
- **Inference**: scaffold-based single-step x₀ at **t = 200** (ᾱ₂₀₀ ≈ 0.748, SNR ≈ 2.96). The denoiser refines a noised-GT scaffold rather than generating points from pure Gaussian noise — this is the load-bearing design choice.
- **Refinement**: lightweight kNN-interpolation, ~10 ms.
- **End-to-end**: 129 ms encoder + 70 ms denoiser + 10 ms refinement = **209 ms / frame** on RTX 4090.
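
The single-step x₀ estimate is the standard DDPM identity x̂₀ = (x_t − √(1−ᾱ_t) ε̂) / √ᾱ_t applied once at t = 200. A numpy sketch with ᾱ₂₀₀ taken from this card (the full noise schedule and the real denoiser live in the repo); with a perfect noise estimate the identity recovers x₀ exactly:

```python
# Single-step x0 estimate from a noised-GT scaffold (standard DDPM identity;
# alpha_bar at t=200 taken from this card, schedule details are in the repo).
import numpy as np

ALPHA_BAR_200 = 0.748  # from this card; SNR = 0.748 / 0.252 ~ 2.96

def noise_scaffold(x0, eps, a_bar=ALPHA_BAR_200):
    """Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def single_step_x0(x_t, eps_hat, a_bar=ALPHA_BAR_200):
    """x0_hat = (x_t - sqrt(1 - a_bar) * eps_hat) / sqrt(a_bar)."""
    return (x_t - np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a_bar)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8000, 3))        # stand-in for GT scene coordinates
eps = rng.standard_normal(x0.shape)
x_t = noise_scaffold(x0, eps)          # the noised-GT scaffold
# Sanity check: a perfect noise estimate inverts the forward process.
print(np.allclose(single_step_x0(x_t, eps), x0))  # True
```

In deployment the denoiser's ε̂ replaces the true noise, so the estimate is approximate; the scaffold keeps x_t close to the data manifold, which is the load-bearing design choice described above.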

---

## Quick-start usage

The reference inference and evaluation entry points are in the GitHub repo
(<https://github.com/A-C-Simon/sonata_ws/tree/main/sonata-workspace>):

| Use case | Script |
|---|---|
| In-distribution (GT-scaffold) full-val on seq 08 | `evaluate_ral_metrics.py --config teacher_v2gt_lidar_v2 --num_frames 4071` |
| In-distribution + custom sequence | `evaluate_ral_metrics_v2.py --config ... --sequence 00 --num_frames 500` |
| Paired matched-protocol scaffold-free vs LiDiff/ScoreLiDAR | `run_scaffoldfree_fair_str80.py` (pre-FT) / `run_scaffoldfree_fair_finetuned.py` (FT) |
| Mixed-scaffold deployment fine-tune (45 min, RTX 4090) | `finetune_mixed_scaffold.py` |
| Per-frame paired Wilcoxon eval | `eval_wilcoxon.py` |

Loading: each `.pth` is a standard PyTorch state-dict checkpoint of the diffusion teacher (frozen Sonata/PTv3 encoder + 8.9 M-parameter denoiser). See the `build_teacher_model` helper in `evaluate_ral_metrics_v2.py` for the construction.

For the original (pre-fine-tune) checkpoint use `teacher_v2gt/best_model.pth` and a noised-GT scaffold; this gives the 0.024 m² CD² in-distribution number but is out-of-distribution on per-frame LiDAR-only scaffolds (RA-L §IV-D / Table IV pre-FT rows). For deployment-style scaffolds (per-frame LiDAR sweep, ego-bbox crop, no GT) use the fine-tuned checkpoint `teacher_v2gt_ft/best_model.pth`, which delivers 0.727 m² CD² on the same matched protocol for the seed-42 reference run (0.968 ± 0.194 m² mean across the 6 fine-tune seeds).
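
A generic loading sketch for the state-dict checkpoints above (illustrative only: the container-key unwrapping is an assumption, and the real model construction should come from the repo's `build_teacher_model` helper):

```python
# Generic state-dict checkpoint loading (sketch; use the repo's
# build_teacher_model helper in evaluate_ral_metrics_v2.py for the
# actual model construction -- the key unwrapping here is an assumption).
import torch

def load_checkpoint(path: str, model: torch.nn.Module) -> torch.nn.Module:
    state = torch.load(path, map_location="cpu")
    # Some training scripts wrap weights under a container key (assumption).
    for key in ("model", "state_dict"):
        if isinstance(state, dict) and key in state:
            state = state[key]
            break
    model.load_state_dict(state)
    model.eval()
    return model
```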

---

## Citation

Both papers are under review; please cite the appropriate venue.

```bibtex
@inproceedings{agbasiere2026pointdiffusion,
  title     = {PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain},
  author    = {Agbasiere, Chidera and Sannikov, Mikhail and Ogunwoye, Faith and
               Shaikhiev, Erik and Kozinov, Alex and Mikhalchuk, Ilya and
               Zhura, Iana and Tsetserukou, Dzmitry},
  booktitle = {Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics (SMC)},
  year      = {2026},
  note      = {Submitted}
}

@article{zhura2026scaffdiff,
  title   = {ScaffDiff: Scaffold-Dominant Diffusion for 3D Scene Completion},
  author  = {Zhura, Iana and Sannikov, Mikhail and Agbasiere, Chidera and
             Shaikhiev, Erik and Ogunwoye, Faith and Kozinov, Alex and
             Mikhalchuk, Ilya and Tsetserukou, Dzmitry},
  journal = {IEEE Robotics and Automation Letters (under review)},
  year    = {2026}
}
```

---

## Reproducibility

- **Code**: <https://github.com/A-C-Simon/sonata_ws> (upstream) and <https://github.com/msannikov03/sonata_ws> (development fork). All training and evaluation scripts, the 8-metric per-frame JSON archives backing every table, the 45-minute FT script, and the Mitsuba renderer for Fig. 1 are in the repo.
- **Compute**: a single NVIDIA RTX 4090 (24 GB) suffices for both training (~10 h, 30 epochs) and inference (4.78 FPS at 20 K / 20 K).
- **Data**: SemanticKITTI sequences 00–07, 09–10 (train) and 08 (val). The v2 ICP-refined GT (23,201 frames) is built by Algorithm 1 in the SMC paper; reconstruction script in `sonata-workspace/gt_refinement/`.
- **Eval scripts**:
  - `evaluate_ral_metrics.py` — Protocol B (8 metrics, 20 K subsample, full val)
  - `fair_scaffold_free_eval.py` — paired stride-80 50-frame matched-protocol comparison vs LiDiff / ScoreLiDAR
  - `ablation_runner.py` — six scaffold-dominance ablations
  - `evaluation_ddim.py` — 16-config DDIM sweep
- **Per-frame JSONs**:
  - `eval_v1gt_fullval/teacher_v2gt_lidar_v1/all_metrics.json`
  - `eval_v2gt_fullval/teacher_v2gt_lidar_v2/all_metrics.json`
  - `eval_da2_fullval/teacher_v2gt_da2_v2/all_metrics.json`
  - `evaluation_ddim_results.json`, `ablation_results.json`

---

## What changed vs. the old card

The previous version of this README is superseded. A summary of removed / updated claims:

| Old claim                                                                            | Status     | Replaced by                                                                                       |
|---                                                                                   |---         |---                                                                                                |
| "Modality-agnostic encoder" headline framing                                         | Deprecated | Scaffold-dominant framing — the GT coordinate scaffold, not the encoder, is load-bearing (RA-L §V) |
| Teacher-vs-Student distillation tables (CD 0.039 / 0.040 m²)                         | Deprecated | RA-L random-init PTv3 control (Table III) shows there is no student / teacher gap to bridge       |
| "45–1250× faster than published diffusion methods" / "24 ms / frame, 42 FPS"         | Corrected  | End-to-end 209 ms / frame (4.78 FPS), 143× LiDiff / 25× ScoreLiDAR (RA-L Table V)                 |
| Headline CD² 0.039 ± 0.009 m² (Protocol A, ~50 frames, 10 K subsample)               | Superseded | Headline is now CD² 0.024 ± 0.005 m² (Protocol B, 4,071 frames, 20 K subsample)                   |
| Student-CD numbers / centering-bug erratum                                           | Resolved   | The student model is no longer part of the story — the SMC and RA-L drafts contain no student row |
| "Citation: Under review."                                                            | Updated    | Two distinct submissions (SMC + RA-L), BibTeX above                                               |

The two qualitative figures (`teacher_v2gt_sample1.png`, `teacher_v2gt_sample2.png`, `vae_v3_sample.png`) remain valid and accompany this release.

License: MIT (unchanged).