# Sonata Scene Completion: Diffusion-Based 3D Scene Completion from LiDAR and RGB

Checkpoints for cross-modal diffusion-based 3D scene completion on SemanticKITTI.
## Qualitative Results

### Scene Completion: Input vs Teacher vs Student vs Ground Truth

Each image shows, left to right: Input (sparse LiDAR scan) | Teacher (LiDAR, CD 0.039) | Student (RGB-only, CD 0.242 mean / 0.119 median) | Ground Truth.
### VAE v3 Reconstruction

Panels, left to right: Input (LiDAR scan) | VAE v3 Reconstruction | Ground Truth.
## Models

### v2 GT (ICP-refined ground truth, 50x denser)

| Model | Path | CD | Input | Description |
|---|---|---|---|---|
| Teacher v2GT | teacher_v2gt/best_model.pth | 0.039 +/- 0.009 | LiDAR | Direct diffusion; frozen Sonata/PTv3 encoder (108M) + DenoisingNetwork (8.9M) |
| Student v2GT | student_v2gt/best_model.pth | 0.242 +/- 0.266 | RGB (DA2 pseudo-depth) | Task-loss-only distillation from the teacher |
### v1 GT (original accumulated ground truth)

| Model | Path | CD | Input | Description |
|---|---|---|---|---|
| Teacher v1GT | teacher_v1gt/best_model.pth | 0.608 +/- 0.142 | LiDAR | Same architecture, trained on v1 GT |
| Student v1GT | student_v1gt/best_model.pth | 0.738 +/- 0.245 | RGB (DA2 pseudo-depth) | Task-loss-only distillation from the v1 teacher |
### VAE (reconstruction, not scene completion)

| Model | Path | CD | Description |
|---|---|---|---|
| VAE v3 | vae_v3/best_point_vae.pth | 0.120 +/- 0.026 | VecSet-style cross-attention, 32 tokens, 7.1M params |
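The CD values above are Chamfer distances between predicted and ground-truth point clouds. A minimal NumPy sketch of the symmetric form — whether the reported numbers sum or average the two directions is an assumption, not something this repo states:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Sum of the mean nearest-neighbour Euclidean distances in both directions;
    the exact reduction used for the reported CD numbers is an assumption.
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

For full LiDAR scans, a KD-tree (e.g. `scipy.spatial.cKDTree`) avoids the O(NM) memory of the dense pairwise matrix.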
## Architecture
- Encoder: Frozen Sonata/PTv3 (108M params, pretrained)
- Denoiser: DenoisingNetwork (8.9M trainable params)
- Diffusion: 1000 timesteps, cosine schedule, epsilon-prediction
- Inference: Single-step x0 prediction at t=200
- Distillation: Task-loss-only (alignment losses are harmful due to cross-modal gradient conflicts)
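The inference step above — a single x0 estimate at t=200 under a cosine schedule — can be sketched as follows. The `denoiser` call signature and the conditioning interface are assumptions for illustration, not the repo's actual DenoisingNetwork API:

```python
import torch

def cosine_alpha_bar(T=1000, s=0.008):
    # Cumulative noise schedule alpha_bar(t) for the cosine schedule
    # (Nichol & Dhariwal, 2021), tabulated for t = 0..T.
    t = torch.linspace(0, T, T + 1)
    f = torch.cos((t / T + s) / (1 + s) * torch.pi / 2) ** 2
    return (f / f[0]).clamp(1e-5, 1.0)

@torch.no_grad()
def single_step_x0(denoiser, shape, cond, t=200, T=1000):
    # One-shot x0 estimate: predict epsilon at timestep t and invert the
    # forward process x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps in closed form.
    ab = cosine_alpha_bar(T)[t]
    x_t = torch.randn(shape)  # illustrative: pure noise treated as a t-step sample
    eps = denoiser(x_t, torch.full((shape[0],), t), cond)
    return (x_t - torch.sqrt(1 - ab) * eps) / torch.sqrt(ab)
```

A single step at a moderate t trades the iterative refinement of full ancestral sampling for a large inference speedup.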
## Key Finding: Gradient Conflicts in Cross-Modal Distillation

Standard multi-loss distillation (task + output matching + feature alignment + structural) fails for cross-modal generative distillation:

- Feature-alignment gradients conflict with task-loss gradients (cosine similarity = -0.023, negative in 58% of batches)
- The structural-loss gradient magnitude is 18x that of the task loss, drowning out the task signal
- Task-loss-only distillation achieves the best student performance
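The conflict statistic can be measured per batch by comparing the gradients each loss induces on the shared student parameters. A minimal sketch — `grad_cosine` is a hypothetical helper, not code from this repo:

```python
import torch

def grad_cosine(model, loss_a, loss_b):
    """Cosine similarity between the flattened parameter gradients of two losses.

    Negative values mean the losses pull the shared parameters in conflicting
    directions, as reported for feature alignment vs. the task loss.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    ga = torch.autograd.grad(loss_a, params, retain_graph=True, allow_unused=True)
    gb = torch.autograd.grad(loss_b, params, retain_graph=True, allow_unused=True)

    def flatten(grads):
        # Parameters unused by a loss get zero gradient.
        return torch.cat([
            (g if g is not None else torch.zeros_like(p)).reshape(-1)
            for g, p in zip(grads, params)
        ])

    va, vb = flatten(ga), flatten(gb)
    return torch.dot(va, vb) / (va.norm() * vb.norm() + 1e-12)
```

Tracking this value (and the gradient-norm ratio) across batches is what surfaces conflicts that a single blended loss curve hides.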
## Sensor Degradation Robustness

The student (RGB-based) is inherently robust to LiDAR sensor degradation, since it does not consume LiDAR at inference. The teacher degrades under angular sector occlusion, with a teacher/student crossover at roughly 13 degrees of occlusion; under beam dropout, Gaussian noise, and depth-estimation noise, the teacher remains robust.
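These corruptions can be simulated directly on the input point cloud. A minimal sketch of two of them — the sector geometry and parameter ranges used in the actual evaluation are assumptions:

```python
import numpy as np

def angular_occlusion(points, sector_deg, center_deg=0.0):
    # Remove every return whose azimuth falls inside an occluded sector,
    # mimicking a blocked portion of the LiDAR field of view.
    az = np.degrees(np.arctan2(points[:, 1], points[:, 0]))
    diff = (az - center_deg + 180.0) % 360.0 - 180.0  # wrap to (-180, 180]
    return points[np.abs(diff) > sector_deg / 2.0]

def beam_dropout(points, drop_prob, rng=None):
    # Randomly discard individual returns with probability drop_prob.
    rng = np.random.default_rng(rng)
    return points[rng.random(len(points)) >= drop_prob]
```

Sweeping `sector_deg` and plotting teacher vs. student CD is how a crossover point like the ~13-degree figure above would be located.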
## Dataset
SemanticKITTI outdoor driving scenes. v2 GT uses anchor-based ICP refinement (50x denser, 5-10x tighter bounding boxes).
## Training Details
| Model | Epochs | LR | Batch Size | GPU |
|---|---|---|---|---|
| Teacher v2GT | 30 | 1e-4 | 2 | RTX 4090 24GB |
| Student v2GT | 15 | 1e-4 | 2 | RTX 4090 24GB |
| VAE v3 | 100 | 3e-4 | 4 | RTX 4090 24GB |
## Citation
Paper submitted to IEEE SMC 2026.