Sonata Scene Completion: Diffusion-Based 3D Scene Completion from LiDAR and RGB

Checkpoints for cross-modal diffusion-based 3D scene completion on SemanticKITTI.

Qualitative Results

Scene Completion: Input vs Teacher vs Student vs Ground Truth

[Images: Sample 1, Sample 2]

Each image shows (left to right): Input (sparse LiDAR scan) | Teacher (LiDAR, CD 0.039) | Student (RGB-only, CD 0.242 mean / 0.119 median) | Ground Truth

VAE v3 Reconstruction

[Image: VAE v3 reconstruction]

Panels: Input (LiDAR scan) | VAE v3 Reconstruction | Ground Truth

Models

v2 GT (ICP-refined ground truth, 50x denser)

| Model | Path | CD | Input | Description |
|---|---|---|---|---|
| Teacher v2GT | teacher_v2gt/best_model.pth | 0.039 +/- 0.009 | LiDAR | Direct diffusion, frozen Sonata/PTv3 encoder (108M) + DenoisingNetwork (8.9M) |
| Student v2GT | student_v2gt/best_model.pth | 0.242 +/- 0.266 | RGB (DA2 pseudo-depth) | Task-loss-only distillation from teacher |

v1 GT (original accumulated ground truth)

| Model | Path | CD | Input | Description |
|---|---|---|---|---|
| Teacher v1GT | teacher_v1gt/best_model.pth | 0.608 +/- 0.142 | LiDAR | Same architecture, trained on v1 GT |
| Student v1GT | student_v1gt/best_model.pth | 0.738 +/- 0.245 | RGB (DA2 pseudo-depth) | Task-loss-only distillation from v1 teacher |

VAE (reconstruction, not scene completion)

| Model | Path | CD | Description |
|---|---|---|---|
| VAE v3 | vae_v3/best_point_vae.pth | 0.120 +/- 0.026 | VecSet-style cross-attention, 32 tokens, 7.1M params |
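
The CD columns above presumably report Chamfer distance between predicted and ground-truth point clouds. The card does not state which variant is used (squared or unsquared distances, sum or average of the two directions), so the following is only a minimal sketch of one common form: the sum of the two mean nearest-neighbor distances. The function name `chamfer_distance` and the toy points are illustrative, and the brute-force loop is meant for small examples, not full scenes.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets.

    a, b: non-empty lists of (x, y, z) tuples.
    Assumed variant: unsquared Euclidean distances, sum of both directions.
    """
    def mean_nearest(src, dst):
        # Average distance from each point in src to its nearest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(a, b) + mean_nearest(b, a)


# Toy example (made-up points, not SemanticKITTI data).
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(pred, gt)
```

Values computed this way are only comparable to the table if the normalization and units match those of the original evaluation, which this card does not specify.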

Architecture

  • Encoder: Frozen Sonata/PTv3 (108M params, pretrained)
  • Denoiser: DenoisingNetwork (8.9M trainable params)
  • Diffusion: 1000 timesteps, cosine schedule, epsilon-prediction
  • Inference: Single-step x0 prediction at t=200
  • Distillation: Task-loss-only (alignment losses are harmful due to cross-modal gradient conflicts)
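
To illustrate the diffusion settings listed above, the sketch below computes a cosine noise schedule over 1000 timesteps and recovers an x0 estimate from an epsilon prediction in a single step at t=200. It assumes the widely used cosine form with offset s = 0.008; the exact schedule variant, and how the noisy input at t=200 is constructed, are not stated in this card. The names `alpha_bar` and `predict_x0` are hypothetical, and the "prediction" here is simply the true noise so that the round trip can be checked.

```python
import math

T = 1000    # number of diffusion timesteps (from the card)
S = 0.008   # schedule offset; an assumption, not confirmed by the card


def alpha_bar(t, total=T, s=S):
    """Cumulative signal fraction under the assumed cosine schedule."""
    def f(u):
        return math.cos(((u / total + s) / (1 + s)) * math.pi / 2) ** 2
    return f(t) / f(0)


def predict_x0(x_t, eps_hat, t):
    """Invert x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps for x0, elementwise."""
    ab = alpha_bar(t)
    return [(x - math.sqrt(1 - ab) * e) / math.sqrt(ab)
            for x, e in zip(x_t, eps_hat)]


# Round trip on toy values: noise a sample to t=200, then recover it.
t = 200
x0 = [0.5, -1.0, 2.0]
eps = [0.1, 0.2, -0.3]
ab = alpha_bar(t)
x_t = [math.sqrt(ab) * x + math.sqrt(1 - ab) * e for x, e in zip(x0, eps)]
x0_hat = predict_x0(x_t, eps, t)  # exact here because eps_hat equals eps
```

In practice `eps_hat` would come from the denoiser, so the quality of the single-step estimate depends on how accurate the noise prediction is at that timestep.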

Key Finding: Gradient Conflicts in Cross-Modal Distillation

Standard multi-loss distillation (task + output matching + feature alignment + structural) fails for cross-modal generative distillation.

Gradient Conflict Analysis

  • Feature alignment gradients conflict with task loss gradients (cosine similarity = -0.023, 58% of batches negative)
  • The structural-loss gradient magnitude is 18x larger than the task-loss gradient, drowning out the task signal
  • Task-loss-only achieves the best student performance
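
The statistics above (mean cosine similarity, fraction of batches with negative similarity, gradient magnitude ratio) can be computed from per-batch gradient vectors roughly as follows. This is a sketch under assumptions: the card does not say which parameters the gradients were taken over or how they were flattened, and the helper names and toy gradients are illustrative, not measured values.

```python
import math


def norm(g):
    """Euclidean norm of a flattened gradient vector."""
    return math.sqrt(sum(v * v for v in g))


def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradients; 0.0 if either is zero."""
    n1, n2 = norm(g1), norm(g2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(g1, g2)) / (n1 * n2)


def conflict_stats(task_grads, aux_grads):
    """Summarize conflict between task and auxiliary gradients per batch."""
    sims = [cosine_similarity(t, a) for t, a in zip(task_grads, aux_grads)]
    ratios = [norm(a) / norm(t) for t, a in zip(task_grads, aux_grads)]
    return {
        "mean_cosine": sum(sims) / len(sims),
        "frac_negative": sum(s < 0 for s in sims) / len(sims),
        "mean_norm_ratio": sum(ratios) / len(ratios),
    }


# Toy gradients for three batches (made up for illustration).
task = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aux = [[-2.0, 0.0], [0.0, 3.0], [1.0, -1.0]]
stats = conflict_stats(task, aux)
```

A mean cosine near zero combined with a large norm ratio suggests the auxiliary loss mostly adds a large component unrelated to the task direction, which is consistent with the drowning effect described above.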

Sensor Degradation Robustness

[Image: Sensor degradation]

The student (RGB-based) is inherently robust to LiDAR sensor degradation because it does not use LiDAR at inference. The teacher degrades under angular sector occlusion, with the teacher-student crossover at ~13 degrees. Under beam dropout, Gaussian noise, and depth estimation noise, the teacher remains robust.
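
Angular sector occlusion can be simulated by dropping every LiDAR point whose azimuth falls inside a blocked wedge. The card does not describe how the occlusion was actually applied, so the sketch below is one plausible formulation; `occlude_sector`, the sensor-frame convention (azimuth measured in the x-y plane from the +x axis), and the toy points are assumptions.

```python
import math


def occlude_sector(points, center_deg, width_deg):
    """Return the points whose azimuth lies outside the occluded sector.

    points: list of (x, y, z) tuples in the sensor frame.
    center_deg: azimuth of the sector center, in degrees.
    width_deg: full angular width of the occluded sector, in degrees.
    """
    half = width_deg / 2.0
    kept = []
    for x, y, z in points:
        azimuth = math.degrees(math.atan2(y, x))
        # Signed angular difference wrapped into [-180, 180).
        diff = (azimuth - center_deg + 180.0) % 360.0 - 180.0
        if abs(diff) > half:
            kept.append((x, y, z))
    return kept


def point_at(azimuth_deg, radius=10.0, z=0.0):
    """Helper for building toy points at a given azimuth."""
    a = math.radians(azimuth_deg)
    return (radius * math.cos(a), radius * math.sin(a), z)


# Toy scan with six points; occlude a 13-degree sector facing forward.
scan = [point_at(a) for a in (0, 5, 10, 90, 180, -175)]
front_blocked = occlude_sector(scan, center_deg=0.0, width_deg=13.0)
rear_blocked = occlude_sector(scan, center_deg=180.0, width_deg=20.0)
```

Sweeping `width_deg` and evaluating both models on the occluded scans would be one way to locate a crossover like the one reported above.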

Dataset

SemanticKITTI outdoor driving scenes. v2 GT uses anchor-based ICP refinement (50x denser, 5-10x tighter bounding boxes).

Training Details

| Model | Epochs | LR | Batch Size | GPU |
|---|---|---|---|---|
| Teacher v2GT | 30 | 1e-4 | 2 | RTX 4090 24GB |
| Student v2GT | 15 | 1e-4 | 2 | RTX 4090 24GB |
| VAE v3 | 100 | 3e-4 | 4 | RTX 4090 24GB |

Citation

Paper submitted to IEEE SMC 2026.
