# Sonata Scene Completion: Diffusion-Based 3D Scene Completion from LiDAR and RGB

Checkpoints for cross-modal diffusion-based 3D scene completion on SemanticKITTI.
## Qualitative Results

### Scene Completion: Input vs Teacher vs Student vs Ground Truth

Each image shows, left to right: Input (sparse LiDAR scan) | Teacher (LiDAR, CD 0.039) | Student (RGB-only, CD 0.242 mean / 0.119 median) | Ground Truth.
### VAE v3 Reconstruction

Panels, left to right: Input (LiDAR scan) | VAE v3 Reconstruction | Ground Truth.
## Models

### v2 GT (ICP-refined ground truth, 50x denser)

| Model | Path | CD | Input | Description |
|---|---|---|---|---|
| Teacher v2GT | teacher_v2gt/best_model.pth | 0.039 +/- 0.009 | LiDAR | Direct diffusion; frozen Sonata/PTv3 encoder (108M) + DenoisingNetwork (8.9M) |
| Student v2GT | student_v2gt/best_model.pth | 0.242 +/- 0.266 | RGB (DA2 pseudo-depth) | Task-loss-only distillation from the teacher |
### v1 GT (original accumulated ground truth)

| Model | Path | CD | Input | Description |
|---|---|---|---|---|
| Teacher v1GT | teacher_v1gt/best_model.pth | 0.608 +/- 0.142 | LiDAR | Same architecture, trained on v1 GT |
| Student v1GT | student_v1gt/best_model.pth | 0.738 +/- 0.245 | RGB (DA2 pseudo-depth) | Task-loss-only distillation from the v1 teacher |
### VAE (reconstruction, not scene completion)

| Model | Path | CD | Description |
|---|---|---|---|
| VAE v3 | vae_v3/best_point_vae.pth | 0.120 +/- 0.026 | VecSet-style cross-attention, 32 tokens, 7.1M params |
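The CD values above are Chamfer distances between predicted and ground-truth point clouds. A minimal NumPy sketch of the symmetric form — whether the reported numbers sum or average the two directions is an assumption, not something this repo states:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Sum of the mean nearest-neighbour Euclidean distances in both directions;
    the exact reduction used for the reported CD numbers is an assumption.
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

For full LiDAR scans, a KD-tree (e.g. `scipy.spatial.cKDTree`) avoids the O(NM) memory of the dense pairwise matrix.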
## Architecture
- Encoder: Frozen Sonata/PTv3 (108M params, pretrained)
- Denoiser: DenoisingNetwork (8.9M trainable params)
- Diffusion: 1000 timesteps, cosine schedule, epsilon-prediction
- Inference: Single-step x0 prediction at t=200
- Distillation: Task-loss-only (alignment losses are harmful due to cross-modal gradient conflicts)
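The inference step above — a single x0 estimate at t=200 under a cosine schedule — can be sketched as follows. The `denoiser` call signature and the conditioning interface are assumptions for illustration, not the repo's actual DenoisingNetwork API:

```python
import torch

def cosine_alpha_bar(T=1000, s=0.008):
    # Cumulative noise schedule alpha_bar(t) for the cosine schedule
    # (Nichol & Dhariwal, 2021), tabulated for t = 0..T.
    t = torch.linspace(0, T, T + 1)
    f = torch.cos((t / T + s) / (1 + s) * torch.pi / 2) ** 2
    return (f / f[0]).clamp(1e-5, 1.0)

@torch.no_grad()
def single_step_x0(denoiser, shape, cond, t=200, T=1000):
    # One-shot x0 estimate: predict epsilon at timestep t and invert the
    # forward process x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps in closed form.
    ab = cosine_alpha_bar(T)[t]
    x_t = torch.randn(shape)  # illustrative: pure noise treated as a t-step sample
    eps = denoiser(x_t, torch.full((shape[0],), t), cond)
    return (x_t - torch.sqrt(1 - ab) * eps) / torch.sqrt(ab)
```

A single step at a moderate t trades the iterative refinement of full ancestral sampling for a large inference speedup.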
## Key Finding: Gradient Conflicts in Cross-Modal Distillation

Standard multi-loss distillation (task + output matching + feature alignment + structural) fails for cross-modal generative distillation:

- Feature-alignment gradients conflict with task-loss gradients (cosine similarity = -0.023, negative in 58% of batches)
- The structural-loss gradient magnitude is 18x that of the task loss, drowning out the task signal
- Task-loss-only distillation achieves the best student performance
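The conflict statistic can be measured per batch by comparing the gradients each loss induces on the shared student parameters. A minimal sketch — `grad_cosine` is a hypothetical helper, not code from this repo:

```python
import torch

def grad_cosine(model, loss_a, loss_b):
    """Cosine similarity between the flattened parameter gradients of two losses.

    Negative values mean the losses pull the shared parameters in conflicting
    directions, as reported for feature alignment vs. the task loss.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    ga = torch.autograd.grad(loss_a, params, retain_graph=True, allow_unused=True)
    gb = torch.autograd.grad(loss_b, params, retain_graph=True, allow_unused=True)

    def flatten(grads):
        # Parameters unused by a loss get zero gradient.
        return torch.cat([
            (g if g is not None else torch.zeros_like(p)).reshape(-1)
            for g, p in zip(grads, params)
        ])

    va, vb = flatten(ga), flatten(gb)
    return torch.dot(va, vb) / (va.norm() * vb.norm() + 1e-12)
```

Tracking this value (and the gradient-norm ratio) across batches is what surfaces conflicts that a single blended loss curve hides.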
## Sensor Degradation Robustness

The student (RGB-based) is inherently robust to LiDAR sensor degradation, since it does not consume LiDAR at inference. The teacher degrades under angular sector occlusion, with a teacher/student crossover at roughly 13 degrees of occlusion; under beam dropout, Gaussian noise, and depth-estimation noise, the teacher remains robust.
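These corruptions can be simulated directly on the input point cloud. A minimal sketch of two of them — the sector geometry and parameter ranges used in the actual evaluation are assumptions:

```python
import numpy as np

def angular_occlusion(points, sector_deg, center_deg=0.0):
    # Remove every return whose azimuth falls inside an occluded sector,
    # mimicking a blocked portion of the LiDAR field of view.
    az = np.degrees(np.arctan2(points[:, 1], points[:, 0]))
    diff = (az - center_deg + 180.0) % 360.0 - 180.0  # wrap to (-180, 180]
    return points[np.abs(diff) > sector_deg / 2.0]

def beam_dropout(points, drop_prob, rng=None):
    # Randomly discard individual returns with probability drop_prob.
    rng = np.random.default_rng(rng)
    return points[rng.random(len(points)) >= drop_prob]
```

Sweeping `sector_deg` and plotting teacher vs. student CD is how a crossover point like the ~13-degree figure above would be located.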
## Dataset
SemanticKITTI outdoor driving scenes. v2 GT uses anchor-based ICP refinement (50x denser, 5-10x tighter bounding boxes).
## Training Details
| Model | Epochs | LR | Batch Size | GPU |
|---|---|---|---|---|
| Teacher v2GT | 30 | 1e-4 | 2 | RTX 4090 24GB |
| Student v2GT | 15 | 1e-4 | 2 | RTX 4090 24GB |
| VAE v3 | 100 | 3e-4 | 4 | RTX 4090 24GB |
## Citation
Paper submitted to IEEE SMC 2026.