Upload folder using huggingface_hub
- README.md +47 -0
- student_v1gt/best_model.pth +3 -0
- student_v2gt/best_model.pth +3 -0
- teacher_v1gt/best_model.pth +3 -0
- teacher_v2gt/best_model.pth +3 -0
README.md
ADDED
@@ -0,0 +1,47 @@
# Sonata Scene Completion: Diffusion-Based 3D Scene Completion from LiDAR and RGB

Checkpoints for cross-modal diffusion-based 3D scene completion on SemanticKITTI.

## Models

### v2 GT (ICP-refined ground truth, 50x denser)

| Model | Path | Chamfer Distance (CD) | Input | Description |
|-------|------|-----------------------|-------|-------------|
| **Teacher v2GT** | `teacher_v2gt/best_model.pth` | **0.039 +/- 0.009** | LiDAR | Direct diffusion, frozen Sonata/PTv3 encoder (108M) + DenoisingNetwork (8.9M) |
| **Student v2GT** | `student_v2gt/best_model.pth` | **0.242 +/- 0.263** | RGB (DA2 pseudo-depth) | Task-loss-only distillation from teacher |

### v1 GT (original accumulated ground truth)

| Model | Path | Chamfer Distance (CD) | Input | Description |
|-------|------|-----------------------|-------|-------------|
| Teacher v1GT | `teacher_v1gt/best_model.pth` | 0.608 +/- 0.141 | LiDAR | Same architecture, trained on v1 GT |
| Student v1GT | `student_v1gt/best_model.pth` | 0.721 +/- 0.167 | RGB (DA2 pseudo-depth) | Task-loss-only distillation from v1 teacher |
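
The CD column in the tables above is taken here to mean Chamfer distance between a completed point cloud and the ground truth. As a rough illustration, the sketch below computes one common variant: the mean nearest-neighbour Euclidean distance in each direction, summed. The README does not say which variant (squared or unsquared, summed or averaged) or which units were used, so the convention and the function name `chamfer_distance` are assumptions.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: mean nearest-neighbour Euclidean distance in each
    direction, summed. This may differ from the variant used for the
    reported numbers.
    """
    if not pred or not target:
        raise ValueError("both point sets must be non-empty")

    def one_sided(src, dst):
        # For every point in src, distance to its closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(pred, target) + one_sided(target, pred)
```

This brute-force version is quadratic in the number of points and is only meant to make the metric concrete; a real evaluation on dense scenes would normally use a spatial index or a GPU implementation.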

## Architecture

- **Encoder**: Frozen Sonata/PTv3 (108M params, pretrained)
- **Denoiser**: DenoisingNetwork (8.9M trainable params)
- **Diffusion**: 1000 timesteps, cosine schedule, epsilon-prediction
- **Inference**: Single-step x0 prediction at t=200
- **Distillation**: Task-loss-only (alignment losses are harmful due to cross-modal gradient conflicts)
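
The diffusion settings listed above (1000 timesteps, cosine schedule, epsilon-prediction, single-step x0 estimate at t=200) can be illustrated with a small sketch. It assumes the commonly used cosine schedule with offset s = 0.008 and the standard DDPM forward process; neither detail is confirmed by this README, and all function names are hypothetical.

```python
import math

T = 1000   # number of diffusion timesteps (from the README)
S = 0.008  # cosine-schedule offset; an assumption, not stated in the README


def alpha_bar(t, num_steps=T, s=S):
    """Cumulative signal level at step t under the assumed cosine schedule."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def add_noise(x0, eps, t):
    """Forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]


def predict_x0(x_t, eps_hat, t):
    """Single-step x0 estimate from a predicted noise eps_hat at step t."""
    a = alpha_bar(t)
    return [(x - math.sqrt(1 - a) * e) / math.sqrt(a)
            for x, e in zip(x_t, eps_hat)]
```

Under this assumed schedule, `alpha_bar(200)` is roughly 0.9, so the input at t=200 is still dominated by signal; that is consistent with a single denoising step being usable at inference, though the README does not give this reasoning itself.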

## Key Finding

Standard multi-loss distillation (task + output matching + feature alignment + structural) fails for cross-modal generative distillation. Feature alignment gradients conflict with task loss gradients (gradient cosine similarity = -0.023, negative in 58% of batches). Task-loss-only distillation achieves the best student performance.
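
The gradient-conflict statistics quoted above can be reproduced in spirit with a short sketch. It assumes that, for each batch, the gradients of the task loss and of the feature alignment loss have been flattened into plain lists of floats; how the authors actually collected them is not described here, and the function names are hypothetical.

```python
import math


def cosine_similarity(g_a, g_b):
    """Cosine similarity of two flattened gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def conflict_stats(grad_pairs):
    """Return (mean cosine similarity, fraction of batches with negative similarity).

    grad_pairs is a non-empty list of (task_gradient, alignment_gradient) tuples,
    one tuple per batch.
    """
    sims = [cosine_similarity(g_task, g_align) for g_task, g_align in grad_pairs]
    mean_sim = sum(sims) / len(sims)
    frac_negative = sum(1 for s in sims if s < 0) / len(sims)
    return mean_sim, frac_negative
```

A mean near zero combined with a majority of negative batches, as reported, would indicate that the two losses pull the shared parameters in opposing directions more often than not.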

## Dataset

SemanticKITTI outdoor driving scenes. v2 GT uses anchor-based ICP refinement (50x denser, 5-10x tighter bounding boxes).

## Training Details

| Model | Epochs | LR | Batch Size | GPU |
|-------|--------|-----|-----------|-----|
| Teacher v2GT | 30 | 1e-4 | 2 | RTX 4090 24GB |
| Student v2GT | 15 | 1e-4 | 2 | RTX 4090 24GB |
| VAE v3 | 100 | 3e-4 | 4 | RTX 4090 24GB |

## Citation

Paper submitted to IEEE SMC 2026.
student_v1gt/best_model.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:10c3a89feace41810f91b984cf25be65794585ab0739c3b78cdda959e355074f
size 541603043
student_v2gt/best_model.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7a9f4eccb57fe4115f34d68978a924de7afae0add52c4d8fb4b24dc9540d1a48
size 541600675
teacher_v1gt/best_model.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:984bb76c64b9742821ab10bdf6bb5e31ce6989dcd12e514133d37219754ad7ec
size 541602211
teacher_v2gt/best_model.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eb599c5222cb80cd964a555677a48ee52359e7706a288f3f6865ebc1221f6c48
size 541602211
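
Each `best_model.pth` entry in this commit is a Git LFS pointer (version, `oid sha256:...`, size) rather than the weights themselves. A downloaded checkpoint could be checked against its pointer with something like the sketch below. The helper names are hypothetical, and the pointer text is the one shown for `teacher_v2gt/best_model.pth`.

```python
import hashlib

# Pointer contents copied from the teacher_v2gt/best_model.pth entry.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:eb599c5222cb80cd964a555677a48ee52359e7706a288f3f6865ebc1221f6c48
size 541602211
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def stream_matches_pointer(stream, pointer, chunk_size=1 << 20):
    """Check a binary stream's byte count and SHA-256 digest against a pointer."""
    if pointer["algo"] != "sha256":
        raise ValueError("only sha256 pointers are handled in this sketch")
    hasher = hashlib.sha256()
    size = 0
    # Read in chunks so a ~540 MB checkpoint is never held in memory at once.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
        size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]
```

A typical use would be to open the downloaded file with `open(path, "rb")` and pass the handle, together with `parse_lfs_pointer(POINTER_TEXT)`, to `stream_matches_pointer`.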