| # LiDAR-Perfect Depth (LPD): code on top of Pixel-Perfect Depth |
|
|
| LiDAR-prompted, score-decomposed diffusion with Kalman-in-the-loop denoising |
| for sparse-prompted depth. Built directly on top of the public PPD codebase: |
| the original PPD modules (DiT, semantics encoder, schedule, sampler, data |
| loaders) are reused unchanged; everything new lives under `ppd/lpd/`. |
|
|
| ## What this adds on top of PPD |
|
|
| | File | Role | |
| |---|---| |
| | `ppd/lpd/sparse_simulator.py` | Random / scan-line / grid / hybrid sparse-LiDAR simulation from dense GT (paper §4.1) | |
| | `ppd/lpd/prompt_encoder.py` | Multi-scale masked-avg-pool sparse-prompt encoder + per-token density signal (paper §3.1) | |
| | `ppd/lpd/prompt_gate.py` | Noise-level-conditioned mixer + sigmoid gate, zero-initialized so an untrained model behaves like PPD (paper §3.1) | |
| | `ppd/lpd/lpd_dit.py` | `LPDDiT(DiT)`: drop-in replacement for `DiT` that injects sparse-prompt tokens at the same midpoint where PPD fuses semantics. Has `freeze_backbone()` so only the prompt branch trains (paper §3.6) | |
| | `ppd/lpd/posterior_projection.py` | Score decomposition / projection step (Eq. 5) | |
| | `ppd/lpd/kalman_in_loop.py` | Algorithm 1: within-denoising Kalman state estimator with monotonically decreasing variance | |
| | `ppd/lpd/temporal_kalman.py` | Per-pixel video Kalman filter: optical-flow warp predict, sparse-LiDAR update, forward-backward occlusion detection (paper §3.4) | |
| | `ppd/lpd/uncertainty_modulation.py` | Uncertainty-guided prompt modulation (Eq. 7) | |
| | `ppd/lpd/losses.py` | Sparse-anchor consistency loss | |
| | `ppd/lpd/lpd_train.py` | Trainer/inferencer mirroring `ppd/models/ppd_train.py` | |
| | `ppd/lpd/lpd_video.py` | Sequence-level inference: chains the temporal Kalman filter between frames, computes RAFT flow on the fly | |
| | `ppd/data/hypersim_lpd.py` | Hypersim adapter for the HarrisonPENG/hypersim mirror (.npy depth, scene/cam_NN/NNNNNN_rgb.png layout) | |
| | `ppd/data/video_clip.py` | Generic video-clip dataset for TartanAir / Bonn | |
| | `ppd/configs/lpd_pretrain.yaml` | 512² Hypersim pretrain (only prompt encoder + gate trainable) | |
| | `ppd/configs/lpd_finetune.yaml` | 1024×768 mixed-dataset fine-tune (Hypersim + UrbanSyn + UnrealStereo4K + VKITTI2 + TartanAir) | |
| | `train_lpd.sh`, `run_lpd_video.py` | Entry-point shell + video-demo CLI | |
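
The "monotonically decreasing variance" noted for the Kalman-in-the-loop estimator in the table above is a property of the standard Kalman measurement update when no process noise is added between updates. The following is a minimal scalar sketch of that property only; the function name `kalman_update` and the numbers are illustrative assumptions and are not taken from `ppd/lpd/kalman_in_loop.py`.

```python
def kalman_update(mean, var, z, R):
    """One scalar Kalman measurement update.

    mean, var: prior estimate and its variance
    z, R:      measurement and its noise variance (R > 0)
    """
    gain = var / (var + R)               # Kalman gain, in [0, 1)
    new_mean = mean + gain * (z - mean)  # pull the estimate toward the measurement
    new_var = (1.0 - gain) * var         # strictly smaller than var when var > 0
    return new_mean, new_var


# Hypothetical example: fuse the same LiDAR reading over five steps.
mean, var = 0.0, 1.0
variances = [var]
for _ in range(5):
    mean, var = kalman_update(mean, var, z=2.0, R=0.1)
    variances.append(var)
```

Each update multiplies the variance by `1 - gain`, which is below 1 whenever `R` is finite, so the sequence in `variances` can only shrink.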
|
|
| ## Trainable footprint |
| With `freeze_backbone=True` (default), training updates only `sparse_prompt_encoder.*` and `prompt_gate.*`, about **16 M / 820 M parameters (≈ 2 %)**. The rest of the DiT stays at its PPD-pretrained values, so training is fast and a single-machine setup is enough for verification. |
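
As a rough illustration of the footprint above, the sketch below selects trainable parameters by name prefix and computes the trainable fraction. It is a dependency-free approximation, not the code in `LPDDiT.freeze_backbone()`: the parameter names and sizes in `param_sizes` are hypothetical and chosen only to match the 16 M / 820 M figures. In PyTorch, the same prefix test would typically decide which parameters keep `requires_grad=True`.

```python
TRAINABLE_PREFIXES = ("sparse_prompt_encoder.", "prompt_gate.")


def is_trainable(param_name):
    """True if the parameter belongs to the prompt branch."""
    return param_name.startswith(TRAINABLE_PREFIXES)


def trainable_fraction(param_sizes):
    """Fraction of parameters (by element count) that would be updated."""
    total = sum(param_sizes.values())
    trainable = sum(n for name, n in param_sizes.items() if is_trainable(name))
    return trainable / total


# Hypothetical parameter inventory (names and sizes are assumptions).
param_sizes = {
    "blocks.0.attn.qkv.weight": 804_000_000,        # frozen PPD backbone
    "sparse_prompt_encoder.conv1.weight": 12_000_000,
    "prompt_gate.mixer.weight": 4_000_000,
}
fraction = trainable_fraction(param_sizes)  # 16 M / 820 M
```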
|
|
| ## Datasets |
|
|
| All paths point to `/mnt/sig/datasets/`, the layout produced by the `_logs/download_*.sh` scripts in that directory: |
|
|
| ``` |
| pretrained/ppd/ppd.pth |
| pretrained/depth_anything_v2/depth_anything_v2_vitl.pth (semantics encoder) |
| aux/raft/raft_large_C_T_SKHT_V2-ff5fadd5.pth (video flow) |
| train/hypersim/extracted/<scene>/cam_NN/<frame>_{rgb,depth}.{png,npy} |
| train/vkitti2/extracted/Scene<NN>/... |
| train/tartanair/extracted/<scene>/{Easy,Hard}/P###/{image,depth}_left/... |
| train/urbansyn/... |
| train/unrealstereo4k/... |
| eval_image/{nyuv2,kitti,eth3d,diode,scannet}/... |
| eval_video/{bonn_rgbd,sintel,arkitscenes}/... |
| ``` |
|
|
| ScanNet requires a signed TOS, so it cannot be downloaded automatically. See |
| `/mnt/sig/datasets/README.md` for the manual workflow. |
|
|
| ## Verified end-to-end |
|
|
| * `LPDDiT` loads PPD weights cleanly (only prompt branches reported missing, |
| no unexpected keys). |
| * Synthetic + real Hypersim batches: forward / backward / optimizer step OK |
| (loss drops 0.011 → 0.006 in 3 steps at 512², peak mem ~ 8 GB on bf16). |
| * `forward_test` runs the Kalman-in-loop sampler end-to-end with stable |
| numerics (depth in `[-0.02, 1.21]` after un-normalize, variance ~ 0.04). |
| * `pytorch_lightning` `Trainer` runs through the full datamodule + module |
| + checkpoint callback wiring (`exp_name=lpd_smoke`). |
|
|
| ## Running |
|
|
| ```bash |
| # Stage 0: env |
| pip install -r requirements.txt |
| ln -sf /mnt/sig/datasets/pretrained/ppd/ppd.pth checkpoints/ppd.pth |
| ln -sf /mnt/sig/datasets/pretrained/depth_anything_v2/depth_anything_v2_vitl.pth \ |
| checkpoints/depth_anything_v2_vitl.pth |
| |
| # Stage 1: image pretrain (Hypersim only, 512²) |
| bash train_lpd.sh |
| |
| # Stage 2: image fine-tune (mixed, 1024×768) |
| python main.py --cfg_file ppd/configs/lpd_finetune.yaml pl_trainer.devices=8 |
| |
| # Video inference demo (Bonn dynamic sequence) |
| python run_lpd_video.py \ |
| --sequence /mnt/sig/datasets/eval_video/bonn_rgbd/rgbd_bonn_balloon \ |
| --weights checkpoints/ppd.pth \ |
| --out outputs/bonn_balloon |
| ``` |
|
|
| ## Notes vs paper |
|
|
| * `R_proj = 0.1`, `proj_alpha = 0.1`: paper §4.4 reports `R_proj = 0.1`; the |
| step-size scale α is left implicit in the paper. 0.1 keeps the projection |
| numerically stable across the schedule (with `α = 1.0` early steps blow up |
| on small-R likelihoods). |
| * `sparse.density = 0.005`: a typical Velodyne 64 sweep density; can be |
| raised for indoor settings with denser ToF. |
| * The prompt-branch parameter count (~16 M) is higher than the paper's |
| ~830 K because we use the full multi-scale CNN encoder. To shrink, drop |
| `prompt_hidden` from 128 to 32 and reduce the number of scales. |
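
To make the `sparse.density = 0.005` setting in the notes above concrete, here is a minimal sketch of uniform random sparse sampling. At 512² this keeps roughly 1.3 K of 262 K pixels. The function `random_sparse_mask` is a hypothetical stand-in, not the API of `ppd/lpd/sparse_simulator.py`, and it covers only the random pattern, not the scan-line, grid, or hybrid ones.

```python
import random


def random_sparse_mask(height, width, density, seed=0):
    """Return the set of (row, col) pixels kept as simulated LiDAR returns."""
    rng = random.Random(seed)
    n_points = round(height * width * density)
    # Sample flat pixel indices without replacement, then unflatten them.
    flat = rng.sample(range(height * width), n_points)
    return {(i // width, i % width) for i in flat}


mask = random_sparse_mask(512, 512, density=0.005)
n_kept = len(mask)  # round(262144 * 0.005) = 1311 pixels
```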
|
|
| ## Differences from PPD source |
|
|
| Only **two** existing PPD files were touched: |
|
|
| 1. `ppd/data/general_datamodule.py:9-17`: `mix_datasets` now allows |
| over-sampling when the requested per-dataset count exceeds the dataset |
| size (needed when running with very small extracted subsets). The default |
| non-oversample path is unchanged. |
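
The over-sampling behaviour described above can be sketched as follows. This illustrates the idea only and is not the actual `mix_datasets` code: the function name `sample_indices`, the `allow_oversample` flag, and the choice of sampling with replacement are assumptions.

```python
import random


def sample_indices(dataset_size, requested, allow_oversample=True, seed=0):
    """Pick `requested` sample indices from a dataset of `dataset_size` items."""
    rng = random.Random(seed)
    if requested <= dataset_size:
        # Default path: draw without replacement, so every index is unique.
        return rng.sample(range(dataset_size), requested)
    if not allow_oversample:
        raise ValueError(
            f"requested {requested} samples but dataset has only {dataset_size}"
        )
    # Over-sampling path: draw with replacement, so indices repeat.
    return [rng.randrange(dataset_size) for _ in range(requested)]


small_subset = sample_indices(dataset_size=10, requested=25)  # repeats allowed
normal_case = sample_indices(dataset_size=10, requested=5)    # unique indices
```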
|
|
| Everything else lives in new files under `ppd/lpd/`, `ppd/data/hypersim_lpd.py`, |
| `ppd/data/video_clip.py`, and `ppd/configs/lpd_*.yaml`. |
|
|