LiDAR-Perfect Depth (LPD) — code on top of Pixel-Perfect Depth

LiDAR-prompted, score-decomposed diffusion with Kalman-in-the-loop denoising for sparse-prompted depth. Built directly on top of the public PPD codebase — the original PPD modules (DiT, semantics encoder, schedule, sampler, data loaders) are reused unchanged; everything new lives under ppd/lpd/.

What this adds on top of PPD

| File | Role |
| --- | --- |
| ppd/lpd/sparse_simulator.py | Random / scan-line / grid / hybrid sparse-LiDAR simulation from dense GT (paper §4.1) |
| ppd/lpd/prompt_encoder.py | Multi-scale masked-avg-pool sparse-prompt encoder + per-token density signal (paper §3.1) |
| ppd/lpd/prompt_gate.py | Noise-level-conditioned mixer + sigmoid gate, zero-initialized so an untrained model behaves like PPD (paper §3.1) |
| ppd/lpd/lpd_dit.py | LPDDiT(DiT): drop-in replacement for DiT that injects sparse-prompt tokens at the same midpoint where PPD fuses semantics. Has freeze_backbone() so only the prompt branch trains (paper §3.6) |
| ppd/lpd/posterior_projection.py | Score decomposition / projection step (Eq. 5) |
| ppd/lpd/kalman_in_loop.py | Algorithm 1: within-denoising Kalman state estimator with monotonically-decreasing variance |
| ppd/lpd/temporal_kalman.py | Per-pixel video Kalman filter: optical-flow warp predict, sparse-LiDAR update, forward-backward occlusion detection (paper §3.4) |
| ppd/lpd/uncertainty_modulation.py | Uncertainty-guided prompt modulation (Eq. 7) |
| ppd/lpd/losses.py | Sparse-anchor consistency loss |
| ppd/lpd/lpd_train.py | Trainer/inferencer mirroring ppd/models/ppd_train.py |
| ppd/lpd/lpd_video.py | Sequence-level inference: chains the temporal Kalman filter between frames, computes RAFT flow on the fly |
| ppd/data/hypersim_lpd.py | Hypersim adapter for the HarrisonPENG/hypersim mirror (.npy depth, scene/cam_NN/NNNNNN_rgb.png layout) |
| ppd/data/video_clip.py | Generic video-clip dataset for TartanAir / Bonn |
| ppd/configs/lpd_pretrain.yaml | 512² Hypersim pretrain (only prompt encoder + gate trainable) |
| ppd/configs/lpd_finetune.yaml | 1024×768 mixed-dataset fine-tune (Hypersim + UrbanSyn + UnrealStereo4K + VKITTI2 + TartanAir) |
| train_lpd.sh, run_lpd_video.py | Entry-point shell + video-demo CLI |
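
To make the sparse-LiDAR simulation step concrete, here is a minimal NumPy sketch of three of the sampling patterns (random, grid, scan-line) applied to a dense ground-truth depth map. The function name `simulate_sparse_lidar`, its arguments, and the way each pattern maps a target density to a mask are assumptions for illustration; the actual logic in ppd/lpd/sparse_simulator.py may differ, and a real scan-line pattern depends on sensor geometry that this sketch ignores.

```python
import numpy as np

def simulate_sparse_lidar(depth, density=0.005, pattern="random", rng=None):
    """Return (sparse_depth, mask) sampled from a dense depth map.

    Hypothetical helper for illustration only; not the repo's API.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = depth.shape
    valid = np.isfinite(depth) & (depth > 0)  # skip holes in the ground truth
    mask = np.zeros((h, w), dtype=bool)

    if pattern == "random":
        # Each pixel is kept independently with probability `density`.
        mask = rng.random((h, w)) < density
    elif pattern == "grid":
        # A regular grid with stride s keeps roughly 1 / s^2 of the pixels.
        stride = max(1, int(round(1.0 / np.sqrt(density))))
        mask[::stride, ::stride] = True
    elif pattern == "scanline":
        # Keep whole rows, evenly spaced, covering about `density` of the rows.
        n_rows = max(1, int(round(density * h)))
        rows = np.linspace(0, h - 1, n_rows).astype(int)
        mask[rows, :] = True
    else:
        raise ValueError(f"unknown pattern: {pattern}")

    mask &= valid
    sparse = np.where(mask, depth, 0.0)  # zero marks "no measurement"
    return sparse, mask
```

Returning the mask alongside the sparse depth lets downstream code distinguish "no measurement" from a true depth of zero.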

Trainable footprint

With freeze_backbone=True (default), training updates only sparse_prompt_encoder.* and prompt_gate.* — about 16 M / 820 M parameters (≈ 2 %). The rest of the DiT stays at its PPD-pretrained values, so training is fast and a single-machine setup is enough for verification.
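
As a rough sketch of what freezing by name prefix can look like, the helper below walks `named_parameters()`, enables gradients only for the two prompt-branch prefixes named above, and returns the trainable and total parameter counts. `freeze_all_but` is a hypothetical name, not the repo's `freeze_backbone()`; it only assumes the model exposes the standard PyTorch `named_parameters()` interface.

```python
TRAINABLE_PREFIXES = ("sparse_prompt_encoder.", "prompt_gate.")

def freeze_all_but(model, prefixes=TRAINABLE_PREFIXES):
    """Enable gradients only for parameters whose name starts with a prefix.

    Hypothetical helper; returns (trainable_count, total_count).
    """
    trainable = total = 0
    for name, param in model.named_parameters():
        # str.startswith accepts a tuple of candidate prefixes.
        param.requires_grad = name.startswith(prefixes)
        n = param.numel()
        total += n
        if param.requires_grad:
            trainable += n
    return trainable, total
```

Calling `trainable, total = freeze_all_but(model)` and printing `trainable / total` gives a quick check against the expected fraction of roughly 2 %.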

Datasets

All paths point to /mnt/sig/datasets/ — the layout produced by the _logs/download_*.sh scripts in that directory:

pretrained/ppd/ppd.pth
pretrained/depth_anything_v2/depth_anything_v2_vitl.pth   (semantics encoder)
aux/raft/raft_large_C_T_SKHT_V2-ff5fadd5.pth              (video flow)
train/hypersim/extracted/<scene>/cam_NN/<frame>_{rgb,depth}.{png,npy}
train/vkitti2/extracted/Scene<NN>/...
train/tartanair/extracted/<scene>/{Easy,Hard}/P###/{image,depth}_left/...
train/urbansyn/...
train/unrealstereo4k/...
eval_image/{nyuv2,kitti,eth3d,diode,scannet}/...
eval_video/{bonn_rgbd,sintel,arkitscenes}/...

ScanNet requires a signed TOS — not auto-downloadable. See /mnt/sig/datasets/README.md for the manual workflow.
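
A quick way to sanity-check an extracted Hypersim tree against the layout above is to pair each RGB frame with its depth file. The sketch below assumes RGB frames end in `_rgb.png` and depth maps in `_depth.npy`, as the layout suggests; `list_hypersim_pairs` is a hypothetical helper, not part of ppd/data/hypersim_lpd.py.

```python
from pathlib import Path

def list_hypersim_pairs(root):
    """Return sorted (rgb_path, depth_path) pairs under <root>/<scene>/cam_NN/.

    Hypothetical helper; frames without a matching depth file are skipped.
    """
    root = Path(root)
    suffix = "_rgb.png"
    pairs = []
    for rgb in sorted(root.glob("*/cam_*/*" + suffix)):
        frame_id = rgb.name[: -len(suffix)]
        depth = rgb.with_name(frame_id + "_depth.npy")
        if depth.exists():
            pairs.append((rgb, depth))
    return pairs
```

For example, `len(list_hypersim_pairs("/mnt/sig/datasets/train/hypersim/extracted"))` reports how many usable frames were extracted.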

Verified end-to-end

  • LPDDiT loads PPD weights cleanly (only prompt branches reported missing, no unexpected keys).
  • Synthetic + real Hypersim batches → forward / backward / optimizer step OK (loss drops 0.011 → 0.006 in 3 steps at 512², peak mem ~ 8 GB on bf16).
  • forward_test runs the Kalman-in-loop sampler end-to-end with stable numerics (depth in [-0.02, 1.21] after un-normalize, variance ~ 0.04).
  • pytorch_lightning Trainer runs through the full datamodule + module + checkpoint callback wiring (exp_name=lpd_smoke).
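
The first check in the list can be automated. In PyTorch, `load_state_dict(..., strict=False)` returns a result with `missing_keys` and `unexpected_keys`; the sketch below asserts that every missing key belongs to the prompt branch and that nothing unexpected was found. `check_load_report` is a hypothetical helper, and the prefixes are taken from the parameter names mentioned in the trainable-footprint section.

```python
def check_load_report(missing_keys, unexpected_keys,
                      new_prefixes=("sparse_prompt_encoder.", "prompt_gate.")):
    """Raise if PPD weights did not load cleanly into the extended model.

    Hypothetical helper; returns the number of (expected) missing keys.
    """
    # Missing keys are fine only if they belong to the newly added branch.
    stray = [k for k in missing_keys if not k.startswith(new_prefixes)]
    if stray or unexpected_keys:
        raise RuntimeError(
            f"unclean load: stray missing={stray}, "
            f"unexpected={list(unexpected_keys)}"
        )
    return len(missing_keys)

# Typical use with a PyTorch module (not executed here):
#   report = model.load_state_dict(state_dict, strict=False)
#   check_load_report(report.missing_keys, report.unexpected_keys)
```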

Running

# Stage 0: env
pip install -r requirements.txt
ln -sf /mnt/sig/datasets/pretrained/ppd/ppd.pth checkpoints/ppd.pth
ln -sf /mnt/sig/datasets/pretrained/depth_anything_v2/depth_anything_v2_vitl.pth \
       checkpoints/depth_anything_v2_vitl.pth

# Stage 1: image pretrain (Hypersim only, 512²)
bash train_lpd.sh

# Stage 2: image fine-tune (mixed, 1024×768)
python main.py --cfg_file ppd/configs/lpd_finetune.yaml pl_trainer.devices=8

# Video inference demo (Bonn dynamic sequence)
python run_lpd_video.py \
    --sequence /mnt/sig/datasets/eval_video/bonn_rgbd/rgbd_bonn_balloon \
    --weights checkpoints/ppd.pth \
    --out outputs/bonn_balloon

Notes vs paper

  • R_proj = 0.1, proj_alpha = 0.1: paper §4.4 reports R_proj = 0.1 but leaves the step-size scale α implicit. We set proj_alpha = 0.1, which keeps the projection numerically stable across the schedule; with α = 1.0, early steps blow up on small-R likelihoods.
  • sparse.density = 0.005 ≈ a typical Velodyne 64 sweep density; can be raised for indoor settings with denser ToF.
  • The prompt-branch parameter count (~16 M) is higher than the paper's ~830 K because we use the full multi-scale CNN encoder. To shrink, drop prompt_hidden from 128 → 32 and reduce the number of scales.
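
To illustrate why the step-size scale matters when R is small, here is a toy example. It assumes the projection behaves like a gradient step on a Gaussian log-likelihood with variance R at the observed pixels; this is a simplification for intuition, not the paper's Eq. 5 or the code in ppd/lpd/posterior_projection.py, and `likelihood_step` is a hypothetical name.

```python
import numpy as np

def likelihood_step(x, z, mask, R=0.1, alpha=0.1):
    """One gradient step on log N(z | x, R) at observed pixels (toy model).

    The gradient with respect to x is (z - x) / R, so each step scales the
    residual (z - x) by (1 - alpha / R). The iteration is stable only when
    0 < alpha / R < 2.
    """
    return x + alpha * mask * (z - x) / R

x0 = np.zeros(4)                       # current depth estimate
z = np.ones(4)                         # sparse measurements
mask = np.array([1.0, 0.0, 1.0, 0.0])  # 1 where a measurement exists

x_small = likelihood_step(x0, z, mask, R=0.1, alpha=0.1)  # alpha / R = 1
x_large = likelihood_step(x0, z, mask, R=0.1, alpha=1.0)  # alpha / R = 10
```

Under this toy model, α = 0.1 with R = 0.1 moves observed pixels onto the measurement in one step, while α = 1.0 overshoots by a factor of ten and the residual grows on every further step.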

Differences from PPD source

Only two existing PPD files were touched:

  1. ppd/data/general_datamodule.py (lines 9-17): mix_datasets now allows over-sampling when the requested per-dataset count exceeds the dataset size (needed when running with very small extracted subsets). The default non-oversample path is unchanged.
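
The over-sampling behaviour can be pictured with the sketch below: when the requested count fits, indices are drawn without replacement; when it exceeds the dataset size, every index is repeated and a random remainder tops up the list. `sample_indices` and its `allow_oversample` flag are hypothetical and are not the actual mix_datasets code.

```python
import random

def sample_indices(dataset_size, requested, rng=None, allow_oversample=True):
    """Pick `requested` indices from a dataset of `dataset_size` items.

    Hypothetical sketch of the over-sampling idea, not the repo's code.
    """
    if rng is None:
        rng = random.Random()
    if requested <= dataset_size:
        # Regular path: sample without replacement.
        return rng.sample(range(dataset_size), requested)
    if not allow_oversample:
        raise ValueError("requested count exceeds dataset size")
    # Over-sampling path: repeat the full index range, then add a remainder.
    reps, rem = divmod(requested, dataset_size)
    indices = list(range(dataset_size)) * reps
    indices += rng.sample(range(dataset_size), rem)
    rng.shuffle(indices)
    return indices
```

Repeating the full range before drawing the remainder keeps per-sample counts within one of each other, which avoids a tiny subset being dominated by a few frames.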

Everything else lives in new files under ppd/lpd/, ppd/data/hypersim_lpd.py, ppd/data/video_clip.py, and ppd/configs/lpd_*.yaml.