	# LiDAR-Perfect Depth (LPD) — code on top of Pixel-Perfect Depth

	LiDAR-prompted, score-decomposed diffusion with Kalman-in-the-loop denoising
	for sparse-prompted depth. Built directly on top of the public PPD codebase:
	the original PPD modules (DiT, semantics encoder, schedule, sampler, data
	loaders) are reused almost unchanged (see "Differences from PPD source"). The
	new model code lives under `ppd/lpd/`, with dataset adapters and configs added
	under `ppd/data/` and `ppd/configs/`.

	## What this adds on top of PPD

	| File | Role |
	|---|---|
	| `ppd/lpd/sparse_simulator.py` | Random / scan-line / grid / hybrid sparse-LiDAR simulation from dense GT (paper §4.1) |
	| `ppd/lpd/prompt_encoder.py` | Multi-scale masked-avg-pool sparse-prompt encoder + per-token density signal (paper §3.1) |
	| `ppd/lpd/prompt_gate.py` | Noise-level-conditioned mixer + sigmoid gate, zero-initialized so an untrained model behaves like PPD (paper §3.1) |
	| `ppd/lpd/lpd_dit.py` | `LPDDiT(DiT)`: drop-in replacement for `DiT` that injects sparse-prompt tokens at the same midpoint where PPD fuses semantics. Has `freeze_backbone()` so only the prompt branch trains (paper §3.6) |
	| `ppd/lpd/posterior_projection.py` | Score decomposition / projection step (Eq. 5) |
	| `ppd/lpd/kalman_in_loop.py` | Algorithm 1: within-denoising Kalman state estimator with monotonically decreasing variance |
	| `ppd/lpd/temporal_kalman.py` | Per-pixel video Kalman filter: optical-flow warp predict, sparse-LiDAR update, forward-backward occlusion detection (paper §3.4) |
	| `ppd/lpd/uncertainty_modulation.py` | Uncertainty-guided prompt modulation (Eq. 7) |
	| `ppd/lpd/losses.py` | Sparse-anchor consistency loss |
	| `ppd/lpd/lpd_train.py` | Trainer/inferencer mirroring `ppd/models/ppd_train.py` |
	| `ppd/lpd/lpd_video.py` | Sequence-level inference: chains the temporal Kalman filter between frames, computes RAFT flow on the fly |
	| `ppd/data/hypersim_lpd.py` | Hypersim adapter for the HarrisonPENG/hypersim mirror (`.npy` depth, `scene/cam_NN/NNNNNN_rgb.png` layout) |
	| `ppd/data/video_clip.py` | Generic video-clip dataset for TartanAir / Bonn |
	| `ppd/configs/lpd_pretrain.yaml` | 512² Hypersim pretrain (only prompt encoder + gate trainable) |
	| `ppd/configs/lpd_finetune.yaml` | 1024×768 mixed-dataset fine-tune (Hypersim + UrbanSyn + UnrealStereo4K + VKITTI2 + TartanAir) |
	| `train_lpd.sh`, `run_lpd_video.py` | Entry-point shell + video-demo CLI |
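
The "monotonically decreasing variance" property listed for `kalman_in_loop.py` is what the standard scalar Kalman measurement update gives when no process noise is added. The sketch below illustrates that update for a single pixel. It is not the repository's Algorithm 1: the function name `kalman_update`, the per-step depth estimates, and their noise variances are all hypothetical values chosen for illustration.

```python
def kalman_update(mean, var, meas, meas_var):
    """Fuse one measurement into a scalar Gaussian state (standard Kalman update)."""
    gain = var / (var + meas_var)           # Kalman gain, strictly between 0 and 1
    new_mean = mean + gain * (meas - mean)  # pull the estimate toward the measurement
    new_var = (1.0 - gain) * var            # equals var * meas_var / (var + meas_var)
    return new_mean, new_var


# Hypothetical per-step depth estimates for one pixel, each paired with a
# hypothetical noise variance that shrinks as denoising progresses.
steps = [(1.30, 0.50), (1.18, 0.20), (1.22, 0.10), (1.20, 0.05)]

mean, var = 0.0, 1.0                        # broad prior before the first step
variances = [var]
for meas, meas_var in steps:
    mean, var = kalman_update(mean, var, meas, meas_var)
    variances.append(var)

print("posterior mean:", round(mean, 4))
print("variance per step:", [round(v, 4) for v in variances])
```

Because `new_var = var * meas_var / (var + meas_var)` is smaller than both inputs, the variance can only shrink from one step to the next under these assumptions.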

	## Trainable footprint
	With `freeze_backbone=True` (default), training updates only the parameters under the `sparse_prompt_encoder.` and `prompt_gate.` prefixes: about 16 M of 820 M parameters (≈ 2 %). The rest of the DiT stays at its PPD-pretrained values, so training is fast and a single machine is enough for verification.
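
Prefix-based freezing of this kind can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the body of `freeze_backbone()`: the helper `freeze_all_but` and the `ToyModel` stand-in are hypothetical, and only the two prefixes are taken from the text above.

```python
from torch import nn

TRAINABLE_PREFIXES = ("sparse_prompt_encoder.", "prompt_gate.")


def freeze_all_but(model, prefixes=TRAINABLE_PREFIXES):
    """Freeze every parameter whose name does not start with one of `prefixes`."""
    trainable = total = 0
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable, total


class ToyModel(nn.Module):
    """Tiny stand-in: one 'backbone' layer plus the two prompt-branch modules."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(64, 64)
        self.sparse_prompt_encoder = nn.Linear(4, 8)
        self.prompt_gate = nn.Linear(8, 1)


model = ToyModel()
trainable, total = freeze_all_but(model)
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.1f} %)")
```

With a setup like this, the optimizer would typically receive only the parameters that still have `requires_grad=True`.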

	## Datasets

	All paths are relative to `/mnt/sig/datasets/`, whose layout is produced by the `_logs/download_*.sh` scripts in that directory:

	```
	pretrained/ppd/ppd.pth
	pretrained/depth_anything_v2/depth_anything_v2_vitl.pth (semantics encoder)
	aux/raft/raft_large_C_T_SKHT_V2-ff5fadd5.pth (video flow)
	train/hypersim/extracted/<scene>/cam_NN/<frame>_{rgb,depth}.{png,npy}
	train/vkitti2/extracted/Scene<NN>/...
	train/tartanair/extracted/<scene>/{Easy,Hard}/P###/{image,depth}_left/...
	train/urbansyn/...
	train/unrealstereo4k/...
	eval_image/{nyuv2,kitti,eth3d,diode,scannet}/...
	eval_video/{bonn_rgbd,sintel,arkitscenes}/...
	```

	ScanNet requires a signed terms-of-use agreement, so it cannot be downloaded
	automatically. See `/mnt/sig/datasets/README.md` for the manual workflow.
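
To sanity-check an extracted Hypersim tree against the `<scene>/cam_NN/<frame>_{rgb,depth}.{png,npy}` layout, a small stdlib-only helper along these lines may be useful. It is a sketch that assumes each frame is stored as `<frame>_rgb.png` next to `<frame>_depth.npy`; `list_hypersim_pairs` is a hypothetical name and is not part of `ppd/data/hypersim_lpd.py`.

```python
from pathlib import Path


def list_hypersim_pairs(root):
    """Return sorted (rgb_png, depth_npy) path pairs found under `root`."""
    pairs = []
    for rgb in sorted(Path(root).glob("*/cam_*/*_rgb.png")):
        depth = rgb.with_name(rgb.name.replace("_rgb.png", "_depth.npy"))
        if depth.exists():                  # skip frames whose depth file is missing
            pairs.append((rgb, depth))
    return pairs
```

Calling `list_hypersim_pairs("/mnt/sig/datasets/train/hypersim/extracted")` and checking the length of the result gives a quick count of usable frames before training starts.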

	## Verified end-to-end

	* `LPDDiT` loads PPD weights cleanly (only prompt branches reported missing,
	no unexpected keys).
	* Synthetic and real Hypersim batches pass forward / backward / optimizer step
	(loss drops 0.011 → 0.006 in 3 steps at 512², peak memory ~8 GB in bf16).
	* `forward_test` runs the Kalman-in-loop sampler end-to-end with stable
	numerics (depth in `[-0.02, 1.21]` after un-normalization, variance ~0.04).
	* `pytorch_lightning` `Trainer` runs through the full datamodule + module
	+ checkpoint callback wiring (`exp_name=lpd_smoke`).

	## Running

	```bash
	# Stage 0: env
	pip install -r requirements.txt
	ln -sf /mnt/sig/datasets/pretrained/ppd/ppd.pth checkpoints/ppd.pth
	ln -sf /mnt/sig/datasets/pretrained/depth_anything_v2/depth_anything_v2_vitl.pth \
	checkpoints/depth_anything_v2_vitl.pth

	# Stage 1: image pretrain (Hypersim only, 512²)
	bash train_lpd.sh

	# Stage 2: image fine-tune (mixed, 1024×768)
	python main.py --cfg_file ppd/configs/lpd_finetune.yaml pl_trainer.devices=8

	# Video inference demo (Bonn dynamic sequence)
	python run_lpd_video.py \
	--sequence /mnt/sig/datasets/eval_video/bonn_rgbd/rgbd_bonn_balloon \
	--weights checkpoints/ppd.pth \
	--out outputs/bonn_balloon
	```

	## Notes vs paper

	* `R_proj = 0.1`, `proj_alpha = 0.1`: paper §4.4 reports `R_proj = 0.1` but
	leaves the step-size scale α implicit. Setting `proj_alpha = 0.1` keeps the
	projection numerically stable across the schedule; with `α = 1.0`, early
	steps blow up on small-R likelihoods.
	* `sparse.density = 0.005` ≈ a typical Velodyne 64 sweep density; can be
	raised for indoor settings with denser ToF.
	* The prompt-branch parameter count (~16 M) is higher than the paper's
	~830 K because we use the full multi-scale CNN encoder. To shrink it, reduce
	`prompt_hidden` from 128 to 32 and use fewer scales.
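
A toy calculation suggests why `α = 1.0` can be unstable with `R_proj = 0.1`. The sketch below assumes the projection behaves like a plain gradient step on a Gaussian log-likelihood at a single LiDAR pixel, `x ← x + α (z − x) / R`. The actual Eq. 5 may weight the step by the noise level, so this is a simplified model rather than the code in `posterior_projection.py`, and `projection_residuals` is an illustrative name.

```python
def projection_residuals(alpha, r_proj, x0=0.0, z=1.0, n_steps=5):
    """Track |z - x| under repeated steps x <- x + alpha * (z - x) / r_proj."""
    x = x0
    residuals = [abs(z - x)]
    for _ in range(n_steps):
        # Gradient of the Gaussian log-likelihood w.r.t. x is (z - x) / r_proj.
        x = x + alpha * (z - x) / r_proj
        residuals.append(abs(z - x))
    return residuals


stable = projection_residuals(alpha=0.1, r_proj=0.1)    # effective step alpha / R = 1
unstable = projection_residuals(alpha=1.0, r_proj=0.1)  # effective step alpha / R = 10

print("alpha=0.1:", stable)
print("alpha=1.0:", unstable)
```

In this toy model each step multiplies the residual by `|1 - α/R|`, so the residual shrinks only when `α/R < 2`. With `R = 0.1`, that holds for `α = 0.1` but not for `α = 1.0`.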

	## Differences from PPD source

	Changes to existing PPD files are minimal:

	1. `ppd/data/general_datamodule.py:9-17` — `mix_datasets` now allows
	over-sampling when the requested per-dataset count exceeds the dataset
	size (needed when running with very small extracted subsets). The default
	non-oversample path is unchanged.

	Everything else lives in new files: `ppd/lpd/`, `ppd/data/hypersim_lpd.py`,
	`ppd/data/video_clip.py`, `ppd/configs/lpd_*.yaml`, and the `train_lpd.sh` /
	`run_lpd_video.py` entry points.
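
The over-sampling behaviour described for `mix_datasets` can be pictured with the following stdlib-only sketch. It is a guess at the general idea, not the code in `ppd/data/general_datamodule.py`: `sample_indices` is a hypothetical helper, and the "repeat the full set, then top up randomly" strategy is an assumption.

```python
import random


def sample_indices(dataset_size, requested, rng=None):
    """Pick `requested` indices from a dataset holding `dataset_size` items."""
    rng = rng or random.Random(0)
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    if requested <= dataset_size:
        # Default path: draw without replacement, so no index repeats.
        return rng.sample(range(dataset_size), requested)
    # Over-sampling path: repeat the full index set, then top up with a random remainder.
    repeats, remainder = divmod(requested, dataset_size)
    indices = list(range(dataset_size)) * repeats
    indices += rng.sample(range(dataset_size), remainder)
    return indices


print(sample_indices(10, 4))   # subset of a large enough dataset
print(sample_indices(3, 8))    # tiny dataset, indices repeat
```

Repeating the full index set before topping up keeps every sample represented at least once, which matters most for the very small extracted subsets mentioned above.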