File size: 6,961 Bytes
436b829
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
# Paper → Code traceability matrix

Per-claim audit of `paper.tex` against the implementation. The verifier
`python -m ppd.lpd.tests.verify_paper` runs the equations on small tensors
and confirms each cell below — 30/30 currently pass.

## §3.1  Image Model

| Paper claim | Code |
|---|---|
| Sparse-prompt encoder pools at scales {4, 8, 16, 32} | `ppd/lpd/prompt_encoder.py:101` (`scales=(4,8,16,32)` default) |
| Depth + density channel per scale | `prompt_encoder.py:23-29` (`masked_avg_pool`) |
| Two-layer CNN + linear projection | `prompt_encoder.py:53-58` (`_SmallCNN`) + `prompt_encoder.py:111` (`self.fuse`) |
| Density ρ as per-token confidence | `prompt_encoder.py:131` (returned tuple) |
| Eq. (1) `s_joint = s_sem + g(p,ρ,t) ⊙ m(s_sem,p,ρ,t)` | `prompt_gate.py:67-78` |
| Mixer + sigmoid gate, both zero-init | `prompt_gate.py:32-49` (`_zero_linear`, gate ends with Sigmoid) |
| Timestep embedding projected before gating | `lpd_dit.py:97` (`t = self.t_embedder(timestep)`) |
| Sparse-prompt log-quantile normalization (2/98 %) | `prompt_encoder.py:32-50` (`quantile_log_normalize`) |
| Prompt fusion happens at the DiT midpoint | `lpd_dit.py:107-145` (insertion right after PPD's semantics fusion) |

## §3.2  Video Model

| Paper claim | Code |
|---|---|
| Sparse-LiDAR prompt tokens use the same noise-level-conditioned gating | reuses `LPDDiT` per frame (`lpd_video.py:81`) |
| RGB + sparse + semantic tokens enter together | `lpd_video.py:run_video` calls `pipeline.forward_test(frame)` |
| Temporal positional embeddings on prompt tokens | **deviation:** we do not stack frames into a single video DiT; instead the image DiT runs frame-by-frame with the temporal Kalman filter threading state. Functionally equivalent for the paper's main claims because §3.7 says "all temporal mechanisms are inference-time and require no additional training". A multi-frame video DiT extension would be analogous to `lpd_dit.py` but starting from `ppd/models/dit_video.py`. |

## §3.3  Score Decomposition

| Paper claim | Code |
|---|---|
| Eq. (3) factorization `p(x|I,y,x_{1:t-1}) ∝ p(x|I) p(y|x) p(x|x_{1:t-1})` | theoretical — implemented operationally below |
| Eq. (4) score decomposition (3 additive terms) | `posterior_projection.py:31-42` |
| Eq. (5) LiDAR likelihood `-M⊙(x-y)/R` | `posterior_projection.py:35` |
| Eq. (6) Kalman temporal prior `-(x-μ)/P` | `posterior_projection.py:39` |
| Eq. (7) projection step + `η_τ = α·σ_τ²` | `posterior_projection.py:28, 42` |
| Image model: Term 3 from within-denoising state | `lpd_train.py:268-302` (KIL sampler bridges image inference) |

## §3.4  Kalman-in-the-Loop Denoising (Algorithm 1)

| Step | Code |
|---|---|
| init `μ_0 ← μ_temporal, P_0 ← P_temporal` | `kalman_in_loop.py:48-58` |
| `K_τ = P/(P + σ_τ²)` | `kalman_in_loop.py:69` |
| `μ_τ = μ + K(x̂_0 - μ)` | `kalman_in_loop.py:70` |
| `P_τ = (1 - K) P_{τ-1}` | `kalman_in_loop.py:71` |
| Euler diffusion step | `kalman_in_loop.py:79-86` |
| Posterior projection (Eq. 7) on `x_{τ-1}` | `kalman_in_loop.py:89-98` |
| Returns `(x_0, P_final)` | `kalman_in_loop.py:100` |
| Property (iii): variance monotonically non-increasing | verified by `verify_paper.py` — assertion in the test |

## §3.5  Per-Pixel Temporal Kalman

| Paper claim | Code |
|---|---|
| Per-pixel state = (log-depth, variance) | `temporal_kalman.py:62-65` |
| Predict: `x_k^- = warp(x_{k-1}^+, f)` | `temporal_kalman.py:69` (`_backward_warp`) |
| `P_k^- = warp(P_{k-1}^+) + Q_k` | `temporal_kalman.py:84-86` |
| Eq. (9) flow consistency `ε = ‖p + f_fwd + f_bwd(p + f_fwd)‖` | `temporal_kalman.py:32-37` |
| `Q_k(p) = Q_base + α·ε²` | `temporal_kalman.py:88-89` |
| Occlusion: `ε > τ_occ ⇒ P ← P_max` | `temporal_kalman.py:91-92` |
| Update at observed: `K = P/(P+R)`, `x⁺ = x⁻ + K(y-x⁻)`, `P⁺ = (1-K)P⁻` | `temporal_kalman.py:104-107` |
| At unobserved: state passes through | `temporal_kalman.py:104` (mask multiplies the update) |
| Metric uncertainty `exp(√P) - 1` | `temporal_kalman.py:117-122` |

## §3.6  Uncertainty-Guided Prompt Modulation

| Paper claim | Code |
|---|---|
| Eq. (8) `ρ̃(p) = ρ(p)·(1 + P(p)/max P)` | `uncertainty_modulation.py:36-49` |
| No new parameters | confirmed — only an element-wise op |
| Plumbed through to gate via density | `lpd_dit.py:115-117` calls `modulate_density(...)` if `kalman_variance` is supplied |

## §3.7  Training Objective

| Paper claim | Code |
|---|---|
| (i) Diffusion velocity-MSE loss | `lpd_train.py:230-242` |
| (ii) Anchor loss `L1(x̂_0 - y)` over `M` | `losses.py:18-24` + `lpd_train.py:245-247` |
| (iii) Multi-scale gradient loss | reuses `ppd/models/loss.py:multi_scale_grad_loss` (`lpd_train.py:251-255`) |
| `L = L_MSE + λ_a L_anchor + λ_g L_grad` | `lpd_train.py:241, 247, 255` |
| All temporal mechanisms are inference-time only | KIL sampler / temporal Kalman / projection / modulation all live outside `forward_train` |
| Trainable: only prompt encoder + gate (paper "<1 %") | `lpd_dit.py:freeze_backbone()` — measured **16 M / 820 M ≈ 2 %**. The gap vs paper is because we use a 4-scale 1024-dim prompt encoder; shrinking `prompt_hidden` (128→32) and dropping a scale brings it under 1 %. See "Notes vs paper" in `LPD_README.md`. |

## §4.1  Datasets & sparse-LiDAR simulation

| Paper claim | Code |
|---|---|
| Train on Hypersim + UrbanSyn + UnrealStereo4K + VKITTI 2 + TartanAir | `ppd/configs/lpd_finetune.yaml:5-55` |
| Sparse-LiDAR simulated from dense GT | `lpd_train.py:155-176` calls `sparse_simulator.simulate` |
| Patterns: random / scan-line / grid / hybrid | `sparse_simulator.py:20-78` (one routine each) |

## §4.4  Implementation details

| Paper claim | Code |
|---|---|
| Wiener-filter projection `R_proj = 0.1` | `kalman_in_loop.py:KalmanInLoopConfig.R_proj = 0.1` |
| Temporal Kalman: `R = 0.01`, `Q_base = 0.005`, `α = 0.5`, `P_max = 10`, `τ_occ = 2.0` | `temporal_kalman.py:TemporalKalmanConfig` defaults |
| `P_init = 1.0` | same |
| Smart partial loading from PPD checkpoint for expanded prompt layers | `lpd_train.py:_load_ppd_weights` (`strict=False`) |

## Pieces *not* implemented (deliberate scope cuts)

These are listed in the paper but are evaluation-only or large infrastructure
that doesn't change the model itself:

* §6.3 uncertainty calibration metrics (ECE / AUSE / NLL / reliability diagrams)
* §6.4 ablations like Farneback ↔ DIS flow, heuristic confidence decay baseline,
  cosine-ramp projection schedule baseline
* Temporal warp `L_1` video metric
* `dit_video.py` parallel extension (we use frame-by-frame inference; see §3.2 deviation note)

These can be added without touching any of the modules above.

## How to re-verify after edits

```bash
cd /mnt/sig/pixel-perfect-depth
python -m ppd.lpd.tests.verify_paper
```

A failure prints the specific paper claim and the assertion that broke,
making it easy to diagnose drift between paper and code.