Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them
Abstract
PhaseLock is a training-free framework that improves physical consistency in image-to-video diffusion models by preserving motion priors from early-step inference throughout the denoising process.
Image-to-video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising: the phase degrades significantly (dropping by approximately 18% from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead (1.06× time, 1.02× memory) and reduced reliance on expensive external guidance methods (~5× time).
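The abstract's distinction between phase and magnitude can be illustrated with a toy example. The sketch below is not the paper's spectral analysis, which the abstract does not specify. It is a 1-D illustration of the general Fourier shift property: delaying a signal in time leaves its magnitude spectrum unchanged and alters only its phase, which is why phase is a natural place to look for motion information. The names `dft`, `magnitude_and_phase`, and `phase_agreement` are hypothetical helpers, and the cosine-based agreement score is an assumption chosen for simplicity.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real 1-D sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def magnitude_and_phase(signal):
    """Split the spectrum into per-frequency magnitude and phase (radians)."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def phase_agreement(signal_a, signal_b, eps=1e-9):
    """Mean cosine of the per-frequency phase difference (hypothetical score).

    1.0 means identical phase; bins with near-zero energy are skipped
    because their phase is numerically meaningless.
    """
    scores = [
        math.cos(cmath.phase(a) - cmath.phase(b))
        for a, b in zip(dft(signal_a), dft(signal_b))
        if abs(a) > eps and abs(b) > eps
    ]
    return sum(scores) / len(scores)


# Toy "trajectory": a brightness pulse observed over 8 frames.
frames = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
# The same pulse, circularly delayed by 2 frames (different timing, same shape).
shifted = frames[-2:] + frames[:-2]

mag_a, _ = magnitude_and_phase(frames)
mag_b, _ = magnitude_and_phase(shifted)

# Largest per-frequency magnitude difference: zero up to rounding error.
max_mag_gap = max(abs(a - b) for a, b in zip(mag_a, mag_b))

print("max magnitude gap:", max_mag_gap)
print("phase agreement (same signal):", phase_agreement(frames, frames))
print("phase agreement (delayed signal):", phase_agreement(frames, shifted))
```

In this toy case the two signals have the same magnitude spectrum while their phase agreement falls well below 1, so a magnitude-only comparison would miss the change in timing entirely.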
Community
TL;DR. A 2-step generation often has better physics than the full 50-step output. We trace this to phase erosion during denoising, and introduce PhaseLock — a training-free framework that locks the early motion prior into the final high-fidelity output via Latent Delta Guidance. +6.2 pts physical consistency, 1.06× time, 1.02× memory.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Φ-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation (2026)
- Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation (2026)
- TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion (2026)
- LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation (2026)
- PhyCo: Learning Controllable Physical Priors for Generative Motion (2026)
- GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation (2026)
- FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction (2026)
This is a cool finding. It’s counterintuitive that a 2-step generation would actually be more physically consistent than a full 50-step run, but the phase erosion point makes a lot of sense. I like that you can just lock in those priors without needing extra training or heavy guidance.
How much does the visual fidelity actually drop when you enforce those early motion priors? The abstract mentions it is mostly maintained, but I’m curious if there's a visible trade-off.
I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/4fbffe7c-1143-4b60-be39-530a988015a2
Get this paper in your agent:
hf papers read 2606.06361
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash