arxiv:2601.03665

PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance

Published on Jan 7

Authors:

Abstract

A video generation framework integrates learnable physics priors through a predictor network that extracts physical features from diffusion latents and injects them into a transformer-based generator via cross-attention mechanisms.

AI-generated summary

Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics, resulting in artifacts such as unnatural object collisions, inconsistent gravity, and temporal flickering. In this work, we propose PhysVideoGenerator, a proof-of-concept framework that explicitly embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2) directly from noisy diffusion latents. These predicted physics tokens are injected into the temporal attention layers of a DiT-based generator (Latte) via a dedicated cross-attention mechanism. Our primary contribution is demonstrating the technical feasibility of this joint training paradigm: we show that diffusion latents contain sufficient information to recover V-JEPA 2 physical representations, and that multi-task optimization remains stable over training. This report documents the architectural design, technical challenges, and validation of training stability, establishing a foundation for future large-scale evaluation of physics-aware generative models.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.03665 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.03665 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.03665 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.