---
language:
- en
license: apache-2.0
tags:
- text-to-video
- lora
- physics
- cogvideox
- diffusion
- peft
- warp
- rigid-body
- fine-tuned
base_model: THUDM/CogVideoX-2b
pipeline_tag: text-to-video
library_name: diffusers
---

# PhysicsDrivenWorld (PDW)
### Physics-Corrected Video Generation via Warp-Guided LoRA Fine-Tuning

> **CogVideoX-2b + LoRA (r=16) · NVIDIA Warp Physics · Single H100 NVL**

---
## Key Result

| Metric | Base CogVideoX-2b | PDW (Ours) | Improvement |
|--------|:-----------------:|:----------:|:-----------:|
| Diffusion MSE (test_medium) | 2.2676 | 0.3861 | **+83.0%** |
| Diffusion MSE (test_very_high) | 2.2763 | 0.3790 | **+83.4%** |
| **Average** | **2.272** | **0.383** | **+83.2%** |

The fine-tuned model predicts noise on physics-correct reference frames with 83.2% lower MSE than the base model, confirming that the Warp physics prior was successfully injected into the denoising weights.
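For clarity, the headline +83.2% is the mean of the two per-split relative MSE reductions, not the reduction of the averaged MSEs (which would round to 83.1%):

```python
# (base MSE, PDW MSE) for test_medium and test_very_high
rows = [(2.2676, 0.3861), (2.2763, 0.3790)]
improvements = [(base - ours) / base * 100 for base, ours in rows]
average = sum(improvements) / len(improvements)
# improvements -> ~[83.0, 83.4]; average -> ~83.2
```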
---

## Model Description

**PhysicsDrivenWorld (PDW)** fine-tunes [CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b) using **Low-Rank Adaptation (LoRA)** supervised by an **NVIDIA Warp** rigid-body physics simulator.

Modern video diffusion models generate visually plausible but physically inconsistent results: objects float, bounce unrealistically, or violate Newton's laws. PDW injects a physics prior into the model's denoising weights by training on Warp-simulated ground-truth trajectories.

The training objective is standard **diffusion denoising MSE**, but it is applied exclusively to frames that are **physically correct by construction**, courtesy of the Warp simulator, so the model learns to denoise physics-consistent content better than physics-inconsistent content.
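A toy NumPy sketch of this objective may help make it concrete. `predict_noise` stands in for the LoRA-adapted CogVideoX denoiser and `x0` for a (VAE-encoded) physics-correct clip; this is an illustration of the loss, not the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cumulative signal level for a linear beta schedule (1000 steps, DDPM-style)
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def diffusion_mse(predict_noise, x0, t):
    """Noise-prediction MSE on a reference clip x0 (physics-correct in PDW)."""
    eps = rng.standard_normal(x0.shape)
    # Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    x_t = np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1.0 - alphas_cumprod[t]) * eps
    return float(np.mean((predict_noise(x_t, t) - eps) ** 2))
```

A model that always predicts zero noise scores an MSE near 1.0 under this objective, which puts the table's base-model (~2.27) and fine-tuned (~0.38) numbers in context.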
---

## Architecture

| Component | Details |
|-----------|---------|
| **Base Model** | CogVideoX-2b (2B-parameter text-to-video diffusion transformer) |
| **Adapter** | LoRA (rank r=16, alpha=32) |
| **Target Modules** | `to_q`, `to_k`, `to_v`, `to_out.0` (attention projections) |
| **Trainable Params** | ~3.7M of 2B total (0.185%) |
| **Physics Engine** | NVIDIA Warp 1.11.1 (GPU-accelerated rigid-body simulator) |
| **Simulation** | Semi-implicit Euler, 60 Hz, ground collision with restitution |
| **Training Loss** | Diffusion MSE on Warp-generated physics-correct frames |
| **LR Schedule** | 10-step linear warmup (1e-6 → 1e-4), then cosine decay to 1e-6 |
| **Hardware** | Single NVIDIA H100 NVL (99.9 GB VRAM); 13.9 GB peak usage |
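The adapter rows of the table correspond to a PEFT `LoraConfig` along these lines (a sketch reconstructed from the card's hyperparameters, not the verbatim training script):

```python
from peft import LoraConfig

# Adapter configuration implied by the Architecture table:
# rank 16, alpha 32, applied to the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
```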
---
## Training

### Hyperparameters

| Hyperparameter | Value |
|---------------|-------|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Peak learning rate | 1e-4 |
| Optimiser | AdamW (β=(0.9, 0.999), ε=1e-8, weight_decay=0.01) |
| Training steps | 200 (5 epochs × 40 steps) |
| Batch size | 1 |
| Diffusion timesteps | DDPMScheduler (1000 steps), random t ∈ [50, 950] |
| Precision | bfloat16 |
| Gradient clipping | 1.0 |
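The stated learning-rate schedule (10-step linear warmup from 1e-6 to the 1e-4 peak, then cosine decay back to 1e-6 over the remaining 190 steps) can be sketched as a function of the step index. This is an illustration of the stated schedule, not the verbatim training code:

```python
import math

def lr_at(step, total_steps=200, warmup_steps=10,
          peak=1e-4, start=1e-6, floor=1e-6):
    """Linear warmup to the peak LR, then cosine decay down to the floor."""
    if step < warmup_steps:
        # Linear ramp: start -> peak over the first `warmup_steps` steps
        return start + (peak - start) * step / warmup_steps
    # Cosine decay: peak -> floor over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```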
### Training Data: Warp Physics Scenarios

Training uses **synthetic videos rendered from NVIDIA Warp rigid-body simulations**, not real-world video. This eliminates real-video dataset bias and provides physically correct ground-truth trajectories as supervision.

| Scenario | Drop Height | Restitution | Physics Behaviour |
|----------|:-----------:|:-----------:|-------------------|
| ball_drop_low | 2 m | 0.70 | Low-energy drop, high bounce |
| ball_drop_high | 5 m | 0.60 | Standard gravity, moderate bounce |
| ball_elastic | 3 m | 0.85 | Very elastic; multiple high bounces |
| ball_heavy | 4 m | 0.30 | Inelastic; dead stop after first bounce |
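The dynamics behind these scenarios (semi-implicit Euler at 60 Hz with a restitution-based ground bounce, per the Architecture table) can be sketched in plain Python. This is a stand-in for the Warp simulation, not Warp's actual API or renderer:

```python
def simulate_ball(drop_height, restitution, fps=60, steps=240, g=9.81):
    """Semi-implicit Euler ball drop with a restitution-based ground bounce."""
    dt = 1.0 / fps
    y, v = drop_height, 0.0
    trajectory = []
    for _ in range(steps):
        v -= g * dt               # semi-implicit: update velocity first...
        y += v * dt               # ...then position with the updated velocity
        if y < 0.0:               # ground collision at y = 0
            y = 0.0
            v = -v * restitution  # reflect and dissipate energy
        trajectory.append(y)
    return trajectory
```

Since kinetic energy scales with v², `ball_heavy` (restitution 0.30) loses about 91% of its energy per bounce (1 − 0.30²), hence the dead stop, while `ball_elastic` (0.85) retains roughly 72% and keeps bouncing.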
### Convergence

| Epoch | Avg Loss | Notes |
|-------|----------|-------|
| 1 | 1.512 | Warmup spike (expected) |
| 2 | ~0.45 | Fast learning |
| 5 | **0.341** | Converged; 77% drop from epoch 1 |

---
## How to Use

### Load the Model
```python