---
language:
- en
license: apache-2.0
tags:
- text-to-video
- lora
- physics
- cogvideox
- diffusion
- peft
- warp
- rigid-body
- fine-tuned
base_model: THUDM/CogVideoX-2b
pipeline_tag: text-to-video
library_name: diffusers
---

# PhysicsDrivenWorld (PDW)
### Physics-Corrected Video Generation via Warp-Guided LoRA Fine-Tuning

> **CogVideoX-2b + LoRA (r=16) · NVIDIA Warp Physics · Single H100 NVL**

---

## Key Result

| Metric | Base CogVideoX-2b | PDW (Ours) | MSE Reduction |
|--------|:-----------------:|:----------:|:-------------:|
| Diffusion MSE (test_medium) | 2.2676 | 0.3861 | **83.0%** |
| Diffusion MSE (test_very_high) | 2.2763 | 0.3790 | **83.4%** |
| **Average** | **2.272** | **0.383** | **83.2%** |

The fine-tuned model achieves **83.2% lower denoising MSE** on physics-correct reference frames than the base model, indicating that the Warp physics prior was successfully injected into the denoising weights.
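As a quick sanity check, the percentage column follows directly from the two MSE columns as the relative reduction in MSE:

```python
# Reproduce the reduction percentages from the reported MSE values.
base = {"test_medium": 2.2676, "test_very_high": 2.2763}
pdw = {"test_medium": 0.3861, "test_very_high": 0.3790}

# Relative MSE reduction per test split, in percent.
reduction = {k: 100.0 * (base[k] - pdw[k]) / base[k] for k in base}
avg = sum(reduction.values()) / len(reduction)

print({k: round(v, 1) for k, v in reduction.items()})
print(round(avg, 1))  # 83.2
```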

---

## Model Description

**PhysicsDrivenWorld (PDW)** fine-tunes [CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b) with **Low-Rank Adaptation (LoRA)**, using an **NVIDIA Warp** rigid-body physics simulator for supervision.

Modern video diffusion models generate visually plausible but physically inconsistent results: objects float, bounce unrealistically, or violate Newton's laws. PDW injects a physics prior into the model's denoising weights by training on Warp-simulated ground-truth trajectories.

The training objective is standard **diffusion denoising MSE**, but applied exclusively to frames that are **physically correct by construction** (generated by the Warp simulator), so the model learns to denoise physics-consistent content better than physics-inconsistent content.
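Concretely, this is the usual DDPM noise-prediction loss. A minimal NumPy sketch of one training/evaluation step, with dummy tensors standing in for the CogVideoX latents and the transformer's noise prediction (shapes and the schedule value are illustrative, not the real ones):

```python
import numpy as np

rng = np.random.default_rng(0)

x0 = rng.standard_normal((4, 8, 8))   # stand-in for clean, physics-correct latents
eps = rng.standard_normal(x0.shape)   # Gaussian noise sampled for this step
alpha_bar = 0.5                       # cumulative schedule term at a random timestep t

# DDPM forward process: noise the clean latents at timestep t.
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# A real step would run the (LoRA-adapted) transformer on x_t; here we fake
# a prediction that is close to, but not exactly, the true noise.
eps_hat = eps + 0.1 * rng.standard_normal(eps.shape)

# Diffusion denoising MSE: the training loss and the evaluation metric above.
loss = float(np.mean((eps_hat - eps) ** 2))
```

The same quantity, computed on held-out Warp-rendered frames, is what the Key Result table reports.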

---

## Architecture

| Component | Details |
|-----------|---------|
| **Base Model** | CogVideoX-2b (2B parameter text-to-video diffusion transformer) |
| **Adapter** | LoRA (rank r=16, alpha=32) |
| **Target Modules** | `to_q`, `to_k`, `to_v`, `to_out.0` (attention projections) |
| **Trainable Params** | ~3.7M of 2B total (0.185%) |
| **Physics Engine** | NVIDIA Warp 1.11.1 (GPU-accelerated rigid-body simulator) |
| **Simulation** | Semi-implicit Euler, 60 Hz, ground collision with restitution |
| **Training Loss** | Diffusion MSE on Warp-generated physics-correct frames |
| **LR Schedule** | 10-step linear warmup (1e-6 → 1e-4), then cosine decay to 1e-6 |
| **Hardware** | Single NVIDIA H100 NVL (99.9 GB VRAM), 13.9 GB peak usage |
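The LoRA update itself is just a low-rank residual added to each frozen attention projection. A toy NumPy illustration of the mechanism, with a made-up hidden size (CogVideoX's real dimensions are much larger):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                # toy hidden size, NOT the real CogVideoX dimension
r, alpha = 16, 32     # LoRA rank and alpha from the table above

W = rng.standard_normal((d, d))          # frozen base projection (e.g. to_q)
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialised

def lora_forward(x):
    # Frozen path plus scaled low-rank update: y = x W^T + (alpha/r) x A^T B^T
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d))
y = lora_forward(x)  # with B at zero, identical to the base model's output
```

Zero-initialising `B` makes the adapted model match the base model exactly at step 0, so fine-tuning starts from the pretrained behaviour; only `A` and `B` (2·r·d parameters per projection) are trained, which is why the trainable fraction is so small.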

---

## Training

### Hyperparameters

| Hyperparameter | Value |
|---------------|-------|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Peak learning rate | 1e-4 |
| Optimiser | AdamW (β=(0.9, 0.999), ε=1e-8, weight_decay=0.01) |
| Training steps | 200 (5 epochs × 40 steps) |
| Batch size | 1 |
| Diffusion timesteps | DDPMScheduler (1000 steps), random t ∈ [50, 950] |
| Precision | bfloat16 |
| Gradient clipping | 1.0 |
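The warmup-plus-cosine schedule from the tables can be written as a small function. A sketch matching the stated numbers (10-step linear warmup from 1e-6 to 1e-4, then cosine decay back to 1e-6 over the 200 training steps); this is illustrative, not the exact training code:

```python
import math

def lr_at(step, warmup=10, total=200, lr_min=1e-6, lr_max=1e-4):
    """Linear warmup to lr_max, then cosine decay back to lr_min."""
    if step < warmup:
        return lr_min + (lr_max - lr_min) * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 -> 1 over the decay phase
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# lr_at(0) == 1e-6, lr_at(10) == 1e-4 (peak), lr_at(200) == 1e-6
```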

### Training Data β€” Warp Physics Scenarios

Training uses **synthetic videos rendered from NVIDIA Warp rigid-body simulations**, not real-world video. This sidesteps real-world dataset bias and provides ground-truth, physically correct trajectories as supervision.

| Scenario | Drop Height | Restitution | Physics Behaviour |
|----------|:-----------:|:-----------:|-------------------|
| ball_drop_low | 2m | 0.70 | Low-energy drop, high bounce |
| ball_drop_high | 5m | 0.60 | Standard gravity, moderate bounce |
| ball_elastic | 3m | 0.85 | Very elastic: multiple high bounces |
| ball_heavy | 4m | 0.30 | Inelastic: dead stop after first bounce |
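Each scenario boils down to the same integrator. A plain-Python stand-in for the physics step (semi-implicit Euler at 60 Hz with a restitution-based ground bounce); the real simulator uses Warp's GPU kernels and full rigid-body state, so this is only a sketch of the dynamics:

```python
def simulate_ball_drop(drop_height, restitution, g=9.81, hz=60, steps=300):
    """Height trajectory (metres) of a ball dropped onto the ground plane."""
    dt = 1.0 / hz
    y, v = drop_height, 0.0
    trajectory = []
    for _ in range(steps):
        v -= g * dt          # semi-implicit Euler: update velocity first...
        y += v * dt          # ...then position using the *new* velocity
        if y <= 0.0:         # ground collision: clamp and reflect with restitution
            y = 0.0
            v = -v * restitution
        trajectory.append(y)
    return trajectory

# e.g. ball_drop_low: 2 m drop, e = 0.70 -> first rebound peaks near e^2 * 2 m ~ 1 m
traj = simulate_ball_drop(2.0, 0.70)
```

The restitution coefficient `e` scales the rebound speed, so each rebound height shrinks by roughly `e**2`, which is what differentiates the four scenarios above.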

### Convergence

| Epoch | Avg Loss | Notes |
|-------|----------|-------|
| 1 | 1.512 | Warmup spike (expected) |
| 2 | ~0.45 | Fast learning |
| 5 | **0.341** | Converged: 77% drop from epoch 1 |

---

## How to Use

### Load the Model
```python