Update README.md
README.md
CHANGED
@@ -1,3 +1,106 @@
| 1 |
-
---
|
| 2 |
-
|
| 3 |
-
-
---
language:
- en
license: apache-2.0
tags:
- text-to-video
- lora
- physics
- cogvideox
- diffusion
- peft
- warp
- rigid-body
- fine-tuned
base_model: THUDM/CogVideoX-2b
pipeline_tag: text-to-video
library_name: diffusers
---

# PhysicsDrivenWorld (PDW)
### Physics-Corrected Video Generation via Warp-Guided LoRA Fine-Tuning

> **CogVideoX-2b + LoRA (r=16) · NVIDIA Warp Physics · Single H100 NVL**

---

## Key Result

| Metric | Base CogVideoX-2b | PDW (Ours) | Improvement |
|--------|:-----------------:|:----------:|:-----------:|
| Diffusion MSE (test_medium) | 2.2676 | 0.3861 | **+83.0%** |
| Diffusion MSE (test_very_high) | 2.2763 | 0.3790 | **+83.4%** |
| **Average** | **2.272** | **0.383** | **+83.2%** |

On physics-correct reference frames, the fine-tuned model's noise-prediction MSE is **83.2% lower** on average than the base model's (lower is better). This indicates that the Warp physics prior has been absorbed into the denoising weights.

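
The improvement column can be reproduced from the MSE values in the table as a relative reduction. A minimal sketch, in which the function name `relative_reduction` and the dictionary layout are illustrative and not taken from the PDW code:

```python
def relative_reduction(base_mse: float, tuned_mse: float) -> float:
    """Fraction by which the fine-tuned MSE is lower than the base MSE."""
    return (base_mse - tuned_mse) / base_mse


# (base, fine-tuned) diffusion MSE values copied from the Key Result table.
results = {
    "test_medium": (2.2676, 0.3861),
    "test_very_high": (2.2763, 0.3790),
}

for split, (base, tuned) in results.items():
    print(f"{split}: {relative_reduction(base, tuned):.1%} lower MSE")
```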

---

## Model Description

**PhysicsDrivenWorld (PDW)** fine-tunes [CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b) using **Low-Rank Adaptation (LoRA)** supervised by an **NVIDIA Warp** rigid-body physics simulator.

Modern video diffusion models often generate results that are visually plausible but physically inconsistent: objects float, bounce unrealistically, or otherwise violate Newton's laws. PDW injects a physics prior into the model's denoising weights by training on Warp-simulated ground-truth trajectories.

The training objective is the standard **diffusion denoising MSE**, applied exclusively to frames that are **physically correct by construction** because they come from the Warp simulator. As a result, the model learns to denoise physics-consistent content better than physics-inconsistent content.

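
The objective described above can be sketched in plain Python on a toy vector. Several things here are assumptions rather than details from this card: ε-prediction as the parameterisation, a linear beta schedule from 1e-4 to 0.02, and the names `alpha_bar_schedule`, `add_noise`, `denoising_loss` and `predict_noise` (a hypothetical stand-in for the LoRA-adapted transformer). Only the 1000-step schedule and the timestep range t ∈ [50, 950] come from the Hyperparameters table.

```python
import math
import random

NUM_TRAIN_TIMESTEPS = 1000  # from the Hyperparameters table


def alpha_bar_schedule(beta_start=1e-4, beta_end=0.02, n=NUM_TRAIN_TIMESTEPS):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(n):
        beta = beta_start + (beta_end - beta_start) * i / (n - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


ALPHA_BARS = alpha_bar_schedule()


def add_noise(x0, eps, t):
    """Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = math.sqrt(ALPHA_BARS[t])
    b = math.sqrt(1.0 - ALPHA_BARS[t])
    return [a * x + b * e for x, e in zip(x0, eps)]


def denoising_loss(predict_noise, x0, rng):
    """One denoising-MSE evaluation on a clean, physics-correct sample x0."""
    t = rng.randint(50, 950)                 # random timestep, bounds inclusive
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # target noise
    x_t = add_noise(x0, eps, t)
    eps_pred = predict_noise(x_t, t)         # hypothetical model call
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


# Toy usage: a predictor that always outputs zeros has a loss near E[eps^2].
rng = random.Random(0)
toy_latent = [0.5, -1.0, 2.0, 0.0]
print(denoising_loss(lambda x_t, t: [0.0] * len(x_t), toy_latent, rng))
```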

---

## Architecture

| Component | Details |
|-----------|---------|
| **Base Model** | CogVideoX-2b (2B-parameter text-to-video diffusion transformer) |
| **Adapter** | LoRA (rank r=16, alpha=32) |
| **Target Modules** | `to_q`, `to_k`, `to_v`, `to_out.0` (attention projections) |
| **Trainable Params** | ~3.7M of 2B total (0.185%) |
| **Physics Engine** | NVIDIA Warp 1.11.1 (GPU-accelerated rigid-body simulator) |
| **Simulation** | Semi-implicit Euler, 60 Hz, ground collision with restitution |
| **Training Loss** | Diffusion MSE on Warp-generated physics-correct frames |
| **LR Schedule** | 10-step linear warmup (1e-6 → 1e-4), then cosine decay to 1e-6 |
| **Hardware** | Single NVIDIA H100 NVL (99.9 GB VRAM), 13.9 GB peak usage |

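
The adapter rows above map naturally onto the keyword arguments of PEFT's `LoraConfig`. The sketch below only collects them in a dictionary so that it runs without extra packages; the variable name `lora_kwargs` is illustrative, and how the PDW training script actually builds its config is not shown in this card.

```python
# Keyword arguments mirroring the Adapter and Target Modules rows.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,  # value listed under Hyperparameters
    "target_modules": ["to_q", "to_k", "to_v", "to_out.0"],
}

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")

# With the `peft` package installed, these would typically be passed as:
#   from peft import LoraConfig
#   config = LoraConfig(**lora_kwargs)
```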

---

## Training

### Hyperparameters

| Hyperparameter | Value |
|----------------|-------|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Peak learning rate | 1e-4 |
| Optimiser | AdamW (β=(0.9, 0.999), ε=1e-8, weight_decay=0.01) |
| Training steps | 200 (5 epochs × 40 steps) |
| Batch size | 1 |
| Diffusion timesteps | DDPMScheduler (1000 steps), random t ∈ [50, 950] |
| Precision | bfloat16 |
| Gradient clipping | 1.0 |

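
Combining the 200 training steps with the learning-rate schedule listed under Architecture (10-step linear warmup from 1e-6 to 1e-4, then cosine decay to 1e-6) gives a schedule along the following lines. This is a sketch under assumptions: the exact step indexing and the choice to spread the cosine decay over the remaining 190 steps are not stated in this card, and `learning_rate` is an illustrative name.

```python
import math

WARMUP_STEPS = 10
TOTAL_STEPS = 200
LR_MIN = 1e-6
LR_PEAK = 1e-4


def learning_rate(step: int) -> float:
    """Linear warmup to LR_PEAK, then cosine decay back down to LR_MIN."""
    if step < WARMUP_STEPS:
        frac = step / WARMUP_STEPS
        return LR_MIN + (LR_PEAK - LR_MIN) * frac
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return LR_MIN + 0.5 * (LR_PEAK - LR_MIN) * (1.0 + math.cos(math.pi * progress))


for step in (0, 5, 10, 105, 200):
    print(f"step {step:3d}: lr = {learning_rate(step):.2e}")
```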

### Training Data: Warp Physics Scenarios

Training uses **synthetic videos rendered from NVIDIA Warp rigid-body simulations**, not real-world video. This avoids the biases of real-world footage and provides physically correct ground-truth trajectories as supervision.

| Scenario | Drop Height | Restitution | Physics Behaviour |
|----------|:-----------:|:-----------:|-------------------|
| ball_drop_low | 2 m | 0.70 | Low-energy drop, high bounce |
| ball_drop_high | 5 m | 0.60 | Standard gravity, moderate bounce |
| ball_elastic | 3 m | 0.85 | Very elastic: multiple high bounces |
| ball_heavy | 4 m | 0.30 | Inelastic: minimal rebound after first bounce |

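
To make the scenario parameters concrete, here is a plain-Python sketch of a vertical ball drop using semi-implicit Euler at 60 Hz with a restitution-based ground bounce. It does not use the Warp API and is not the PDW simulation code; gravity of 9.81 m/s², the 4-second duration, and the names `Scenario`, `simulate` and `rebound_peak` are assumptions made for illustration.

```python
from dataclasses import dataclass

GRAVITY = 9.81  # m/s^2 (assumed value)
FPS = 60        # simulation rate from the Architecture table
DT = 1.0 / FPS


@dataclass(frozen=True)
class Scenario:
    name: str
    drop_height: float  # metres
    restitution: float  # fraction of speed kept after a ground impact


SCENARIOS = [
    Scenario("ball_drop_low", 2.0, 0.70),
    Scenario("ball_drop_high", 5.0, 0.60),
    Scenario("ball_elastic", 3.0, 0.85),
    Scenario("ball_heavy", 4.0, 0.30),
]


def simulate(scenario: Scenario, seconds: float = 4.0) -> list[float]:
    """Return the ball height at every frame of a vertical drop."""
    y, v = scenario.drop_height, 0.0
    heights = [y]
    for _ in range(int(seconds * FPS)):
        v -= GRAVITY * DT  # semi-implicit Euler: update velocity first...
        y += v * DT        # ...then position, using the new velocity
        if y < 0.0:        # ground contact: clamp and reflect with energy loss
            y = 0.0
            v = -v * scenario.restitution
        heights.append(y)
    return heights


def rebound_peak(heights: list[float]) -> float:
    """Highest point reached after the first ground contact."""
    first_contact = heights.index(0.0)
    return max(heights[first_contact:])


for s in SCENARIOS:
    print(f"{s.name}: rebound peak {rebound_peak(simulate(s)):.2f} m")
```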

### Convergence

| Epoch | Avg Loss | Notes |
|-------|----------|-------|
| 1 | 1.512 | Warmup spike (expected) |
| 2 | ~0.45 | Fast learning |
| 5 | **0.341** | Converged; 77% drop from epoch 1 |

---

## How to Use

### Load the Model

```python
|