---
language:
- en
license: apache-2.0
tags:
- text-to-video
- lora
- physics
- cogvideox
- diffusion
- peft
- warp
- rigid-body
- fine-tuned
base_model: THUDM/CogVideoX-2b
pipeline_tag: text-to-video
library_name: diffusers
---

# PhysicsDrivenWorld (PDW)
### Physics-Corrected Video Generation via Warp-Guided LoRA Fine-Tuning

> **CogVideoX-2b + LoRA (r=16) · NVIDIA Warp Physics · Single H100 NVL**

---
## Key Result

| Metric | Base CogVideoX-2b | PDW (Ours) | Improvement |
|--------|:-----------------:|:----------:|:-----------:|
| Diffusion MSE (test_medium) | 2.2676 | 0.3861 | **+83.0%** |
| Diffusion MSE (test_very_high) | 2.2763 | 0.3790 | **+83.4%** |
| **Average** | **2.272** | **0.383** | **+83.2%** |

The fine-tuned model predicts noise on physics-correct reference frames with 83.2% lower MSE than the base model, confirming that the Warp physics prior was successfully injected into the denoising weights.
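For clarity, the headline +83.2% is the mean of the two per-split relative MSE reductions, not the reduction of the averaged MSEs (which would round to 83.1%):

```python
# (base MSE, PDW MSE) for test_medium and test_very_high
rows = [(2.2676, 0.3861), (2.2763, 0.3790)]
improvements = [(base - ours) / base * 100 for base, ours in rows]
average = sum(improvements) / len(improvements)
# improvements -> ~[83.0, 83.4]; average -> ~83.2
```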
---

## Model Description

**PhysicsDrivenWorld (PDW)** fine-tunes [CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b) using **Low-Rank Adaptation (LoRA)** supervised by an **NVIDIA Warp** rigid-body physics simulator.

Modern video diffusion models generate visually plausible but physically inconsistent results: objects float, bounce unrealistically, or violate Newton's laws. PDW injects a physics prior into the model's denoising weights by training on Warp-simulated ground-truth trajectories.

The training objective is standard **diffusion denoising MSE**, but it is applied exclusively to frames that are **physically correct by construction**, courtesy of the Warp simulator, so the model learns to denoise physics-consistent content better than physics-inconsistent content.
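A toy NumPy sketch of this objective may help make it concrete. `predict_noise` stands in for the LoRA-adapted CogVideoX denoiser and `x0` for a (VAE-encoded) physics-correct clip; this is an illustration of the loss, not the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cumulative signal level for a linear beta schedule (1000 steps, DDPM-style)
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def diffusion_mse(predict_noise, x0, t):
    """Noise-prediction MSE on a reference clip x0 (physics-correct in PDW)."""
    eps = rng.standard_normal(x0.shape)
    # Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    x_t = np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1.0 - alphas_cumprod[t]) * eps
    return float(np.mean((predict_noise(x_t, t) - eps) ** 2))
```

A model that always predicts zero noise scores an MSE near 1.0 under this objective, which puts the table's base-model (~2.27) and fine-tuned (~0.38) numbers in context.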
---

## Architecture

| Component | Details |
|-----------|---------|
| **Base Model** | CogVideoX-2b (2B-parameter text-to-video diffusion transformer) |
| **Adapter** | LoRA (rank r=16, alpha=32) |
| **Target Modules** | `to_q`, `to_k`, `to_v`, `to_out.0` (attention projections) |
| **Trainable Params** | ~3.7M of 2B total (0.185%) |
| **Physics Engine** | NVIDIA Warp 1.11.1 (GPU-accelerated rigid-body simulator) |
| **Simulation** | Semi-implicit Euler, 60 Hz, ground collision with restitution |
| **Training Loss** | Diffusion MSE on Warp-generated physics-correct frames |
| **LR Schedule** | 10-step linear warmup (1e-6 → 1e-4), then cosine decay to 1e-6 |
| **Hardware** | Single NVIDIA H100 NVL (99.9 GB VRAM); 13.9 GB peak usage |
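The adapter rows of the table correspond to a PEFT `LoraConfig` along these lines (a sketch reconstructed from the card's hyperparameters, not the verbatim training script):

```python
from peft import LoraConfig

# Adapter configuration implied by the Architecture table:
# rank 16, alpha 32, applied to the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
```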
---
## Training

### Hyperparameters

| Hyperparameter | Value |
|---------------|-------|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Peak learning rate | 1e-4 |
| Optimiser | AdamW (β=(0.9, 0.999), ε=1e-8, weight_decay=0.01) |
| Training steps | 200 (5 epochs × 40 steps) |
| Batch size | 1 |
| Diffusion timesteps | DDPMScheduler (1000 steps), random t ∈ [50, 950] |
| Precision | bfloat16 |
| Gradient clipping | 1.0 |
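The stated learning-rate schedule (10-step linear warmup from 1e-6 to the 1e-4 peak, then cosine decay back to 1e-6 over the remaining 190 steps) can be sketched as a function of the step index. This is an illustration of the stated schedule, not the verbatim training code:

```python
import math

def lr_at(step, total_steps=200, warmup_steps=10,
          peak=1e-4, start=1e-6, floor=1e-6):
    """Linear warmup to the peak LR, then cosine decay down to the floor."""
    if step < warmup_steps:
        # Linear ramp: start -> peak over the first `warmup_steps` steps
        return start + (peak - start) * step / warmup_steps
    # Cosine decay: peak -> floor over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```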
### Training Data: Warp Physics Scenarios

Training uses **synthetic videos rendered from NVIDIA Warp rigid-body simulations**, not real-world video. This eliminates real-video dataset bias and provides physically correct ground-truth trajectories as supervision.

| Scenario | Drop Height | Restitution | Physics Behaviour |
|----------|:-----------:|:-----------:|-------------------|
| ball_drop_low | 2 m | 0.70 | Low-energy drop, high bounce |
| ball_drop_high | 5 m | 0.60 | Standard gravity, moderate bounce |
| ball_elastic | 3 m | 0.85 | Very elastic; multiple high bounces |
| ball_heavy | 4 m | 0.30 | Inelastic; dead stop after first bounce |
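The dynamics behind these scenarios (semi-implicit Euler at 60 Hz with a restitution-based ground bounce, per the Architecture table) can be sketched in plain Python. This is a stand-in for the Warp simulation, not Warp's actual API or renderer:

```python
def simulate_ball(drop_height, restitution, fps=60, steps=240, g=9.81):
    """Semi-implicit Euler ball drop with a restitution-based ground bounce."""
    dt = 1.0 / fps
    y, v = drop_height, 0.0
    trajectory = []
    for _ in range(steps):
        v -= g * dt               # semi-implicit: update velocity first...
        y += v * dt               # ...then position with the updated velocity
        if y < 0.0:               # ground collision at y = 0
            y = 0.0
            v = -v * restitution  # reflect and dissipate energy
        trajectory.append(y)
    return trajectory
```

Since kinetic energy scales with v², `ball_heavy` (restitution 0.30) loses about 91% of its energy per bounce (1 − 0.30²), hence the dead stop, while `ball_elastic` (0.85) retains roughly 72% and keeps bouncing.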
### Convergence

| Epoch | Avg Loss | Notes |
|-------|----------|-------|
| 1 | 1.512 | Warmup spike (expected) |
| 2 | ~0.45 | Fast learning |
| 5 | **0.341** | Converged; 77% drop from epoch 1 |

---
## How to Use

### Load the Model
```python