DreamZero-SO101 (LoRA, 72K steps)
A World Action Model (WAM) for the SO-101 robot arm, fine-tuned from DreamZero (Wan2.1-I2V-14B + joint action heads). Given a single camera observation and a natural-language task, it jointly predicts:
- 24 future 6-DOF joint actions (shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll, gripper)
- 33 future video frames showing the predicted task execution
Both modalities are denoised in a single forward pass using flow matching, so the model is internally consistent: the predicted actions and the predicted video describe the same imagined rollout.
2× H100 80GB · 72K steps · ~127 hours · rank-4 LoRA · joint flow matching. Final action loss: 0.0015 (166× drop) · final dynamics loss: 0.0298 (6× drop)
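The single-pass joint denoising described above can be sketched as a few Euler integration steps over a shared velocity field. This is a toy illustration, not the actual DreamZero code: `v_theta` stands in for the model's joint velocity prediction (in the real checkpoint, one DiT forward pass that outputs velocities for both video latents and actions).

```python
def joint_euler_denoise(v_theta, video_noise, action_noise, steps=4):
    """Toy sketch of joint flow-matching inference with Euler steps.

    v_theta(video, action, t) is a stand-in for the model: a callable
    returning (d_video, d_action), i.e. velocities for both modalities
    from one forward pass. `steps=4` mirrors the card's inference setting.
    """
    video, action = video_noise, action_noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        dv, da = v_theta(video, action, t)  # one joint forward pass
        video = video + dt * dv             # Euler update toward the data
        action = action + dt * da           # actions ride the same trajectory
    return video, action
```

Because both modalities are updated from the same velocity prediction at every step, the imagined video and the action chunk cannot drift into describing different futures.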
Why a "World Action Model"?
Most robot policies output actions and treat the world as a black box. A World Action Model also predicts what the world will look like during the rollout. This has three useful properties:
- Self-consistency – actions and predicted video share the same denoising trajectory, so the model has to imagine a coherent future
- Interpretability – you can literally watch what the policy "thinks" will happen before sending actions to the robot
- Sim-free evaluation – the predicted video gives you a free imagined rollout you can score offline
Quick Start
```bash
pip install huggingface_hub safetensors torch

# Download base + LoRA
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./checkpoints/Wan2.1-I2V-14B-480P
huggingface-cli download Vizuara/dreamzero-so101-lora --local-dir ./checkpoints/dreamzero-so101-lora

# Clone DreamZero codebase + SO-101 patch repo side by side, then apply the patch
git clone https://github.com/dreamzero0/dreamzero.git
git clone https://github.com/vizuara/dreamzero-so101.git
cd dreamzero && pip install -e .
git apply ../dreamzero-so101/patches/so101_embodiment.patch
cd ..

# Run inference
python dreamzero-so101/scripts/infer_demo.py \
  --model-path ./checkpoints/dreamzero-so101-lora \
  --base-model-path ./checkpoints/Wan2.1-I2V-14B-480P \
  --image ./sample_obs.jpg \
  --prompt "Pick up the red cube and place it in the bowl"
```
Model Details
| | |
|---|---|
| Base model | Wan-AI/Wan2.1-I2V-14B-480P |
| Backbone | DiT, 40 layers, d=5120, 40 heads, 14B params |
| Tokenizers | UMT5-XXL (text) · CLIP ViT-H/14 (image) · WanVAE (video, 4×8×8) |
| Action head | Causal Wan + flow-matching action transformer |
| Action format | Relative joint positions, 6-DOF (padded to 32) |
| State format | Joint positions, 6-DOF (padded to 64) |
| Video resolution | 320 × 176 |
| Frames | 33 RGB → 9 latent (4× temporal compression) |
| Action horizon | 24 steps |
| Inference steps | 4 Euler steps (~600 ms on H100) |
| Trainable params | ~50M (LoRA) + action heads |
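The 6-DOF action and state vectors are padded to the model's fixed 32- and 64-dim slots. A minimal sketch of that conversion, assuming a trailing-zero padding convention (check the patched embodiment config for the checkpoint's actual layout):

```python
import numpy as np

# Joint order from the model card
SO101_JOINTS = ["shoulder_pan", "shoulder_lift", "elbow_flex",
                "wrist_flex", "wrist_roll", "gripper"]

def pad_action(joints, width=32):
    """Zero-pad a 6-DOF relative-joint action into the 32-dim action slot.

    Trailing-zero padding is an assumption about the checkpoint's
    convention, not something stated in the card.
    """
    out = np.zeros(width, dtype=np.float32)
    out[: len(joints)] = joints
    return out

def unpad_action(vec, dof=6):
    """Recover the 6 SO-101 joint values from a padded action vector."""
    return vec[:dof]
```

The same pattern applies to the state vector with `width=64`.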
LoRA Configuration
| | |
|---|---|
| Rank | 4 |
| Alpha | 4 |
| Targets | q, k, v, o, ffn.0, ffn.2 |
| Init | Kaiming |
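With rank = alpha = 4, the LoRA scale alpha/rank is exactly 1. A sketch of how such an adapter folds into a frozen base weight (generic LoRA math, not DreamZero-specific code):

```python
import numpy as np

def merge_lora(w, a, b, alpha=4, rank=4):
    """Fold a LoRA update into a frozen base weight.

    w: (out, in) base weight
    a: (rank, in) down-projection (Kaiming-initialized, per the card)
    b: (out, rank) up-projection (zero at the start of training, so the
       adapter initially leaves the base model unchanged)
    Effective update: (alpha / rank) * b @ a, i.e. scale 1 here.
    """
    return w + (alpha / rank) * (b @ a)
```

The zero-initialized `b` is why training starts from the base model's behavior: `b @ a` is zero until gradients flow.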
Training Recipe
| | |
|---|---|
| Hardware | 2× H100 80GB |
| Optimizer | AdamW (DeepSpeed ZeRO-2) |
| Learning rate | 1e-4, cosine decay |
| Warmup | 5% |
| Weight decay | 1e-5 |
| Batch size | 1 per GPU (effective 2) |
| Precision | bfloat16 + tf32 |
| Gradient checkpointing | Yes |
| Steps trained | 72,000 (loss converged; 100K planned but stopped early) |
| Wall-clock | ~127 hours |
| Dataset | whosricky/so101-megamix-v1 (400 episodes, 8 tasks, 3 cameras) |
| Loss | Joint flow-matching velocity (action + dynamics) with uncertainty weighting |
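"Uncertainty weighting" for a two-term loss is commonly the homoscedastic form of Kendall et al., where learned log-variances balance the action and dynamics terms. Whether DreamZero uses exactly this form is an assumption; this sketch just shows the mechanism:

```python
import math

def uncertainty_weighted_loss(action_loss, dynamics_loss,
                              log_var_a=0.0, log_var_d=0.0):
    """Homoscedastic uncertainty weighting (Kendall et al. style) of the
    two flow-matching velocity losses. The log-variances are learned
    parameters; the +log_var terms keep them from collapsing to -inf.
    This exact form is an assumption about DreamZero's implementation.
    """
    return (math.exp(-log_var_a) * action_loss + log_var_a
            + math.exp(-log_var_d) * dynamics_loss + log_var_d)
```

With both log-variances at 0 this reduces to a plain sum; during training the model can down-weight the noisier term automatically.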
Results
Final Loss
| Metric | Initial | Final | Reduction |
|---|---|---|---|
| Action loss | 0.249 | 0.0015 | 166× |
| Dynamics (video) loss | 0.176 | 0.0298 | 6× |
The dynamics loss converged around step 30K and remained flat. The action loss continued to slowly improve through 70K. Training was halted at step ~72K due to a pod migration; the loss curve indicates the model is well-converged and additional steps would have offered marginal returns.
Files
| File | Size | Description |
|---|---|---|
| `model.safetensors` | 207 MB | LoRA weights + action heads (bf16) |
| `config.json` | 4 KB | Model configuration (architecture, action head, LoRA settings) |
| `loss_log.jsonl` | 917 KB | Per-step training loss (15,912 entries) |
| `training_curve.png` | 220 KB | Training loss visualization |
Intended Use
- ✅ Research & education – studying world models and joint video/action prediction for robotics
- ✅ SO-101 manipulation policy bootstrap – fine-tune further on your own data
- ✅ Offline rollout visualization – predict-before-execute to debug task setups
- ⚠️ Real-robot deployment – possible, but requires safety wrappers and additional fine-tuning on your specific embodiment / camera setup
- ❌ Other arm types – trained only on SO-101; do not expect zero-shot transfer
Limitations
- Trained only on `whosricky/so101-megamix-v1` (400 episodes, 8 tasks). Out-of-distribution objects/scenes will degrade quality.
- Joint trajectories are predicted, not torques – your low-level controller must accept position targets.
- Video predictions are 33-frame snippets (~1 sec at 30 FPS). Longer horizons require chunked rollout.
- LoRA only – for best quality, do a full fine-tune (see the `dreamzero-so101` repo).
- Trained at 320×176 – higher resolutions need re-training.
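Chunked rollout for longer horizons can be sketched as re-conditioning on the last imagined frame. `predict` here is a stand-in for the model call (assumed to return one horizon's frames and actions), not an actual API in the repo:

```python
def chunked_rollout(predict, obs, prompt, chunks=3):
    """Extend past the 33-frame / 24-action horizon by chaining predictions.

    predict(obs, prompt) is a hypothetical wrapper around the model that
    returns (frames, actions) for one horizon; the last imagined frame
    becomes the conditioning image for the next chunk.
    """
    all_frames, all_actions = [], []
    for _ in range(chunks):
        frames, actions = predict(obs, prompt)
        all_frames.extend(frames)
        all_actions.extend(actions)
        obs = frames[-1]  # re-condition on the final imagined frame
    return all_frames, all_actions
```

Note that prediction errors compound across chunks, since each chunk is conditioned on an imagined rather than observed frame.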
Citation
```bibtex
@misc{dreamzero-so101-2026,
  title        = {DreamZero-SO101: A World Action Model for the SO-101 Robot Arm},
  author       = {Vizuara AI Labs},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Vizuara/dreamzero-so101-lora}},
  note         = {LoRA fine-tune of DreamZero on aggregated SO-101 LeRobot datasets}
}
```
Please also cite the underlying work:
- DreamZero – Liu et al., "Learning World Action Models from Video", GEAR Lab, 2025
- Wan2.1 – Wan-AI, "Wan2.1-I2V-14B", 2025
- SO-101 – TheRobotStudio, "SO-100/SO-101 Open-Source Robot Arm"
- LeRobot – HuggingFace, 2024
Acknowledgments
- DreamZero by GEAR Lab – Apache 2.0 codebase
- Wan2.1 – video generation backbone
- LeRobot – dataset format and community
- SO-101 dataset contributors on HuggingFace Hub
License
Apache 2.0 (same as DreamZero)