DreamZero-SO101 (LoRA, 72K steps)
A World Action Model (WAM) for the SO-101 robot arm, fine-tuned from DreamZero (Wan2.1-I2V-14B + joint action heads). Given a single camera observation and a natural-language task, it jointly predicts:
- 24 future 6-DOF joint actions (shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll, gripper)
- 33 future video frames showing the predicted task execution
Both modalities are denoised in a single forward pass using flow matching, so the model is internally consistent: the predicted actions and the predicted video describe the same imagined rollout.
2× H100 80GB · 72K steps · ~127 hours · rank-4 LoRA · joint flow matching. Final action loss: 0.0015 (166× drop) · final dynamics loss: 0.0298 (6× drop)
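The single-pass joint denoising described above can be sketched as a few Euler integration steps over a shared velocity field. This is a toy illustration, not the actual DreamZero code: `v_theta` stands in for the model's joint velocity prediction (in the real checkpoint, one DiT forward pass that outputs velocities for both video latents and actions).

```python
def joint_euler_denoise(v_theta, video_noise, action_noise, steps=4):
    """Toy sketch of joint flow-matching inference with Euler steps.

    v_theta(video, action, t) is a stand-in for the model: a callable
    returning (d_video, d_action), i.e. velocities for both modalities
    from one forward pass. `steps=4` mirrors the card's inference setting.
    """
    video, action = video_noise, action_noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        dv, da = v_theta(video, action, t)  # one joint forward pass
        video = video + dt * dv             # Euler update toward the data
        action = action + dt * da           # actions ride the same trajectory
    return video, action
```

Because both modalities are updated from the same velocity prediction at every step, the imagined video and the action chunk cannot drift into describing different futures.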
Why a "World Action Model"?
Most robot policies output actions and treat the world as a black box. A World Action Model also predicts what the world will look like during the rollout. This has three useful properties:
- Self-consistency – actions and predicted video share the same denoising trajectory, so the model has to imagine a coherent future
- Interpretability – you can literally watch what the policy "thinks" will happen before sending actions to the robot
- Sim-free evaluation – the predicted video gives you a free imagined rollout you can score offline
Quick Start
```bash
pip install huggingface_hub safetensors torch

# Download base + LoRA
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./checkpoints/Wan2.1-I2V-14B-480P
huggingface-cli download Vizuara/dreamzero-so101-lora --local-dir ./checkpoints/dreamzero-so101-lora

# Clone DreamZero codebase + SO-101 patch repo side by side, then apply the patch
git clone https://github.com/dreamzero0/dreamzero.git
git clone https://github.com/vizuara/dreamzero-so101.git
cd dreamzero && pip install -e .
git apply ../dreamzero-so101/patches/so101_embodiment.patch
cd ..

# Run inference
python dreamzero-so101/scripts/infer_demo.py \
  --model-path ./checkpoints/dreamzero-so101-lora \
  --base-model-path ./checkpoints/Wan2.1-I2V-14B-480P \
  --image ./sample_obs.jpg \
  --prompt "Pick up the red cube and place it in the bowl"
```
Model Details
| | |
|---|---|
| Base model | Wan-AI/Wan2.1-I2V-14B-480P |
| Backbone | DiT, 40 layers, d=5120, 40 heads, 14B params |
| Tokenizers | UMT5-XXL (text) · CLIP ViT-H/14 (image) · WanVAE (video, 4×8×8) |
| Action head | Causal Wan + flow-matching action transformer |
| Action format | Relative joint positions, 6-DOF (padded to 32) |
| State format | Joint positions, 6-DOF (padded to 64) |
| Video resolution | 320 × 176 |
| Frames | 33 RGB → 9 latent (4× temporal compression) |
| Action horizon | 24 steps |
| Inference steps | 4 Euler steps (~600 ms on H100) |
| Trainable params | ~50M (LoRA) + action heads |
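The 6-DOF action and state vectors are padded to the model's fixed 32- and 64-dim slots. A minimal sketch of that conversion, assuming a trailing-zero padding convention (check the patched embodiment config for the checkpoint's actual layout):

```python
import numpy as np

# Joint order from the model card
SO101_JOINTS = ["shoulder_pan", "shoulder_lift", "elbow_flex",
                "wrist_flex", "wrist_roll", "gripper"]

def pad_action(joints, width=32):
    """Zero-pad a 6-DOF relative-joint action into the 32-dim action slot.

    Trailing-zero padding is an assumption about the checkpoint's
    convention, not something stated in the card.
    """
    out = np.zeros(width, dtype=np.float32)
    out[: len(joints)] = joints
    return out

def unpad_action(vec, dof=6):
    """Recover the 6 SO-101 joint values from a padded action vector."""
    return vec[:dof]
```

The same pattern applies to the state vector with `width=64`.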
LoRA Configuration
| | |
|---|---|
| Rank | 4 |
| Alpha | 4 |
| Targets | q, k, v, o, ffn.0, ffn.2 |
| Init | Kaiming |
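With rank = alpha = 4, the LoRA scale alpha/rank is exactly 1. A sketch of how such an adapter folds into a frozen base weight (generic LoRA math, not DreamZero-specific code):

```python
import numpy as np

def merge_lora(w, a, b, alpha=4, rank=4):
    """Fold a LoRA update into a frozen base weight.

    w: (out, in) base weight
    a: (rank, in) down-projection (Kaiming-initialized, per the card)
    b: (out, rank) up-projection (zero at the start of training, so the
       adapter initially leaves the base model unchanged)
    Effective update: (alpha / rank) * b @ a, i.e. scale 1 here.
    """
    return w + (alpha / rank) * (b @ a)
```

The zero-initialized `b` is why training starts from the base model's behavior: `b @ a` is zero until gradients flow.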
Training Recipe
| | |
|---|---|
| Hardware | 2× H100 80GB |
| Optimizer | AdamW (DeepSpeed ZeRO-2) |
| Learning rate | 1e-4, cosine decay |
| Warmup | 5% |
| Weight decay | 1e-5 |
| Batch size | 1 per GPU (effective 2) |
| Precision | bfloat16 + tf32 |
| Gradient checkpointing | Yes |
| Steps trained | 72,000 (loss converged; 100K planned but stopped early) |
| Wall-clock | ~127 hours |
| Dataset | whosricky/so101-megamix-v1 (400 episodes, 8 tasks, 3 cameras) |
| Loss | Joint flow-matching velocity (action + dynamics) with uncertainty weighting |
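"Uncertainty weighting" for a two-term loss is commonly the homoscedastic form of Kendall et al., where learned log-variances balance the action and dynamics terms. Whether DreamZero uses exactly this form is an assumption; this sketch just shows the mechanism:

```python
import math

def uncertainty_weighted_loss(action_loss, dynamics_loss,
                              log_var_a=0.0, log_var_d=0.0):
    """Homoscedastic uncertainty weighting (Kendall et al. style) of the
    two flow-matching velocity losses. The log-variances are learned
    parameters; the +log_var terms keep them from collapsing to -inf.
    This exact form is an assumption about DreamZero's implementation.
    """
    return (math.exp(-log_var_a) * action_loss + log_var_a
            + math.exp(-log_var_d) * dynamics_loss + log_var_d)
```

With both log-variances at 0 this reduces to a plain sum; during training the model can down-weight the noisier term automatically.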
Results
Final Loss
| Metric | Initial | Final | Reduction |
|---|---|---|---|
| Action loss | 0.249 | 0.0015 | 166× |
| Dynamics (video) loss | 0.176 | 0.0298 | 6× |
The dynamics loss converged around step 30K and remained flat. The action loss continued to slowly improve through 70K. Training was halted at step ~72K due to a pod migration; the loss curve indicates the model is well-converged and additional steps would have offered marginal returns.
Files
| File | Size | Description |
|---|---|---|
| `model.safetensors` | 207 MB | LoRA weights + action heads (bf16) |
| `config.json` | 4 KB | Model configuration (architecture, action head, LoRA settings) |
| `loss_log.jsonl` | 917 KB | Per-step training loss (15,912 entries) |
| `training_curve.png` | 220 KB | Training loss visualization |
Intended Use
- ✅ Research & education – studying world models and joint video/action prediction for robotics
- ✅ SO-101 manipulation policy bootstrap – fine-tune further on your own data
- ✅ Offline rollout visualization – predict-before-execute to debug task setups
- ⚠️ Real-robot deployment – possible, but requires safety wrappers and additional fine-tuning on your specific embodiment / camera setup
- ❌ Other arm types – trained only on SO-101; do not expect zero-shot transfer
Limitations
- Trained only on `whosricky/so101-megamix-v1` (400 episodes, 8 tasks). Out-of-distribution objects/scenes will degrade quality.
- Joint trajectories are predicted, not torques – your low-level controller must accept position targets.
- Video predictions are 33-frame snippets (~1 sec at 30 FPS). Longer horizons require chunked rollout.
- LoRA only – for best quality, do a full fine-tune (see the `dreamzero-so101` repo).
- Trained at 320×176 – higher resolutions need re-training.
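Chunked rollout for longer horizons can be sketched as re-conditioning on the last imagined frame. `predict` here is a stand-in for the model call (assumed to return one horizon's frames and actions), not an actual API in the repo:

```python
def chunked_rollout(predict, obs, prompt, chunks=3):
    """Extend past the 33-frame / 24-action horizon by chaining predictions.

    predict(obs, prompt) is a hypothetical wrapper around the model that
    returns (frames, actions) for one horizon; the last imagined frame
    becomes the conditioning image for the next chunk.
    """
    all_frames, all_actions = [], []
    for _ in range(chunks):
        frames, actions = predict(obs, prompt)
        all_frames.extend(frames)
        all_actions.extend(actions)
        obs = frames[-1]  # re-condition on the final imagined frame
    return all_frames, all_actions
```

Note that prediction errors compound across chunks, since each chunk is conditioned on an imagined rather than observed frame.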
Citation
```bibtex
@misc{dreamzero-so101-2026,
  title        = {DreamZero-SO101: A World Action Model for the SO-101 Robot Arm},
  author       = {Vizuara AI Labs},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Vizuara/dreamzero-so101-lora}},
  note         = {LoRA fine-tune of DreamZero on aggregated SO-101 LeRobot datasets}
}
```
Please also cite the underlying work:
- DreamZero – Liu et al., "Learning World Action Models from Video", GEAR Lab, 2025
- Wan2.1 – Wan-AI, "Wan2.1-I2V-14B", 2025
- SO-101 – TheRobotStudio, "SO-100/SO-101 Open-Source Robot Arm"
- LeRobot – HuggingFace, 2024
Acknowledgments
- DreamZero by GEAR Lab – Apache 2.0 codebase
- Wan2.1 – video generation backbone
- LeRobot – dataset format and community
- SO-101 dataset contributors on HuggingFace Hub
License
Apache 2.0 (same as DreamZero)