DreamZero-SO101 (LoRA, 72K steps)

A World Action Model (WAM) for the SO-101 robot arm, fine-tuned from DreamZero (Wan2.1-I2V-14B + joint action heads). Given a single camera observation and a natural-language task, it jointly predicts:

  • 24 future 6-DOF joint actions (shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll, gripper)
  • 33 future video frames showing the predicted task execution

Both modalities are denoised in a single forward pass using flow matching, so the model is internally consistent — the predicted actions and the predicted video describe the same imagined rollout.
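Concretely, the joint denoising can be sketched as one Euler integration shared by both latent sets. This is a minimal numpy illustration with a toy velocity model, not the actual DreamZero sampler; shapes follow the card (9 latent frames, 24 actions):

```python
import numpy as np

def euler_denoise(model, x_video, x_action, num_steps=4):
    """Jointly denoise video and action latents along a shared
    flow-matching schedule (Euler integration from t=1 noise to t=0 data)."""
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        # One forward pass yields velocities for BOTH modalities, so the
        # imagined video and the predicted actions stay consistent.
        v_video, v_action = model(x_video, x_action, t)
        dt = t_next - t                     # negative: integrating toward t=0
        x_video = x_video + dt * v_video
        x_action = x_action + dt * v_action
    return x_video, x_action

# Toy stand-in "model" whose velocity is the current state (illustration only):
toy = lambda xv, xa, t: (xv, xa)
video, actions = euler_denoise(toy, np.ones((9, 16)), np.ones((24, 6)))
```

With four steps this matches the card's "4 Euler steps" inference budget; the real model's forward pass replaces `toy`.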

Training curve

2× H100 80GB · 72K steps · ~127 hours · rank-4 LoRA · joint flow matching
Final action loss: 0.0015 (166× drop) · Final dynamics loss: 0.0298 (6× drop)

Why a "World Action Model"?

Most robot policies output actions and treat the world as a black box. A World Action Model also predicts what the world will look like during the rollout. This has three useful properties:

  1. Self-consistency — actions and predicted video share the same denoising trajectory, so the model has to imagine a coherent future
  2. Interpretability — you can literally watch what the policy "thinks" will happen before sending actions to the robot
  3. Sim-free evaluation — the predicted video gives you a free imagined rollout you can score offline

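For the third point, one hypothetical offline scorer compares each predicted frame against a goal image. Everything here (the cosine-similarity metric, the function name) is an illustrative assumption, not part of the released code:

```python
import numpy as np

def score_rollout(pred_frames, goal_image):
    """Hypothetical offline score: mean cosine similarity between each
    predicted frame and a goal image (higher = rollout stays nearer the goal)."""
    g = goal_image.ravel().astype(np.float64)
    g /= np.linalg.norm(g)
    sims = []
    for f in pred_frames:
        v = f.ravel().astype(np.float64)
        sims.append(v @ g / np.linalg.norm(v))
    return float(np.mean(sims))

# Example: score a (random stand-in) 33-frame predicted rollout against
# its own final frame as the "goal" scene.
frames = np.random.rand(33, 176, 320, 3)
goal = frames[-1]
score = score_rollout(frames, goal)
```

In practice one would swap the cosine metric for a learned reward or success classifier; the point is only that the predicted video can be scored without touching a simulator or robot.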
Quick Start

pip install huggingface_hub safetensors torch

# Download base + LoRA
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./checkpoints/Wan2.1-I2V-14B-480P
huggingface-cli download Vizuara/dreamzero-so101-lora --local-dir ./checkpoints/dreamzero-so101-lora

# Clone DreamZero codebase + the SO-101 patch repo side by side
git clone https://github.com/dreamzero0/dreamzero.git
git clone https://github.com/vizuara/dreamzero-so101.git
cd dreamzero && pip install -e .
git apply ../dreamzero-so101/patches/so101_embodiment.patch

# Run inference
python ../dreamzero-so101/scripts/infer_demo.py \
  --model-path ./checkpoints/dreamzero-so101-lora \
  --base-model-path ./checkpoints/Wan2.1-I2V-14B-480P \
  --image ./sample_obs.jpg \
  --prompt "Pick up the red cube and place it in the bowl"

Model Details

Base model         Wan-AI/Wan2.1-I2V-14B-480P
Backbone           DiT, 40 layers, d=5120, 40 heads, 14B params
Tokenizers         UMT5-XXL (text) · CLIP ViT-H/14 (image) · WanVAE (video, 4×8×8)
Action head        Causal Wan + flow-matching action transformer
Action format      Relative joint positions, 6-DOF (padded to 32)
State format       Joint positions, 6-DOF (padded to 64)
Video resolution   320 × 176
Frames             33 RGB → 9 latent (4× temporal compression)
Action horizon     24 steps
Inference steps    4 Euler steps (~600 ms on H100)
Trainable params   ~50M (LoRA) + action heads
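The padded action/state widths mean the 6 real SO-101 channels live inside wider model tensors. A sketch of the pad/unpad round trip, assuming zero padding in the trailing channels (the exact padding convention is an assumption, not confirmed by the card):

```python
import numpy as np

ACTION_DIM, ACTION_PAD = 6, 32   # 6 SO-101 joints, padded model width (see table)

def pad_actions(actions):
    """Zero-pad a (T, 6) relative joint-action chunk to the model's (T, 32) width."""
    T, d = actions.shape
    assert d == ACTION_DIM
    out = np.zeros((T, ACTION_PAD), dtype=actions.dtype)
    out[:, :d] = actions
    return out

def unpad_actions(padded):
    """Recover the 6 real joint channels from a (T, 32) model prediction."""
    return padded[:, :ACTION_DIM]

chunk = np.random.randn(24, 6).astype(np.float32)   # one 24-step action horizon
padded = pad_actions(chunk)
```

The state vector would be handled the same way with a width of 64.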

LoRA Configuration

Rank      4
Alpha     4
Targets   q, k, v, o, ffn.0, ffn.2
Init      Kaiming
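The table above, expressed as a peft-style config dict for reference. This is a sketch of the settings, not DreamZero's actual configuration schema, and the module-path strings assume the Wan attention/FFN naming:

```python
# LoRA settings from the table above, in peft-style key names (assumed schema).
lora_config = {
    "r": 4,                 # rank
    "lora_alpha": 4,        # scaling alpha (alpha/r = 1.0)
    "target_modules": ["q", "k", "v", "o", "ffn.0", "ffn.2"],
    "init_lora_weights": "kaiming",
}
```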

Training Recipe

Hardware                 2× H100 80GB
Optimizer                AdamW (DeepSpeed ZeRO-2)
Learning rate            1e-4, cosine decay
Warmup                   5%
Weight decay             1e-5
Batch size               1 per GPU (effective 2)
Precision                bfloat16 + tf32
Gradient checkpointing   Yes
Steps trained            72,000 (loss converged; 100K planned but stopped early)
Wall-clock               ~127 hours
Dataset                  whosricky/so101-megamix-v1 (400 episodes, 8 tasks, 3 cameras)
Loss                     Joint flow-matching velocity (action + dynamics) with uncertainty weighting
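One common reading of "uncertainty weighting" is homoscedastic task weighting in the style of Kendall et al., with learned log-variances balancing the two flow-matching terms. This is an assumption about the exact formulation, sketched here:

```python
import math

def joint_loss(action_loss, dynamics_loss, log_var_a, log_var_d):
    """Uncertainty-weighted sum of the two flow-matching velocity losses.
    log_var_a / log_var_d are learned scalars; exp(-log_var) down-weights
    the noisier task while the +log_var term stops the weights collapsing."""
    return (math.exp(-log_var_a) * action_loss + log_var_a
            + math.exp(-log_var_d) * dynamics_loss + log_var_d)
```

With both log-variances at 0 this reduces to a plain sum of the two losses; during training the scalars are optimized jointly with the model.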

Results

Final Loss

Metric                  Initial   Final    Reduction
Action loss             0.249     0.0015   166×
Dynamics (video) loss   0.176     0.0298   6×

The dynamics loss converged around step 30K and remained flat, while the action loss continued to improve slowly through 70K. Training was halted at step ~72K due to a pod migration; the loss curves indicate the model is well converged, and additional steps would likely have yielded only marginal gains.

Files

File                 Size     Description
model.safetensors    207 MB   LoRA weights + action heads (bf16)
config.json          4 KB     Model configuration (architecture, action head, LoRA settings)
loss_log.jsonl       917 KB   Per-step training loss (15,912 entries)
training_curve.png   220 KB   Training loss visualization

Intended Use

  • βœ… Research & education β€” studying world models and joint video/action prediction for robotics
  • βœ… SO-101 manipulation policy bootstrap β€” fine-tune further on your own data
  • βœ… Offline rollout visualization β€” predict-before-execute to debug task setups
  • ⚠️ Real-robot deployment β€” possible but requires safety wrappers and additional fine-tuning on your specific embodiment / camera setup
  • ❌ Other arm types β€” trained only on SO-101; do not expect zero-shot transfer

Limitations

  • Trained only on whosricky/so101-megamix-v1 (400 episodes, 8 tasks). Out-of-distribution objects/scenes will degrade quality.
  • Joint trajectories are predicted, not torques β€” your low-level controller must accept position targets.
  • Video predictions are 33-frame snippets (~1 sec at 30 FPS). Longer horizons require chunked rollout.
  • LoRA only β€” for best quality, do a full fine-tune (see dreamzero-so101 repo).
  • Trained at 320Γ—176 β€” higher resolutions need re-training.

Citation

@misc{dreamzero-so101-2026,
  title  = {DreamZero-SO101: A World Action Model for the SO-101 Robot Arm},
  author = {Vizuara AI Labs},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Vizuara/dreamzero-so101-lora}},
  note   = {LoRA fine-tune of DreamZero on aggregated SO-101 LeRobot datasets}
}

Please also cite the underlying work:

  • DreamZero β€” Liu et al., "Learning World Action Models from Video", GEAR Lab, 2025
  • Wan2.1 β€” Wan-AI, "Wan2.1-I2V-14B", 2025
  • SO-101 β€” TheRobotStudio, "SO-100/SO-101 Open-Source Robot Arm"
  • LeRobot β€” HuggingFace, 2024

Acknowledgments

  • DreamZero by GEAR Lab β€” Apache 2.0 codebase
  • Wan2.1 β€” video generation backbone
  • LeRobot β€” dataset format and community
  • SO-101 dataset contributors on HuggingFace Hub

License

Apache 2.0 (same as DreamZero)
