---
license: apache-2.0
language:
- en
library_name: dreamzero
tags:
- robotics
- world-model
- world-action-model
- so-101
- lerobot
- video-generation
- vla
- flow-matching
- lora
base_model: Wan-AI/Wan2.1-I2V-14B-480P
datasets:
- whosricky/so101-megamix-v1
pipeline_tag: robotics
---

# DreamZero-SO101 (LoRA, 70K steps)

A **World Action Model (WAM)** for the [SO-101 robot arm](https://github.com/TheRobotStudio/SO-ARM100), fine-tuned from [DreamZero](https://github.com/dreamzero0/dreamzero) (Wan2.1-I2V-14B + joint action heads). Given a single camera observation and a natural-language task, it jointly predicts:

- **24 future 6-DOF joint actions** (`shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll, gripper`)
- **33 future video frames** showing the predicted task execution

Both modalities are denoised in a single forward pass using flow matching, so the model is internally consistent: the predicted actions and the predicted video describe the same imagined rollout.

![Training loss curves](training_curve.png)

> **2× H100 80GB · 72K steps · ~127 hours · rank-4 LoRA · joint flow matching**
> Final action loss: **0.0015** (166× drop) · Final dynamics loss: **0.0298** (6× drop)

## Why a "World Action Model"?

Most robot policies output actions and treat the world as a black box. A World Action Model also predicts what the world will *look like* during the rollout. This has three useful properties:

1. **Self-consistency** – actions and predicted video share the same denoising trajectory, so the model has to imagine a coherent future
2. **Interpretability** – you can literally watch what the policy "thinks" will happen before sending actions to the robot
3. **Sim-free evaluation** – the predicted video gives you a free imagined rollout you can score offline

## Quick Start

```bash
pip install huggingface_hub safetensors torch

# Clone the DreamZero codebase and the SO-101 patch repo side by side
git clone https://github.com/dreamzero0/dreamzero.git
git clone https://github.com/vizuara/dreamzero-so101.git

# Install DreamZero and apply the SO-101 embodiment patch
cd dreamzero && pip install -e .
git apply ../dreamzero-so101/patches/so101_embodiment.patch

# Download base + LoRA
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./checkpoints/Wan2.1-I2V-14B-480P
huggingface-cli download Vizuara/dreamzero-so101-lora --local-dir ./checkpoints/dreamzero-so101-lora

# Run inference
python ../dreamzero-so101/scripts/infer_demo.py \
  --model-path ./checkpoints/dreamzero-so101-lora \
  --base-model-path ./checkpoints/Wan2.1-I2V-14B-480P \
  --image ./sample_obs.jpg \
  --prompt "Pick up the red cube and place it in the bowl"
```
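
If you prefer to stay in Python, the same checkpoints can be fetched with `huggingface_hub` (equivalent to the CLI downloads above, run from the same directory):

```python
from huggingface_hub import snapshot_download

# Download the Wan2.1 base model and the SO-101 LoRA into ./checkpoints/
snapshot_download("Wan-AI/Wan2.1-I2V-14B-480P", local_dir="./checkpoints/Wan2.1-I2V-14B-480P")
snapshot_download("Vizuara/dreamzero-so101-lora", local_dir="./checkpoints/dreamzero-so101-lora")
```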

## Model Details

| | |
|---|---|
| **Base model** | [Wan-AI/Wan2.1-I2V-14B-480P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) |
| **Backbone** | DiT, 40 layers, d=5120, 40 heads, 14B params |
| **Tokenizers** | UMT5-XXL (text) · CLIP ViT-H/14 (image) · WanVAE (video, 4×8×8) |
| **Action head** | Causal Wan + flow-matching action transformer |
| **Action format** | Relative joint positions, 6-DOF (padded to 32) |
| **State format** | Joint positions, 6-DOF (padded to 64) |
| **Video resolution** | 320 × 176 |
| **Frames** | 33 RGB → 9 latent (4× temporal compression) |
| **Action horizon** | 24 steps |
| **Inference steps** | 4 Euler steps (~600 ms on H100) |
| **Trainable params** | ~50M (LoRA) + action heads |
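
To make the joint denoising concrete, here is a minimal sampling sketch using the shapes and the 4 Euler steps listed above. The `wam(...)` callable, its argument order, and the 16-channel latent size are assumptions for illustration, not the actual DreamZero API:

```python
import torch

def sample_rollout(wam, obs_cond, text_emb, num_steps=4):
    """Joint flow-matching sampling: video latents and actions share one denoising trajectory."""
    video = torch.randn(1, 16, 9, 22, 40)   # (B, C, T, H, W): 9 latent frames, 320x176 / 8; channels assumed
    actions = torch.randn(1, 24, 32)         # 24-step horizon, 6-DOF padded to 32
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)
        # One forward pass predicts velocities for both modalities at the same time t.
        v_video, v_action = wam(video, actions, obs_cond, text_emb, t)
        video = video + dt * v_video          # Euler update on video latents
        actions = actions + dt * v_action     # Euler update on joint actions
    return video, actions
```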

### LoRA Configuration

| | |
|---|---|
| Rank | 4 |
| Alpha | 4 |
| Targets | `q, k, v, o, ffn.0, ffn.2` |
| Init | Kaiming |
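
For reference, a generic adapter matching this configuration looks roughly like the sketch below (Kaiming-initialized A, zero-initialized B, scale = alpha/rank). This is an illustrative re-implementation, not the DreamZero code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a rank-4 low-rank update (alpha = 4)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze the pretrained projection
        self.lora_a = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.lora_a)        # Kaiming init, as in the table
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

Applied to the `q, k, v, o` projections and the two FFN linears of each DiT block, rank 4 keeps the adapter a small fraction of the 14B backbone, consistent with the trainable-parameter count above.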

## Training Recipe

| | |
|---|---|
| Hardware | 2× H100 80GB |
| Optimizer | AdamW (DeepSpeed ZeRO-2) |
| Learning rate | 1e-4, cosine decay |
| Warmup | 5% |
| Weight decay | 1e-5 |
| Batch size | 1 per GPU (effective 2) |
| Precision | bfloat16 + tf32 |
| Gradient checkpointing | Yes |
| Steps trained | 72,000 (loss converged; 100K planned but stopped early) |
| Wall-clock | ~127 hours |
| Dataset | [`whosricky/so101-megamix-v1`](https://huggingface.co/datasets/whosricky/so101-megamix-v1) (400 episodes, 8 tasks, 3 cameras) |
| Loss | Joint flow-matching velocity (action + dynamics) with uncertainty weighting |
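
The "uncertainty weighting" in the loss row is commonly implemented with learned log-variances that balance the two velocity losses; a sketch of that form is below. The exact weighting used in DreamZero may differ, so treat this as an assumption:

```python
import torch
import torch.nn as nn

class JointFlowMatchingLoss(nn.Module):
    """Action + dynamics velocity losses combined with learned uncertainty weights."""
    def __init__(self):
        super().__init__()
        self.log_var_action = nn.Parameter(torch.zeros(()))
        self.log_var_dyn = nn.Parameter(torch.zeros(()))

    def forward(self, v_action_pred, v_action_tgt, v_video_pred, v_video_tgt):
        # Flow-matching velocity regression for each modality (target is x1 - x0).
        l_action = torch.mean((v_action_pred - v_action_tgt) ** 2)
        l_dyn = torch.mean((v_video_pred - v_video_tgt) ** 2)
        return (torch.exp(-self.log_var_action) * l_action + self.log_var_action
                + torch.exp(-self.log_var_dyn) * l_dyn + self.log_var_dyn)
```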

## Results

### Final Loss

| Metric | Initial | Final | Reduction |
|---|---|---|---|
| Action loss | 0.249 | **0.0015** | 166× |
| Dynamics (video) loss | 0.176 | **0.0298** | 6× |

The dynamics loss converged around step 30K and remained flat. The action loss continued to improve slowly through 70K. Training was halted at step ~72K due to a pod migration; the loss curve indicates the model is well converged and additional steps would have offered marginal returns.

## Files

| File | Size | Description |
|---|---|---|
| `model.safetensors` | 207 MB | LoRA weights + action heads (bf16) |
| `config.json` | 4 KB | Model configuration (architecture, action head, LoRA settings) |
| `loss_log.jsonl` | 917 KB | Per-step training loss (15,912 entries) |
| `training_curve.png` | 220 KB | Training loss visualization |
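
`loss_log.jsonl` has one JSON object per logged entry; a minimal way to inspect it is shown below. The field names (`step`, `action_loss`, `dynamics_loss`) are assumed here and may differ from the actual keys in the file:

```python
import json

# Load every logged entry from the training loss log.
with open("loss_log.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

last = records[-1]
print("logged entries:", len(records))
print("final action loss:", last.get("action_loss"), "dynamics loss:", last.get("dynamics_loss"))
```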

## Intended Use

- ✅ **Research & education** – studying world models and joint video/action prediction for robotics
- ✅ **SO-101 manipulation policy bootstrap** – fine-tune further on your own data
- ✅ **Offline rollout visualization** – predict-before-execute to debug task setups
- ⚠️ **Real-robot deployment** – possible but requires safety wrappers and additional fine-tuning on your specific embodiment / camera setup
- ❌ **Other arm types** – trained only on SO-101; do not expect zero-shot transfer

## Limitations

- Trained only on `whosricky/so101-megamix-v1` (400 episodes, 8 tasks). Out-of-distribution objects/scenes will degrade quality.
- Joint trajectories are predicted, not torques – your low-level controller must accept position targets.
- Video predictions are 33-frame snippets (~1 sec at 30 FPS). Longer horizons require chunked rollout (see the sketch below).
- LoRA only – for best quality, do a full fine-tune (see the [`dreamzero-so101`](https://github.com/vizuara/dreamzero-so101) repo).
- Trained at 320×176 – higher resolutions need re-training.
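
A chunked (receding-horizon) rollout simply re-queries the model after each 24-step chunk. The sketch below is illustrative only; `predict_chunk` and the `robot` interface are placeholders, not part of this repository:

```python
def chunked_rollout(robot, predict_chunk, prompt, num_chunks=5, horizon=24):
    """Execute a longer task by repeatedly predicting and executing 24-step chunks."""
    for _ in range(num_chunks):
        obs = robot.get_observation()                  # latest camera frame + joint state
        actions, _video = predict_chunk(obs, prompt)   # 24 future joint-position targets
        for a in actions[:horizon]:
            robot.send_joint_positions(a)              # low-level controller tracks targets
```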

## Citation

```bibtex
@misc{dreamzero-so101-2026,
  title = {DreamZero-SO101: A World Action Model for the SO-101 Robot Arm},
  author = {Vizuara AI Labs},
  year = {2026},
  howpublished = {\url{https://huggingface.co/Vizuara/dreamzero-so101-lora}},
  note = {LoRA fine-tune of DreamZero on aggregated SO-101 LeRobot datasets}
}
```

Please also cite the underlying work:

- **DreamZero** – Liu et al., "Learning World Action Models from Video", GEAR Lab, 2025
- **Wan2.1** – Wan-AI, "Wan2.1-I2V-14B", 2025
- **SO-101** – TheRobotStudio, "SO-100/SO-101 Open-Source Robot Arm"
- **LeRobot** – HuggingFace, 2024

## Acknowledgments

- [DreamZero](https://github.com/dreamzero0/dreamzero) by GEAR Lab – Apache 2.0 codebase
- [Wan2.1](https://github.com/Wan-Video/Wan2.1) – video generation backbone
- [LeRobot](https://github.com/huggingface/lerobot) – dataset format and community
- SO-101 dataset contributors on the HuggingFace Hub

## License

Apache 2.0 (same as DreamZero)