---
license: apache-2.0
language:
- en
library_name: dreamzero
tags:
- robotics
- world-model
- world-action-model
- so-101
- lerobot
- video-generation
- vla
- flow-matching
- lora
base_model: Wan-AI/Wan2.1-I2V-14B-480P
datasets:
- whosricky/so101-megamix-v1
pipeline_tag: robotics
---

# DreamZero-SO101 (LoRA, 72K steps)

A **World Action Model (WAM)** for the [SO-101 robot arm](https://github.com/TheRobotStudio/SO-ARM100), fine-tuned from [DreamZero](https://github.com/dreamzero0/dreamzero) (Wan2.1-I2V-14B + joint action heads). Given a single camera observation and a natural-language task, it jointly predicts:

- **24 future 6-DOF joint actions** (`shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll, gripper`)
- **33 future video frames** showing the predicted task execution

Both modalities are denoised in a single forward pass using flow matching, so the model is internally consistent: the predicted actions and the predicted video describe the same imagined rollout.
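
The joint denoising above can be sketched numerically. This toy (not DreamZero's actual code) flattens the action chunk and a stand-in video latent into one state vector, assumes the ideal constant velocity of a linear flow, and integrates it with 4 Euler steps to match the inference setting; all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint latent: a 24x6 action chunk plus a stand-in video latent,
# flattened and denoised together as a single state vector.
z1 = rng.normal(size=24 * 6 + 9 * 16)   # "data" end of the flow
z0 = rng.normal(size=z1.shape)          # noise end of the flow

def velocity(z, t):
    # For a linear flow z_t = (1 - t) * z0 + t * z1 the ideal velocity
    # is the constant z1 - z0; the trained model would predict it from
    # (z, t, camera image, text prompt) instead.
    return z1 - z0

# 4 Euler steps from noise to sample, as in the card's inference setting.
z, t = z0.copy(), 0.0
for _ in range(4):
    z = z + 0.25 * velocity(z, t)
    t += 0.25

actions = z[: 24 * 6].reshape(24, 6)    # first block: the action chunk
assert np.allclose(z, z1)               # constant velocity: Euler is exact
```

Because actions and video share one trajectory, a single integrator step advances both predictions together.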

![Training curve](./training_curve.png)

> **2× H100 80GB · 72K steps · ~127 hours · rank-4 LoRA · joint flow matching**
> Final action loss: **0.0015** (166× drop) · Final dynamics loss: **0.0298** (6× drop)

## Why a "World Action Model"?

Most robot policies output actions and treat the world as a black box. A World Action Model also predicts what the world will *look like* during the rollout. This has three useful properties:

1. **Self-consistency**: actions and predicted video share the same denoising trajectory, so the model has to imagine a coherent future
2. **Interpretability**: you can literally watch what the policy "thinks" will happen before sending actions to the robot
3. **Sim-free evaluation**: the predicted video gives you a free imagined rollout you can score offline

## Quick Start

```bash
pip install huggingface_hub safetensors torch

# Download base + LoRA
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./checkpoints/Wan2.1-I2V-14B-480P
huggingface-cli download Vizuara/dreamzero-so101-lora --local-dir ./checkpoints/dreamzero-so101-lora

# Clone DreamZero codebase + apply SO-101 patch
git clone https://github.com/dreamzero0/dreamzero.git
git clone https://github.com/vizuara/dreamzero-so101.git
cd dreamzero
pip install -e .
git apply ../dreamzero-so101/patches/so101_embodiment.patch

# Run inference (from inside dreamzero/, so checkpoints sit one level up)
python ../dreamzero-so101/scripts/infer_demo.py \
    --model-path ../checkpoints/dreamzero-so101-lora \
    --base-model-path ../checkpoints/Wan2.1-I2V-14B-480P \
    --image ./sample_obs.jpg \
    --prompt "Pick up the red cube and place it in the bowl"
```

## Model Details

| | |
|---|---|
| **Base model** | [Wan-AI/Wan2.1-I2V-14B-480P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) |
| **Backbone** | DiT, 40 layers, d=5120, 40 heads, 14B params |
| **Tokenizers** | UMT5-XXL (text) · CLIP ViT-H/14 (image) · WanVAE (video, 4×8×8) |
| **Action head** | Causal Wan + flow-matching action transformer |
| **Action format** | Relative joint positions, 6-DOF (padded to 32) |
| **State format** | Joint positions, 6-DOF (padded to 64) |
| **Video resolution** | 320 × 176 |
| **Frames** | 33 RGB → 9 latent (4× temporal compression) |
| **Action horizon** | 24 steps |
| **Inference steps** | 4 Euler steps (~600 ms on H100) |
| **Trainable params** | ~50 M (LoRA) + action heads |
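
The padded action/state formats in the table can be reproduced with a zero-pad on the trailing axis. Right-padding with zeros is an assumption here; `config.json` in this repo is authoritative for the exact scheme.

```python
import numpy as np

ACTION_DIM, ACTION_PAD = 6, 32   # 6-DOF actions, padded to 32
STATE_DIM, STATE_PAD = 6, 64     # 6-DOF state, padded to 64

def pad_to(x, width):
    """Zero-pad the last axis of x out to `width` (assumed right-padding)."""
    return np.pad(x, [(0, 0)] * (x.ndim - 1) + [(0, width - x.shape[-1])])

chunk = np.random.randn(24, ACTION_DIM)   # one 24-step action horizon
state = np.random.randn(1, STATE_DIM)

assert pad_to(chunk, ACTION_PAD).shape == (24, 32)
assert pad_to(state, STATE_PAD).shape == (1, 64)
```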

### LoRA Configuration

| | |
|---|---|
| Rank | 4 |
| Alpha | 4 |
| Targets | `q, k, v, o, ffn.0, ffn.2` |
| Init | Kaiming |
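
A minimal NumPy sketch of what these settings mean for a single adapted linear layer: rank-4 factors scaled by alpha/rank, a Kaiming-style init on `A`, and zeros on `B` so the adapter starts as a no-op. The layer width matches the DiT table above; everything else is illustrative, not DreamZero's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in = d_out = 5120          # DiT width from the Model Details table
rank, alpha = 4, 4           # rank-4 LoRA, alpha 4 -> scale alpha/rank = 1.0

W = rng.normal(size=(d_out, d_in)) * 0.02                 # frozen base weight
A = rng.normal(size=(rank, d_in)) * np.sqrt(2.0 / d_in)   # Kaiming-style init
B = np.zeros((d_out, rank))                               # zero init: no-op start

def lora_linear(x):
    # y = W x + (alpha / rank) * B A x  -- only A and B are trained
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_linear(x), W @ x)   # untrained adapter changes nothing
```

The zero-initialized `B` is what lets fine-tuning start exactly from the base model's behavior.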

## Training Recipe

| | |
|---|---|
| Hardware | 2× H100 80GB |
| Optimizer | AdamW (DeepSpeed ZeRO-2) |
| Learning rate | 1e-4, cosine decay |
| Warmup | 5% |
| Weight decay | 1e-5 |
| Batch size | 1 per GPU (effective 2) |
| Precision | bfloat16 + tf32 |
| Gradient checkpointing | Yes |
| Steps trained | 72,000 (loss converged; 100K planned but stopped early) |
| Wall-clock | ~127 hours |
| Dataset | [`whosricky/so101-megamix-v1`](https://huggingface.co/datasets/whosricky/so101-megamix-v1) (400 episodes, 8 tasks, 3 cameras) |
| Loss | Joint flow-matching velocity (action + dynamics) with uncertainty weighting |
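
One common form of uncertainty weighting (Kendall et al.) learns a log-variance scalar per loss term; whether DreamZero uses exactly this form is an assumption, but it illustrates how the action and dynamics terms can be balanced automatically.

```python
import numpy as np

def uncertainty_weighted(loss_action, loss_dynamics, s_action, s_dynamics):
    """Kendall-style weighting: L = sum_i exp(-s_i) * L_i + s_i,
    where s_i = log(sigma_i^2) are learned scalars (plain floats here)."""
    return (np.exp(-s_action) * loss_action + s_action
            + np.exp(-s_dynamics) * loss_dynamics + s_dynamics)

# With s_i = 0 the weighting reduces to a plain sum of the two losses.
assert np.isclose(uncertainty_weighted(0.0015, 0.0298, 0.0, 0.0), 0.0313)
```

During training the `s_i` would be optimized jointly with the model, letting the noisier term down-weight itself.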

## Results

### Final Loss

| Metric | Initial | Final | Reduction |
|---|---|---|---|
| Action loss | 0.249 | **0.0015** | 166× |
| Dynamics (video) loss | 0.176 | **0.0298** | 6× |

The dynamics loss converged around step 30K and remained flat, while the action loss continued to improve slowly through 70K. Training was halted at step ~72K due to a pod migration; the loss curve indicates the model is well converged and that additional steps would have offered marginal returns.

## Files

| File | Size | Description |
|---|---|---|
| `model.safetensors` | 207 MB | LoRA weights + action heads (bf16) |
| `config.json` | 4 KB | Model configuration (architecture, action head, LoRA settings) |
| `loss_log.jsonl` | 917 KB | Per-step training loss (15,912 entries) |
| `training_curve.png` | 220 KB | Training loss visualization |

## Intended Use

- ✅ **Research & education**: studying world models and joint video/action prediction for robotics
- ✅ **SO-101 manipulation policy bootstrap**: fine-tune further on your own data
- ✅ **Offline rollout visualization**: predict-before-execute to debug task setups
- ⚠️ **Real-robot deployment**: possible, but requires safety wrappers and additional fine-tuning on your specific embodiment / camera setup
- ❌ **Other arm types**: trained only on SO-101; do not expect zero-shot transfer

## Limitations

- Trained only on `whosricky/so101-megamix-v1` (400 episodes, 8 tasks). Out-of-distribution objects and scenes will degrade quality.
- Joint trajectories are predicted, not torques; your low-level controller must accept position targets.
- Video predictions are 33-frame snippets (~1 sec at 30 FPS). Longer horizons require chunked rollout.
- LoRA only; for best quality, do a full fine-tune (see the [`dreamzero-so101`](https://github.com/vizuara/dreamzero-so101) repo).
- Trained at 320×176; higher resolutions need re-training.
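
Chunked rollout for longer horizons usually means a receding-horizon loop: predict a 24-step chunk, execute a prefix, re-observe, repeat. The sketch below uses hypothetical `predict`/`execute` stubs, not this repo's API; the chunk split is a tunable choice.

```python
# Hypothetical receding-horizon loop over 24-step chunks. `predict` and
# `execute` are illustrative stubs standing in for the model and the robot.
HORIZON, EXECUTE_STEPS = 24, 12

def predict(obs, prompt):
    # Stand-in for the model: returns a (HORIZON x 6) action chunk.
    return [[0.0] * 6 for _ in range(HORIZON)]

def execute(actions, obs):
    # Stand-in for the robot interface: applies actions, returns a new obs.
    return obs + len(actions)

obs, executed = 0, []
for _ in range(3):                              # 3 chunks of 12 executed steps
    chunk = predict(obs, "pick up the red cube")
    obs = execute(chunk[:EXECUTE_STEPS], obs)   # execute only the prefix
    executed.extend(chunk[:EXECUTE_STEPS])

assert len(executed) == 3 * EXECUTE_STEPS
```

Executing only a prefix before re-observing keeps the policy closed-loop instead of committing to a full open-loop chunk.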

## Citation

```bibtex
@misc{dreamzero-so101-2026,
  title        = {DreamZero-SO101: A World Action Model for the SO-101 Robot Arm},
  author       = {Vizuara AI Labs},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Vizuara/dreamzero-so101-lora}},
  note         = {LoRA fine-tune of DreamZero on aggregated SO-101 LeRobot datasets}
}
```

Please also cite the underlying work:

- **DreamZero**: Liu et al., "Learning World Action Models from Video", GEAR Lab, 2025
- **Wan2.1**: Wan-AI, "Wan2.1-I2V-14B", 2025
- **SO-101**: TheRobotStudio, "SO-100/SO-101 Open-Source Robot Arm"
- **LeRobot**: HuggingFace, 2024

## Acknowledgments

- [DreamZero](https://github.com/dreamzero0/dreamzero) by GEAR Lab: Apache 2.0 codebase
- [Wan2.1](https://github.com/Wan-Video/Wan2.1): video generation backbone
- [LeRobot](https://github.com/huggingface/lerobot): dataset format and community
- SO-101 dataset contributors on the HuggingFace Hub

## License

Apache 2.0 (same as DreamZero)