---
license: apache-2.0
library_name: lerobot
tags:
- robotics
- vla
- smolvla
- flow-matching
- dom
---
# SmolVLA-MultiFrame on DOM: final, 10 epochs (step 735,270)
Multi-frame **SmolVLA** fine-tuned on the **DOM** (Dynamic Object Manipulation) dataset.
Root holds the **final** checkpoint: **10 full epochs** of DOM (step 735,270 = 10 × 73,527), final loss ≈ 0.0015.
Training completed and auto-stopped at the 10-epoch target.
## What this is
- **Backbone:** `lerobot/smolvla_base`, i.e. SmolVLM2-500M-Video-Instruct (SigLIP vision + SmolLM2)
+ flow-matching action expert. VL-aligned + robot-pretrained.
- **Training:** **full fine-tune** (403M / 450M trainable, vision encoder unfrozen) on `hzxie/DOM`
(Franka, cameras `opst_cam` + `wrist_cam`, state 6-d, action 7-d, chunk 50).
- **Multi-frame:** temporal window **{t-2, t}** (`DELTA_TIMESTAMPS observation: [-2, 0]`): each frame is
fed to SmolVLM2 as a **separate image** so the model perceives object motion (DOM is dynamic).
- **Setup:** 8×H200, global batch 640 (40 × grad_accum 2 × 8), AdamW lr 1e-4, cosine + 1000 warmup, bf16.
~12 days wall-clock for 10 epochs.
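
The batch and schedule figures in the Setup bullet can be sanity-checked with a short sketch. The card only states "cosine + 1000 warmup" with a peak lr of 1e-4, so the linear warmup shape, the decay to zero, and the use of step 735,270 as the decay horizon are all assumptions here, not the actual training configuration.

```python
import math

PEAK_LR = 1e-4          # from the card
WARMUP_STEPS = 1_000    # from the card
TOTAL_STEPS = 735_270   # assumption: decay horizon equals the final step
MIN_LR = 0.0            # assumption: cosine decays all the way to zero


def global_batch(per_gpu: int = 40, grad_accum: int = 2, num_gpus: int = 8) -> int:
    """Effective batch size per optimizer step."""
    return per_gpu * grad_accum * num_gpus


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


print(global_batch())          # 640
print(lr_at(WARMUP_STEPS))     # peak lr, reached right after warmup
```
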
## ⚠️ Important: load with MultiFrameSmolVLAPolicy
`config.json` has `type: "smolvla"`, but this checkpoint was trained to consume **two frames per camera**.
Loading it with the stock `SmolVLAPolicy` uses **only the last frame** (single-frame) and loses the
multi-frame behavior. For correct inference use **`MultiFrameSmolVLAPolicy`** and feed a 2-frame window:
```python
# from the repo branch below: policies/smolvla_multiframe.py
from policies.smolvla_multiframe import MultiFrameSmolVLAPolicy
policy = MultiFrameSmolVLAPolicy.from_pretrained("mickeykang/smolvla-multiframe-DOM")
policy.eval().cuda()
# observation images must be (B, T=2, C, H, W) per camera (frames t-2 and t),
# matching DELTA_TIMESTAMPS observation: [-2, 0].
```
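
At rollout time the caller has to assemble the 2-frame window itself. The following is a minimal sketch of one way to do that, using a hypothetical `FrameWindow` helper that is not part of the repo. It assumes the offsets `[-2, 0]` are frame indices relative to the current step, and that early steps of an episode repeat the oldest available frame; neither detail is confirmed by this card.

```python
from collections import deque
from typing import Deque, Generic, List, Sequence, TypeVar

Frame = TypeVar("Frame")


class FrameWindow(Generic[Frame]):
    """Hypothetical helper: buffers recent frames and returns those at given offsets."""

    def __init__(self, offsets: Sequence[int] = (-2, 0)) -> None:
        if any(o > 0 for o in offsets):
            raise ValueError("offsets must be <= 0 (past or current frames)")
        self.offsets = tuple(offsets)
        # Keep just enough history to reach the oldest offset.
        self._buf: Deque[Frame] = deque(maxlen=1 - min(self.offsets))

    def reset(self) -> None:
        """Call at the start of every episode."""
        self._buf.clear()

    def push(self, frame: Frame) -> List[Frame]:
        """Add the newest frame and return the frames at the configured offsets."""
        self._buf.append(frame)
        last = len(self._buf) - 1
        # Clamp to index 0 so early steps repeat the oldest available frame
        # (an assumption about episode-start padding).
        return [self._buf[max(0, last + o)] for o in self.offsets]


window = FrameWindow[str]()
for name in ["f0", "f1", "f2", "f3"]:
    print(window.push(name))
```

Keep one window per camera (`opst_cam`, `wrist_cam`). With image tensors of shape `(B, C, H, W)` as frames, stacking the returned pair along a new time axis, for example `torch.stack(frames, dim=1)`, gives the `(B, T=2, C, H, W)` layout described above.
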
Normalization buffers (state/action mean+std) are baked into `model.safetensors` (no inf/nan),
so no dataset is needed to load/eval.
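
For readers unfamiliar with mean/std normalization buffers, the sketch below shows the usual arithmetic: inputs are standardized before the model, and predicted actions are mapped back afterwards. The function names, the `EPS` constant, and the numbers are illustrative assumptions; the policy performs this step internally with the statistics stored in the checkpoint.

```python
import math
from typing import List, Sequence

EPS = 1e-8  # assumption: small constant guarding against a zero std


def normalize(x: Sequence[float], mean: Sequence[float], std: Sequence[float]) -> List[float]:
    """Standardize each dimension: (x - mean) / (std + EPS)."""
    return [(xi - m) / (s + EPS) for xi, m, s in zip(x, mean, std)]


def unnormalize(z: Sequence[float], mean: Sequence[float], std: Sequence[float]) -> List[float]:
    """Invert normalize(): z * (std + EPS) + mean."""
    return [zi * (s + EPS) + m for zi, m, s in zip(z, mean, std)]


def all_finite(values: Sequence[float]) -> bool:
    """True if no value is inf or nan."""
    return all(math.isfinite(v) for v in values)


# Illustrative 3-d statistics, not taken from the DOM dataset.
mean, std = [0.1, -0.2, 0.0], [0.5, 1.0, 2.0]
assert all_finite(mean) and all_finite(std)

raw = [0.6, 0.8, -2.0]
restored = unnormalize(normalize(raw, mean, std), mean, std)
print(restored)
```
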
## Code
github.com/mickeykang16/DynamicVLA, branch **`smolvla-multiframe-dom`**
(`policies/smolvla_multiframe.py`, `configs/smolvla.yaml`, `utils/helpers.py`).
## Notes
- **Final** checkpoint (10 epochs). Loss is deeply converged (~0.0015) but **loss does not guarantee sim
success**; judge by DOM sim success rate (vs DynamicVLA and the released DynamicVLA checkpoint).
- Intermediate checkpoints (steps 40,000 / 275,581 / 427,635 / 529,689) are in git history.
- Built to test whether a VL-aligned backbone + multi-frame closes the DOM sim gap seen with DynamicVLA.
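
To place the intermediate checkpoints on the training timeline, their step counts can be converted to epochs using the 73,527 steps per epoch implied by 735,270 = 10 × 73,527. This is plain arithmetic on the numbers in this card; the helper name is illustrative.

```python
STEPS_PER_EPOCH = 73_527  # from the card: 735,270 = 10 x 73,527

# Intermediate checkpoints listed in the Notes, plus the final step.
CHECKPOINT_STEPS = [40_000, 275_581, 427_635, 529_689, 735_270]


def epochs_completed(step: int) -> float:
    """Fraction of epochs finished at a given training step."""
    return step / STEPS_PER_EPOCH


for step in CHECKPOINT_STEPS:
    print(f"step {step:>7,}: {epochs_completed(step):.2f} epochs")
# step  40,000: 0.54 epochs
# step 275,581: 3.75 epochs
# step 427,635: 5.82 epochs
# step 529,689: 7.20 epochs
# step 735,270: 10.00 epochs
```
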