SmolVLA-MultiFrame on DOM — step 275,581 (~epoch 3.7)

Multi-frame SmolVLA fine-tuned on the DOM (Dynamic Object Manipulation) dataset. The repository root holds the latest checkpoint, saved at global step (batch) 275,581, which is ~epoch 3.7 of DOM with loss ≈ 0.0035.

What this is

Backbone: lerobot/smolvla_base, i.e. SmolVLM2-500M-Video-Instruct (SigLIP vision + SmolLM2) + flow-matching action expert. VL-aligned + robot-pretrained.
Training: full fine-tune (403M / 450M trainable, vision encoder unfrozen) on hzxie/DOM (Franka, cameras opst_cam + wrist_cam, state 6-d, action 7-d, chunk 50).
Multi-frame: temporal window {t-2, t} (DELTA_TIMESTAMPS observation: [-2, 0]) — each frame is fed to SmolVLM2 as a separate image so the model perceives object motion (DOM is dynamic).
Setup: 8×H200, global batch 640 (40 per GPU × grad_accum 2 × 8 GPUs), AdamW lr 1e-4, cosine schedule with 1000 warmup steps, bf16.
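
The batch arithmetic and learning-rate schedule from the Setup line can be sketched as follows. This is a minimal illustration, not the training code from the repo: the function name lr_at_step, the linear warmup shape, the min_lr floor of 0, and the total_steps argument are all assumptions, since the card only states "cosine + 1000 warmup" with a peak lr of 1e-4.

```python
import math

# Values taken from the Setup line above.
PER_GPU_BATCH = 40
GRAD_ACCUM = 2
NUM_GPUS = 8
GLOBAL_BATCH = PER_GPU_BATCH * GRAD_ACCUM * NUM_GPUS  # 40 * 2 * 8 = 640

PEAK_LR = 1e-4
WARMUP_STEPS = 1000


def lr_at_step(step, total_steps, peak_lr=PEAK_LR,
               warmup_steps=WARMUP_STEPS, min_lr=0.0):
    """Hypothetical schedule: linear warmup, then cosine decay to min_lr.

    total_steps and min_lr are NOT given in the model card; they are
    placeholders the caller must choose.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Calling lr_at_step(step, total_steps) for a few steps is a quick way to compare against the learning rate logged during training, if those logs are available.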

⚠️ Important — load with MultiFrameSmolVLAPolicy

config.json has type: "smolvla", but this checkpoint was trained to consume two frames per camera. Loading it with the stock SmolVLAPolicy uses only the last frame (single-frame) and loses the multi-frame behavior. For correct inference use MultiFrameSmolVLAPolicy and feed a 2-frame window:

# from the repo branch below: policies/smolvla_multiframe.py
from policies.smolvla_multiframe import MultiFrameSmolVLAPolicy
policy = MultiFrameSmolVLAPolicy.from_pretrained("mickeykang/smolvla-multiframe-DOM")
policy.eval().cuda()
# observation images must be (B, T=2, C, H, W) per camera (frames t-2 and t),
# matching DELTA_TIMESTAMPS observation: [-2, 0].
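
At rollout time the 2-frame window has to be assembled from a stream of single frames. The sketch below shows one way to do that per camera. FrameWindow is a hypothetical helper, not part of the repo; it assumes the offsets [-2, 0] are frame indices (as the {t-2, t} notation suggests) and that missing history at the start of an episode is padded by repeating the earliest frame, which may differ from what the training code does.

```python
from collections import deque


class FrameWindow:
    """Hypothetical per-camera buffer that returns [frame[t-2], frame[t]]."""

    def __init__(self, offsets=(-2, 0)):
        self.offsets = tuple(offsets)
        # Keep just enough history to reach the oldest offset (3 frames here).
        self.buffer = deque(maxlen=1 - min(self.offsets))

    def reset(self):
        """Call at the start of every episode to drop stale frames."""
        self.buffer.clear()

    def push(self, frame):
        """Add the newest frame and return the frames at the configured offsets.

        Early in an episode, offsets that point before the first frame are
        clamped to the earliest frame available (an assumed padding rule).
        """
        self.buffer.append(frame)
        newest = len(self.buffer) - 1
        return [self.buffer[max(0, newest + off)] for off in self.offsets]
```

With image tensors of shape (C, H, W), stacking the two returned frames along a new leading axis (for example with torch.stack) and adding a batch dimension gives the (B, T=2, C, H, W) layout described above.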

Normalization buffers (state/action mean+std) are baked into model.safetensors (no inf/nan), so no dataset is needed to load/eval.
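
For readers unfamiliar with mean+std normalization, the sketch below illustrates what such buffers are typically used for. It assumes a plain z-score with a small epsilon; the exact formula and epsilon used inside the checkpoint's policy code are not stated in this card, and the helper names are hypothetical.

```python
import math


def check_finite(stats):
    """Return the names of entries that contain inf or nan (empty list means OK)."""
    return [name for name, values in stats.items()
            if not all(math.isfinite(v) for v in values)]


def normalize(x, mean, std, eps=1e-8):
    """Assumed z-score: (x - mean) / (std + eps), applied per dimension."""
    return [(xi - m) / (s + eps) for xi, m, s in zip(x, mean, std)]


def unnormalize(z, mean, std, eps=1e-8):
    """Inverse of normalize, e.g. to map predicted actions back to robot units."""
    return [zi * (s + eps) + m for zi, m, s in zip(z, mean, std)]
```

A non-empty result from check_finite on a checkpoint's stats would indicate the kind of inf/nan problem that the sentence above says is absent here.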

Code

github.com/mickeykang16/DynamicVLA — branch smolvla-multiframe-dom (policies/smolvla_multiframe.py, configs/smolvla.yaml, utils/helpers.py).

Notes

Mid-training checkpoint (training was still ongoing at ~epoch 4/500). Loss is deeply converged (~0.0035), but a low loss does not guarantee sim success; judge by DOM sim success-rate.
An earlier step-40,000 checkpoint previously occupied root; recoverable from git history.
Built to test whether a VL-aligned backbone + multi-frame closes the DOM sim gap seen with DynamicVLA.
