	---
	license: apache-2.0
	library_name: lerobot
	tags:
	- robotics
	- vla
	- smolvla
	- flow-matching
	- dom
	---

	# SmolVLA-MultiFrame on DOM — final, 10 epochs (step 735,270)

	Multi-frame SmolVLA fine-tuned on the DOM (Dynamic Object Manipulation) dataset.
	The repository root holds the final checkpoint: 10 full epochs of DOM (step 735,270 = 10 × 73,527 steps per epoch), final loss ≈ 0.0015.
	Training ran to completion and stopped automatically at the 10-epoch target.

	## What this is
	- Backbone: `lerobot/smolvla_base`, i.e. SmolVLM2-500M-Video-Instruct (SigLIP vision + SmolLM2) plus a
	flow-matching action expert. The backbone is vision-language aligned and robot-pretrained.
	- Training: full fine-tune (403M / 450M trainable, vision encoder unfrozen) on `hzxie/DOM`
	(Franka, cameras `opst_cam` + `wrist_cam`, state 6-d, action 7-d, chunk 50).
	- Multi-frame: temporal window {t-2, t} (`DELTA_TIMESTAMPS observation: [-2, 0]`) — each frame is
	fed to SmolVLM2 as a separate image so the model perceives object motion (DOM is dynamic).
	- Setup: 8×H200, global batch 640 (40 per GPU × grad_accum 2 × 8 GPUs), AdamW, lr 1e-4, cosine schedule with 1,000 warmup steps, bf16.
	About 12 days wall-clock for 10 epochs.
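
The batch arithmetic and learning-rate schedule from the setup line can be sketched as follows. This is a minimal illustration, not the training code: it assumes linear warmup from zero, cosine decay to zero over the full 735,270 steps, and that each reported step is one optimizer update. None of those details are stated in this card, and the function name `lr_at` is hypothetical.

```python
import math

# Values taken from the setup line above.
PER_GPU_BATCH = 40
GRAD_ACCUM = 2
NUM_GPUS = 8
GLOBAL_BATCH = PER_GPU_BATCH * GRAD_ACCUM * NUM_GPUS  # 40 * 2 * 8 = 640

BASE_LR = 1e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS = 735_270  # assumption: the schedule spans the whole 10-epoch run


def lr_at(step: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay.

    Assumption: warmup starts at zero and the cosine decays to zero;
    the actual run may have used a non-zero floor.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # clamp in case step exceeds TOTAL_STEPS
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at 1e-4 at step 1,000 and reaches zero at the final step.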

	## ⚠️ Important — load with MultiFrameSmolVLAPolicy
	`config.json` has `type: "smolvla"`, but this checkpoint was trained to consume two frames per camera.
	Loading it with the stock `SmolVLAPolicy` uses only the last frame (single-frame) and loses the
	multi-frame behavior. For correct inference use `MultiFrameSmolVLAPolicy` and feed a 2-frame window:

	```python
	# from the repo branch below: policies/smolvla_multiframe.py
	from policies.smolvla_multiframe import MultiFrameSmolVLAPolicy
	policy = MultiFrameSmolVLAPolicy.from_pretrained("mickeykang/smolvla-multiframe-DOM")
	policy.eval().cuda()
	# observation images must be (B, T=2, C, H, W) per camera (frames t-2 and t),
	# matching DELTA_TIMESTAMPS observation: [-2, 0].
	```
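
At inference time the caller has to assemble the (t-2, t) pair for each camera. One way to do that is a small per-camera frame buffer, sketched below. The class `FrameWindow` is hypothetical and not part of the repository. The sketch also assumes that the offsets in `[-2, 0]` are frame indices, and that at the start of an episode the earliest available frame stands in for the missing t-2 frame.

```python
from collections import deque
from typing import Deque, Generic, Tuple, TypeVar

Frame = TypeVar("Frame")  # e.g. a (C, H, W) image tensor


class FrameWindow(Generic[Frame]):
    """Hypothetical helper: keeps recent frames for one camera and
    returns the (t-offset, t) pair on every push."""

    def __init__(self, offset: int = 2) -> None:
        self.offset = offset
        # Only offset + 1 frames are ever needed: t-offset ... t.
        self._frames: Deque[Frame] = deque(maxlen=offset + 1)

    def reset(self) -> None:
        """Call at the start of each episode."""
        self._frames.clear()

    def push(self, frame: Frame) -> Tuple[Frame, Frame]:
        """Add the newest frame and return (older, newest).

        Until the buffer is full, the oldest frame seen so far is
        used in place of t-offset (assumed padding behaviour).
        """
        self._frames.append(frame)
        return self._frames[0], self._frames[-1]
```

The returned pair can then be stacked along a new time axis, for example with `torch.stack([older, newest], dim=0)`, and batched to reach the (B, T=2, C, H, W) layout described in the comment above.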

	Normalization buffers (state/action mean+std) are baked into `model.safetensors` (no inf/nan),
	so no dataset is needed to load/eval.

	## Code
	github.com/mickeykang16/DynamicVLA — branch `smolvla-multiframe-dom`
	(`policies/smolvla_multiframe.py`, `configs/smolvla.yaml`, `utils/helpers.py`).

	## Notes
	- Final checkpoint (10 epochs). The loss is well converged (~0.0015), but **low loss does not guarantee sim
	success**: judge by DOM sim success rate (vs DynamicVLA and the released DynamicVLA checkpoint).
	- Intermediate checkpoints (steps 40,000 / 275,581 / 427,635 / 529,689) are in git history.
	- Built to test whether a VL-aligned backbone + multi-frame closes the DOM sim gap seen with DynamicVLA.
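
To place the intermediate checkpoints on the training timeline, the step counts listed above can be converted to epochs using the 73,527 steps-per-epoch figure from the header. This is a small arithmetic sketch; the helper name `epochs_completed` is illustrative.

```python
STEPS_PER_EPOCH = 73_527  # from "step 735,270 = 10 × 73,527"
CHECKPOINT_STEPS = [40_000, 275_581, 427_635, 529_689, 735_270]


def epochs_completed(step: int) -> float:
    """Fractional number of DOM epochs seen at a given step."""
    return step / STEPS_PER_EPOCH


for step in CHECKPOINT_STEPS:
    print(f"step {step:>7,}: {epochs_completed(step):.2f} epochs")
```

The intermediate checkpoints fall at roughly 0.54, 3.75, 5.82, and 7.20 epochs, with the final checkpoint at exactly 10.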