How to use mickeykang/smolvla-multiframe-DOM with LeRobot:
# See https://github.com/huggingface/lerobot?tab=readme-ov-file#installation for more details
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"

# Launch finetuning on your dataset
python lerobot/scripts/train.py \
  --policy.path=mickeykang/smolvla-multiframe-DOM \
  --dataset.repo_id=lerobot/svla_so101_pickplace \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/my_smolvla \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true

# Run the policy using the record function.
# Replace robot.port, robot.id, and robot.cameras with your own values.
# dataset.single_task must be the same task description you used in your dataset recording.
# dataset.repo_id will be the dataset name on the HF Hub.
python -m lerobot.record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=my_blue_follower_arm \
  --robot.cameras="{ front: {type: opencv, index_or_path: 8, width: 640, height: 480, fps: 30}}" \
  --dataset.single_task="Grasp a lego block and put it in the bin." \
  --dataset.repo_id=HF_USER/dataset_name \
  --dataset.episode_time_s=50 \
  --dataset.num_episodes=10 \
  --policy.path=mickeykang/smolvla-multiframe-DOM
SmolVLA-MultiFrame on DOM — step 275,581 (~epoch 3.7)
Multi-frame SmolVLA fine-tuned on the DOM (Dynamic Object Manipulation) dataset. Root holds the latest checkpoint: global step / batch 275,581 (~epoch 3.7 of DOM, loss ≈ 0.0035).
What this is
- Backbone: lerobot/smolvla_base, i.e. SmolVLM2-500M-Video-Instruct (SigLIP vision + SmolLM2) + flow-matching action expert. VL-aligned + robot-pretrained.
- Training: full fine-tune (403M / 450M trainable, vision encoder unfrozen) on hzxie/DOM (Franka, cameras opst_cam + wrist_cam, state 6-d, action 7-d, chunk 50).
- Multi-frame: temporal window {t-2, t} (DELTA_TIMESTAMPS observation: [-2, 0]). Each frame is fed to SmolVLM2 as a separate image so the model perceives object motion (DOM is dynamic).
- Setup: 8×H200, global batch 640 (40 × grad_accum 2 × 8), AdamW lr 1e-4, cosine + 1000 warmup, bf16.
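The Setup numbers above can be checked with a few lines of arithmetic. The sketch below reproduces the global batch size (40 per GPU × 2 accumulation steps × 8 GPUs) and shows one common form of a cosine schedule with warmup. The linear warmup shape, the decay to zero, and the `total_steps` value are assumptions made for illustration; the card does not state them, and the actual scheduler lives in the training code.

```python
import math

# Values taken from the Setup bullet above.
PER_GPU_BATCH = 40
GRAD_ACCUM = 2
NUM_GPUS = 8
PEAK_LR = 1e-4
WARMUP_STEPS = 1000


def global_batch(per_gpu: int, grad_accum: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return per_gpu * grad_accum * num_gpus


def lr_at(step: int, total_steps: int, peak_lr: float = PEAK_LR,
          warmup_steps: int = WARMUP_STEPS, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    ASSUMPTION: linear warmup and decay to min_lr=0 are illustrative choices,
    not confirmed by the model card.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("global batch:", global_batch(PER_GPU_BATCH, GRAD_ACCUM, NUM_GPUS))
    total_steps = 100_000  # HYPOTHETICAL: the real schedule length is not given
    for step in (0, 999, 1000, 50_500, 100_000):
        print(step, lr_at(step, total_steps))
```

With the hypothetical `total_steps`, the rate reaches its peak at the end of warmup, is half the peak midway through the decay, and reaches zero at the final step.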
⚠️ Important — load with MultiFrameSmolVLAPolicy
config.json has type: "smolvla", but this checkpoint was trained to consume two frames per camera.
Loading it with the stock SmolVLAPolicy uses only the last frame (single-frame) and loses the
multi-frame behavior. For correct inference use MultiFrameSmolVLAPolicy and feed a 2-frame window:
# from the repo branch below: policies/smolvla_multiframe.py
from policies.smolvla_multiframe import MultiFrameSmolVLAPolicy
policy = MultiFrameSmolVLAPolicy.from_pretrained("mickeykang/smolvla-multiframe-DOM")
policy.eval().cuda()
# observation images must be (B, T=2, C, H, W) per camera (frames t-2 and t),
# matching DELTA_TIMESTAMPS observation: [-2, 0].
Normalization buffers (state/action mean+std) are baked into model.safetensors (no inf/nan),
so no dataset is needed to load/eval.
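At inference time the 2-frame window has to be assembled from the live camera stream. A minimal sketch of one way to do that per camera is shown below. `FrameWindow` is a hypothetical helper, not part of the repository. It assumes the offsets [-2, 0] are counted in control steps (if they are seconds, convert using the camera fps), and it assumes the start of an episode is padded by repeating the earliest available frame; the training pipeline may pad differently.

```python
from collections import deque


class FrameWindow:
    """Hypothetical helper: buffers recent frames for ONE camera and returns
    the frames at the requested offsets relative to the newest frame."""

    def __init__(self, offsets=(-2, 0)):
        if any(o > 0 for o in offsets):
            raise ValueError("offsets must be <= 0 (past or current frames)")
        self.offsets = tuple(offsets)
        # For offsets (-2, 0) we need the last 3 frames: t-2, t-1, t.
        self._buf = deque(maxlen=1 - min(self.offsets))

    def reset(self):
        """Call at the start of every episode so frames do not leak across episodes."""
        self._buf.clear()

    def push(self, frame):
        """Add the newest frame and return [frame at t-2, frame at t].

        ASSUMPTION: while fewer than 3 frames are buffered, the earliest
        buffered frame is repeated in place of the missing one.
        """
        self._buf.append(frame)
        newest = len(self._buf) - 1
        return [self._buf[max(0, newest + o)] for o in self.offsets]


if __name__ == "__main__":
    window = FrameWindow(offsets=(-2, 0))
    for name in ("f0", "f1", "f2", "f3"):
        print(name, "->", window.push(name))
```

If the frames are `torch` tensors of shape (C, H, W), `torch.stack(window.push(frame), dim=0).unsqueeze(0)` gives the (B=1, T=2, C, H, W) layout described above. The observation keys the policy expects for each camera are defined in the repository code and are not reproduced here.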
Code
github.com/mickeykang16/DynamicVLA — branch smolvla-multiframe-dom
(policies/smolvla_multiframe.py, configs/smolvla.yaml, utils/helpers.py).
Notes
- Mid-training checkpoint (training was still ongoing at epoch 4/500). The loss is already low (≈ 0.0035), but a low training loss does not guarantee sim success; judge by DOM sim success rate.
- An earlier step-40,000 checkpoint previously occupied root; it is recoverable from git history.
- Built to test whether a VL-aligned backbone + multi-frame closes the DOM sim gap seen with DynamicVLA.
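Since the notes recommend judging the checkpoint by sim success rate instead of loss, it helps to report that rate with an uncertainty estimate, because a small number of rollouts can make two checkpoints look more different than they are. The sketch below summarizes a list of per-episode outcomes with a Wilson score interval. The outcome list is a made-up placeholder, not a result for this model, and the rollout loop that would produce it is not shown.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize(outcomes):
    """outcomes: iterable of booleans, one per sim episode (True = success)."""
    outcomes = list(outcomes)
    successes = sum(bool(o) for o in outcomes)
    low, high = wilson_interval(successes, len(outcomes))
    return successes / len(outcomes), low, high


if __name__ == "__main__":
    # PLACEHOLDER outcomes for illustration only; not measured results.
    placeholder = [True, False, True, True, False, False, True, False, True, False]
    rate, low, high = summarize(placeholder)
    print(f"success rate {rate:.2f}, ~95% interval [{low:.2f}, {high:.2f}]")
```

Overlapping intervals between two checkpoints suggest that more episodes are needed before concluding that one is better.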