---
license: apache-2.0
library_name: lerobot
tags:
- robotics
- vla
- smolvla
- flow-matching
- dom
---
# SmolVLA-MultiFrame on DOM: final, 10 epochs (step 735,270)

Multi-frame **SmolVLA** fine-tuned on the **DOM** (Dynamic Object Manipulation) dataset.
The repository root holds the **final** checkpoint: **10 full epochs** of DOM (step 735,270 = 10 × 73,527), final loss ≈ 0.0015.
Training completed and auto-stopped at the 10-epoch target.

## What this is

- **Backbone:** `lerobot/smolvla_base`: SmolVLM2-500M-Video-Instruct (SigLIP vision + SmolLM2) plus a flow-matching action expert. VL-aligned and robot-pretrained.
- **Training:** **full fine-tune** (403M / 450M parameters trainable, vision encoder unfrozen) on `hzxie/DOM` (Franka, cameras `opst_cam` + `wrist_cam`, state 6-d, action 7-d, chunk 50).
- **Multi-frame:** temporal window **{t-2, t}** (`DELTA_TIMESTAMPS observation: [-2, 0]`); each frame is fed to SmolVLM2 as a **separate image** so the model perceives object motion (DOM is dynamic).
- **Setup:** 8×H200, global batch 640 (40 per GPU × grad_accum 2 × 8 GPUs), AdamW lr 1e-4, cosine schedule with 1000 warmup steps, bf16. ~12 days wall-clock for 10 epochs.
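
The setup numbers above can be sanity-checked with a few lines of Python. The sketch below reproduces the step and batch arithmetic from the card and shows one common reading of "cosine + 1000 warmup". The linear warmup shape, the decay horizon (the full 735,270 steps), and the final learning rate of 0 are assumptions for illustration; the card does not state them, and the training config in the repository may differ.

```python
import math

# Values stated in the model card.
PEAK_LR = 1e-4
WARMUP_STEPS = 1_000
STEPS_PER_EPOCH = 73_527
EPOCHS = 10
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS  # 735,270

# Global batch = per-GPU batch x gradient accumulation x number of GPUs.
GLOBAL_BATCH = 40 * 2 * 8  # 640


def lr_at(step: int, min_lr: float = 0.0) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to min_lr.

    ASSUMPTION: the warmup shape, the decay horizon (TOTAL_STEPS) and
    min_lr=0 are illustrative choices, not taken from the training config.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (PEAK_LR - min_lr) * cosine


if __name__ == "__main__":
    for step in (0, 500, 1_000, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(f"step {step:>7}: lr = {lr_at(step):.3e}")
```

Under these assumptions the learning rate reaches its peak at the end of warmup and decays smoothly to the chosen floor by the final step.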

## ⚠️ Important: load with MultiFrameSmolVLAPolicy

`config.json` has `type: "smolvla"`, but this checkpoint was trained to consume **two frames per camera**.
Loading it with the stock `SmolVLAPolicy` uses **only the last frame** (single-frame) and loses the
multi-frame behavior. For correct inference use **`MultiFrameSmolVLAPolicy`** and feed a 2-frame window:

```python
# from the repo branch below: policies/smolvla_multiframe.py
from policies.smolvla_multiframe import MultiFrameSmolVLAPolicy
policy = MultiFrameSmolVLAPolicy.from_pretrained("mickeykang/smolvla-multiframe-DOM")
policy.eval().cuda()
# observation images must be (B, T=2, C, H, W) per camera (frames t-2 and t),
# matching DELTA_TIMESTAMPS observation: [-2, 0].
```

Normalization buffers (state/action mean+std) are baked into `model.safetensors` (no inf/nan),
so no dataset is needed to load/eval.
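
At inference time the 2-frame window has to be assembled from the live camera stream. The following is a minimal sketch of a per-camera rolling buffer that returns the pair (t-2, t). `TwoFrameWindow` is a hypothetical helper, not part of the repository. It assumes the offsets `[-2, 0]` are counted in control steps, and that missing history at the start of an episode is padded by repeating the oldest available frame; check the repository's own helpers for the padding the checkpoint was actually trained with.

```python
from collections import deque
from typing import Any, Deque, Tuple


class TwoFrameWindow:
    """Hypothetical helper: keeps recent frames for one camera.

    ASSUMPTIONS (not from the model card): offsets are counted in control
    steps, and early steps are padded with the oldest available frame.
    """

    def __init__(self, offset: int = 2) -> None:
        self.offset = offset
        # Holding offset + 1 frames means index 0 is frame t-offset once full.
        self._frames: Deque[Any] = deque(maxlen=offset + 1)

    def reset(self) -> None:
        """Call at the start of every episode to drop stale frames."""
        self._frames.clear()

    def push(self, frame: Any) -> Tuple[Any, Any]:
        """Add the newest frame and return (frame at t-offset, frame at t)."""
        self._frames.append(frame)
        return self._frames[0], self._frames[-1]


if __name__ == "__main__":
    window = TwoFrameWindow()
    for name in ("f0", "f1", "f2", "f3"):
        print(window.push(name))
```

Keep one window per camera (`opst_cam`, `wrist_cam`). With image tensors of shape (C, H, W), the returned pair can be stacked along a new time axis (for example with `torch.stack`) and batched to obtain the (B, T=2, C, H, W) layout described above.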

## Code

github.com/mickeykang16/DynamicVLA, branch **`smolvla-multiframe-dom`**
(`policies/smolvla_multiframe.py`, `configs/smolvla.yaml`, `utils/helpers.py`).

## Notes

- **Final** checkpoint (10 epochs). The loss is deeply converged (~0.0015), but **loss does not guarantee sim success**; judge by DOM sim success rate (vs DynamicVLA and the released DynamicVLA checkpoint).
- Intermediate checkpoints (steps 40,000 / 275,581 / 427,635 / 529,689) are in the git history.
- Built to test whether a VL-aligned backbone + multi-frame input closes the DOM sim gap seen with DynamicVLA.