metadata
license: other
license_name: ntu-s-lab-license-1.0
license_link: https://github.com/hzxie/DynamicVLA/blob/master/LICENSE
tags:
- robotics
- vision-language-action
- vla
- dynamicvla
- flow-matching
- manipulation
datasets:
- hzxie/DOM
DynamicVLA — DOM (full fine-tune checkpoint)
A DynamicVLA policy trained on the DOM dataset (hzxie/DOM) for dynamic-object manipulation.
⚠️ Mid-training checkpoint (~epoch 18, train loss ≈ 0.0007–0.003). Self-contained and eval-ready (includes normalization buffers), but optimizer/scheduler state is not included (cannot resume optimizer momentum from this file).
Files in this repo
- model.safetensors + config.json (root): latest checkpoint (~epoch 18, a mid-epoch step snapshot, refreshed as training proceeds).
- epoch0005/, epoch0010/: clean epoch-milestone checkpoints (saved at the end of those epochs; load with subfolder="epoch0005" etc.). Note that the folder name uses the internal epoch_idx, which equals the log's "Epoch N+1" (e.g. epoch0010 = the completed "Epoch 11").
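The folder-naming rule above (internal epoch_idx N corresponds to the log's "Epoch N+1") can be captured in a small helper. This is a minimal sketch: the function name `log_epoch_from_subfolder` is hypothetical and not part of the DynamicVLA codebase, and it assumes every milestone folder follows the `epochNNNN` pattern shown above.

```python
import re


def log_epoch_from_subfolder(subfolder: str) -> int:
    """Map a checkpoint folder name (internal epoch_idx) to the epoch number printed in the log.

    Hypothetical helper: assumes folder names look like "epoch0010".
    """
    match = re.fullmatch(r"epoch(\d+)", subfolder.rstrip("/"))
    if match is None:
        raise ValueError(f"unexpected checkpoint folder name: {subfolder!r}")
    epoch_idx = int(match.group(1))  # zero-based index used in the folder name
    return epoch_idx + 1             # the log counts epochs starting from 1


if __name__ == "__main__":
    for name in ("epoch0005", "epoch0010"):
        print(name, "-> log Epoch", log_epoch_from_subfolder(name))
```

For example, `epoch0010` maps to the completed "Epoch 11", matching the note above.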
Model
- Architecture: DynamicVLA = SmolLM2-360M VLM backbone (16 layers) + FastViT vision encoder + flow-matching action expert (cross-attention bridge, temporal-attention fusion).
- Full fine-tune: vision, text, and connector are unfrozen (freeze_* = False) → all 430M parameters trainable (the stock config freezes the backbone and trains only ~99M).
- Action chunk / horizon 20, 2 observation steps, delta actions, images padded to 384×384, cameras opst_cam + wrist_cam.
Training
- Hardware: 8× NVIDIA H200.
- Effective global batch 1280 = 80/GPU × 8 GPUs × grad-accum 2 (matches the paper's effective batch; the paper used 32× A100 × 40/GPU = 1280).
- AdamW, lr 1e-4, betas (0.9, 0.95), wd 1e-10, cosine schedule + 1000 warmup steps.
- This run does only the paper's mid-training stage on DOM (no COYO vision-language pre-training, no real-robot post-training), from off-the-shelf SmolLM2-360M + FastVLM-0.5B init.
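The batch arithmetic and the warmup-plus-cosine schedule listed above can be checked with a short sketch. Several details here are assumptions and not stated in this card: the warmup is taken to be linear, the cosine is taken to decay to zero, and `total_steps` is a hypothetical placeholder, since the total number of training steps is not given. The function names are illustrative only and are not taken from the training code.

```python
import math


def effective_batch(per_gpu: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Global batch size seen by one optimizer step."""
    return per_gpu * num_gpus * grad_accum


def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-4,
               warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay (assumed shape, see lead-in)."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # This run: 80 per GPU x 8 GPUs x grad-accum 2.
    print(effective_batch(80, 8, 2))
    # Paper setup as described above: 40 per GPU x 32 GPUs, no accumulation.
    print(effective_batch(40, 32))
    total_steps = 10_000  # hypothetical value, for illustration only
    for step in (0, 500, 1000, 5500, 10_000):
        print(step, lr_at_step(step, total_steps))
```

Both configurations give the same global batch of 1280, which is why the 8-GPU run needs gradient accumulation to match the paper's setting.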
Load / evaluate
from policies.dynamicvla.modeling_dynamicvla import DynamicVLAPolicy
policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM") # latest (~epoch 18)
# policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM", subfolder="epoch0010")
policy.eval().cuda()
from_pretrained restores the normalization buffers from model.safetensors, so no dataset is
needed to load the policy or run inference. For the DOM benchmark, serve the policy with
scripts/inference.py -p <dir> against the Isaac Lab simulations/evaluate.py eval server.
Notes
- DOM contains some corrupt/truncated videos; a local utils/datasets.py resilience patch (substitute a valid sample on any decode error) is needed to train on the full set, but not to load/eval this checkpoint.
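The idea behind such a resilience patch can be sketched as a thin wrapper around any indexable dataset. This is not the actual patch to utils/datasets.py: the class name `ResilientDataset`, the random choice of replacement index, and the retry limit are all assumptions made for illustration.

```python
import random


class ResilientDataset:
    """Wrap an indexable dataset; if loading a sample fails, return another one.

    Hypothetical sketch of "substitute a valid sample on any decode error".
    """

    def __init__(self, dataset, max_retries: int = 10, seed: int = 0):
        self.dataset = dataset
        self.max_retries = max_retries
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        # Keep normal out-of-range behaviour so iteration still terminates.
        if not 0 <= index < len(self.dataset):
            raise IndexError(index)
        for _ in range(self.max_retries + 1):
            try:
                return self.dataset[index]
            except Exception:
                # Loading failed (e.g. corrupt video): try a random replacement index.
                index = self.rng.randrange(len(self.dataset))
        raise RuntimeError(f"no valid sample found after {self.max_retries} retries")
```

Substituting samples slightly changes the sampling distribution, which is usually acceptable when only a small fraction of the videos is corrupt; logging the failing indices would make it possible to verify that.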