---
license: other
license_name: ntu-s-lab-license-1.0
license_link: https://github.com/hzxie/DynamicVLA/blob/master/LICENSE
tags:
- robotics
- vision-language-action
- vla
- dynamicvla
- flow-matching
- manipulation
datasets:
- hzxie/DOM
---

# DynamicVLA: DOM (full fine-tune checkpoint)

A [DynamicVLA](https://github.com/hzxie/DynamicVLA) policy trained on the **DOM** dataset
([hzxie/DOM](https://huggingface.co/datasets/hzxie/DOM)) for dynamic-object manipulation.

> ⚠️ **Mid-training checkpoint** (~epoch 18, train loss ≈ 0.0007–0.003). Self-contained and
> eval-ready (includes normalization buffers), but optimizer/scheduler state is **not** included
> (cannot resume optimizer momentum from this file).

## Files in this repo

- `model.safetensors` + `config.json` (root): **latest** checkpoint (~epoch 18, a mid-epoch step snapshot, refreshed as training proceeds).
- `epoch0005/`, `epoch0010/`: clean **epoch-milestone** checkpoints (saved at the end of those epochs; load with `subfolder="epoch0005"` etc.). Note the folder name uses the internal `epoch_idx`, which equals the log's "Epoch N+1" (e.g. `epoch0010` = the completed "Epoch 11").

## Model

- **Architecture:** DynamicVLA = `SmolLM2-360M` VLM backbone (16 layers) + FastViT vision encoder + flow-matching action expert (cross-attention bridge, temporal-attention fusion).
- **Full fine-tune:** vision, text, and connector are **unfrozen** (`freeze_* = False`) → all **430M parameters trainable** (the stock config freezes the backbone and trains only ~99M).
- Action chunk / horizon 20, 2 observation steps, delta actions, images padded to 384×384, cameras `opst_cam` + `wrist_cam`.

## Training

- **Hardware:** 8× NVIDIA H200.
- **Effective global batch 1280** = 80/GPU × 8 GPUs × grad-accum 2 (matches the paper's effective batch; the paper used 32× A100 × 40/GPU = 1280).
- AdamW, lr 1e-4, betas (0.9, 0.95), wd 1e-10, cosine schedule + 1000 warmup steps.
- This run does only the paper's **mid-training** stage on DOM (no COYO vision-language pre-training, no real-robot post-training), from off-the-shelf SmolLM2-360M + FastVLM-0.5B init.

## Load / evaluate

```python
from policies.dynamicvla.modeling_dynamicvla import DynamicVLAPolicy

policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM")  # latest (~epoch 18)
# policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM", subfolder="epoch0010")
policy.eval().cuda()
```

`from_pretrained` restores the normalization buffers from `model.safetensors`, so no dataset is needed to load/infer. For the DOM benchmark, serve with `scripts/inference.py -p ` against the Isaac Lab `simulations/evaluate.py` eval server.

## Notes

- DOM contains some corrupt/truncated videos; a local `utils/datasets.py` resilience patch (substitute a valid sample on any decode error) is needed to **train** on the full set, but not to load/eval this checkpoint.
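
The resilience patch itself is not part of this repo. As a rough illustration of the "substitute a valid sample on any decode error" idea from the note above, here is a minimal sketch of a map-style dataset wrapper. The class name `ResilientDataset`, its `max_retries` and `seed` parameters, and the `bad_indices` cache are all hypothetical and are not taken from the actual `utils/datasets.py` change.

```python
import random


class ResilientDataset:
    """Map-style dataset wrapper that substitutes another sample on decode errors.

    Hypothetical helper for illustration; not the actual utils/datasets.py patch.
    """

    def __init__(self, base, max_retries=10, seed=0):
        self.base = base                # any object with __len__ and __getitem__
        self.max_retries = max_retries  # how many substitute samples to try
        self.rng = random.Random(seed)  # private RNG so substitution is reproducible
        self.bad_indices = set()        # indices that already failed to decode

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        # Out-of-range access is a caller error, not a corrupt sample.
        if not 0 <= index < len(self.base):
            raise IndexError(index)
        candidate = index
        for _ in range(self.max_retries + 1):
            if candidate not in self.bad_indices:
                try:
                    return self.base[candidate]
                except Exception:
                    # Remember the failure so this index is not decoded again.
                    self.bad_indices.add(candidate)
            # Draw a random replacement index and try again.
            candidate = self.rng.randrange(len(self.base))
        raise RuntimeError(f"no decodable sample found near index {index}")
```

Under these assumptions, the training dataset would be wrapped once before it is handed to the data loader, e.g. `ResilientDataset(train_dataset)`. Substitution slightly oversamples the healthy clips, which is usually acceptable when only a small fraction of videos is corrupt; inspecting `bad_indices` after an epoch shows which samples were skipped.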