---
license: other
license_name: ntu-s-lab-license-1.0
license_link: https://github.com/hzxie/DynamicVLA/blob/master/LICENSE
tags:
- robotics
- vision-language-action
- vla
- dynamicvla
- flow-matching
- manipulation
datasets:
- hzxie/DOM
---

# DynamicVLA: DOM (full fine-tune checkpoint)

A [DynamicVLA](https://github.com/hzxie/DynamicVLA) policy trained on the **DOM** dataset
([hzxie/DOM](https://huggingface.co/datasets/hzxie/DOM)) for dynamic-object manipulation.

> ⚠️ **Mid-training checkpoint** (~epoch 18, train loss ≈ 0.0007–0.003). Self-contained and
> eval-ready (includes normalization buffers), but optimizer/scheduler state is **not** included,
> so training cannot be resumed with the original optimizer momentum from this file.

## Files in this repo
- `model.safetensors` + `config.json` (root): the **latest** checkpoint (~epoch 18, a mid-epoch step
  snapshot, refreshed as training proceeds).
- `epoch0005/`, `epoch0010/`: clean **epoch-milestone** checkpoints, saved at the end of those
  epochs (load with `subfolder="epoch0005"` etc.). The folder name uses the internal `epoch_idx`;
  a folder with `epoch_idx` N corresponds to the log's "Epoch N+1" (e.g. `epoch0010` = the
  completed "Epoch 11").

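The folder naming can be sketched as a small helper. This is only an illustration: `milestone_subfolder` is a hypothetical name, and both the four-digit zero padding and the assumption that log epochs start at 1 are inferred from the folder names listed here, not taken from the training code.

```python
def milestone_subfolder(log_epoch: int) -> str:
    """Map the epoch number printed in the training log to a checkpoint folder name.

    Assumption: folders are named "epoch" + zero-padded internal epoch_idx,
    and the log prints epoch_idx + 1 (so "Epoch 11" -> "epoch0010").
    """
    if log_epoch < 1:
        raise ValueError("log epochs are assumed to start at 1")
    epoch_idx = log_epoch - 1
    return f"epoch{epoch_idx:04d}"


print(milestone_subfolder(11))  # epoch0010
print(milestone_subfolder(6))   # epoch0005
```

The returned string is what would be passed as `subfolder=...` when loading a milestone checkpoint.
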
## Model

- **Architecture:** DynamicVLA = `SmolLM2-360M` VLM backbone (16 layers) + FastViT vision encoder +
  flow-matching action expert (cross-attention bridge, temporal-attention fusion).
- **Full fine-tune:** vision, text, and connector are **unfrozen** (`freeze_* = False`), so all
  **430M parameters are trainable** (the stock config freezes the backbone and trains only ~99M).
- Action chunk (horizon) of 20, 2 observation steps, delta actions, images padded to 384×384,
  cameras `opst_cam` + `wrist_cam`.

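One way to sanity-check the trainable-parameter figure is to count parameter elements directly. The helper below is a minimal sketch, not part of the DynamicVLA code base: `count_parameters` is a hypothetical name, and it relies only on the `.numel()` method and `.requires_grad` attribute that PyTorch parameters expose.

```python
def count_parameters(params):
    """Return (total, trainable) element counts for an iterable of parameters.

    Each item only needs `.numel()` and `.requires_grad`, which PyTorch
    parameters provide, e.g. pass `policy.parameters()`.
    """
    total = 0
    trainable = 0
    for p in params:
        n = p.numel()
        total += n
        if p.requires_grad:
            trainable += n
    return total, trainable
```

On a loaded policy, `count_parameters(policy.parameters())` would be expected to give roughly 430M for both values, provided the freeze flags stored in `config.json` are applied on load (an assumption worth verifying).
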
## Training

- **Hardware:** 8× NVIDIA H200.
- **Effective global batch 1280** = 80/GPU × 8 GPUs × grad-accum 2 (matches the paper's
  effective batch; the paper used 32× A100 × 40/GPU = 1280).
- AdamW, lr 1e-4, betas (0.9, 0.95), weight decay 1e-10, cosine schedule with 1000 warmup steps.
- This run does only the paper's **mid-training** stage on DOM (no COYO vision-language
  pre-training, no real-robot post-training), initialized from off-the-shelf SmolLM2-360M +
  FastVLM-0.5B weights.

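The batch arithmetic and the learning-rate schedule can be sketched in a few lines of plain Python. Several details here are assumptions rather than facts from the training code: the paper's gradient-accumulation factor of 1, linear warmup, decay to zero, and the `total_steps` argument (the total step count is not stated in this card). `lr_at` is a hypothetical helper name.

```python
import math

# Effective global batch = per-GPU batch x number of GPUs x grad-accum steps.
this_run = 80 * 8 * 2   # 8x H200, grad-accum 2
paper = 40 * 32 * 1     # 32x A100; grad-accum of 1 is an assumption
assert this_run == paper == 1280


def lr_at(step, total_steps, base_lr=1e-4, warmup_steps=1000):
    """Linear warmup followed by cosine decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

Under these assumptions the rate reaches the base value of 1e-4 at the end of warmup and then falls along a half cosine until `total_steps`.
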
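## Load / evaluate

```python
# The import path suggests running from the DynamicVLA repo root (or adding it to PYTHONPATH).
from policies.dynamicvla.modeling_dynamicvla import DynamicVLAPolicy

# Root of the repo: latest checkpoint (~epoch 18).
policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM")
# Epoch-milestone checkpoint instead:
# policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM", subfolder="epoch0010")

# Switch to inference mode and move the weights to the GPU.
policy.eval().cuda()
```
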
`from_pretrained` restores the normalization buffers from `model.safetensors`, so no dataset is
needed to load the policy or run inference. For the DOM benchmark, serve the policy with
`scripts/inference.py -p <dir>` against the Isaac Lab `simulations/evaluate.py` eval server.

## Notes
- DOM contains some corrupt/truncated videos; a local resilience patch to `utils/datasets.py`
  (substitute a valid sample on any decode error) is needed to **train** on the full set, but not
  to load or evaluate this checkpoint.

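The idea behind such a resilience patch can be illustrated with a generic wrapper around a map-style dataset. This is a sketch of the technique only, not the actual change made to `utils/datasets.py`: `ResilientDataset`, `max_retries`, and the toy `_FlakyVideos` class are all hypothetical, and the fallback policy (try the next indices in order) is one assumption among several possible choices.

```python
class ResilientDataset:
    """Wrap a map-style dataset and substitute a valid sample on decode errors."""

    def __init__(self, dataset, max_retries=10):
        self.dataset = dataset
        self.max_retries = max_retries

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        last_error = None
        # Try the requested index first, then fall back to the following ones.
        for offset in range(self.max_retries + 1):
            candidate = (idx + offset) % len(self.dataset)
            try:
                return self.dataset[candidate]
            except Exception as err:  # broad on purpose: decode errors vary by backend
                last_error = err
        raise RuntimeError(f"no readable sample near index {idx}") from last_error


class _FlakyVideos:
    """Toy stand-in for a video dataset: index 1 simulates a corrupt file."""

    def __len__(self):
        return 3

    def __getitem__(self, i):
        if i == 1:
            raise IOError("truncated video")
        return f"clip-{i}"


data = ResilientDataset(_FlakyVideos())
print([data[i] for i in range(len(data))])  # ['clip-0', 'clip-2', 'clip-2']
```

Substituting a neighbouring sample keeps batch shapes intact at the cost of slightly over-sampling the neighbours of corrupt files, which is usually acceptable when only a small fraction of videos is affected.
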