Refresh root to latest checkpoint (~epoch 18, batch624988)


license: other
license_name: ntu-s-lab-license-1.0
license_link: https://github.com/hzxie/DynamicVLA/blob/master/LICENSE
tags:
  - robotics
  - vision-language-action
  - vla
  - dynamicvla
  - flow-matching
  - manipulation
datasets:
  - hzxie/DOM

DynamicVLA — DOM (full fine-tune checkpoint)

A DynamicVLA policy trained on the DOM dataset (hzxie/DOM) for dynamic-object manipulation.

⚠️ Mid-training checkpoint (~epoch 18, train loss ≈ 0.0007–0.003). It is self-contained and eval-ready (normalization buffers are included), but optimizer and scheduler state are not, so training cannot be resumed from this file with the original optimizer momentum.

Files in this repo

model.safetensors + config.json (root): latest checkpoint (~epoch 18), a mid-epoch step snapshot that is refreshed as training proceeds.
epoch0005/, epoch0010/: clean epoch-milestone checkpoints, saved at the end of those epochs (load with subfolder="epoch0005", etc.). Folder names use the internal epoch_idx, and epoch_idx N corresponds to the log's "Epoch N+1" (e.g. epoch0010 is the completed "Epoch 11").
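
The off-by-one between folder names and log epochs is easy to get wrong, so here is a minimal sketch of the mapping. The helper names are hypothetical and are not part of the DynamicVLA code base. The four-digit zero padding is read off the folder names above, and the sketch assumes log epochs are numbered from 1.

```python
PREFIX = "epoch"


def folder_for_log_epoch(log_epoch: int) -> str:
    """Map the "Epoch N" number printed in the training log to a subfolder name."""
    if log_epoch < 1:
        # Assumption: the log counts epochs from 1, so epoch_idx starts at 0.
        raise ValueError("log epochs are assumed to start at 1")
    epoch_idx = log_epoch - 1  # folder names use the internal epoch_idx
    return f"{PREFIX}{epoch_idx:04d}"


def log_epoch_for_folder(folder: str) -> int:
    """Inverse mapping: the log epoch that a subfolder such as 'epoch0010' completes."""
    if not folder.startswith(PREFIX):
        raise ValueError(f"unexpected folder name: {folder!r}")
    return int(folder[len(PREFIX):]) + 1
```

For example, folder_for_log_epoch(11) gives "epoch0010", which matches the example in the list above.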

Model

Architecture: DynamicVLA = SmolLM2-360M VLM backbone (16 layers) + FastViT vision encoder + flow-matching action expert (cross-attention bridge, temporal-attention fusion).
Full fine-tune: vision, text, and connector are unfrozen (freeze_* = False), so all 430M parameters are trainable (the stock config freezes the backbone and trains only ~99M).
Action chunk / horizon 20, 2 observation steps, delta actions, images padded to 384×384, cameras opst_cam + wrist_cam.

Training

Hardware: 8× NVIDIA H200.
Effective global batch 1280 = 80/GPU × 8 GPUs × grad-accum 2 (matches the paper's effective batch; the paper used 32× A100 × 40/GPU = 1280).
AdamW, lr 1e-4, betas (0.9, 0.95), wd 1e-10, cosine schedule + 1000 warmup steps.
This run performs only the paper's mid-training stage on DOM (no COYO vision-language pre-training and no real-robot post-training), initialized from off-the-shelf SmolLM2-360M and FastVLM-0.5B weights.
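
The batch arithmetic and learning-rate schedule listed above can be checked with a short sketch. Only the peak learning rate, the warmup length, and the batch factors come from this card. The linear warmup shape, the decay to zero, and TOTAL_STEPS are assumptions made for illustration, so the actual training script may differ.

```python
import math

# Effective batch = per-GPU batch x number of GPUs x gradient-accumulation steps.
per_gpu_batch, num_gpus, grad_accum = 80, 8, 2
effective_batch = per_gpu_batch * num_gpus * grad_accum

# The paper's setup as quoted on this card: 32 GPUs x 40 per GPU.
paper_batch = 32 * 40

PEAK_LR = 1e-4
WARMUP_STEPS = 1000
TOTAL_STEPS = 100_000  # hypothetical: the card does not state the schedule length


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay.

    Both the linear warmup and the decay to exactly zero are assumptions.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold at the floor past the end of the schedule
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

Both effective_batch and paper_batch evaluate to 1280, and lr_at reaches the peak value of 1e-4 at the end of warmup before decaying.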

Load / evaluate

from policies.dynamicvla.modeling_dynamicvla import DynamicVLAPolicy

# Latest checkpoint at the repo root (~epoch 18).
policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM")
# Alternatively, load an epoch-milestone checkpoint:
# policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM", subfolder="epoch0010")
policy.eval().cuda()  # switch to inference mode and move to the GPU

from_pretrained restores the normalization buffers from model.safetensors, so no dataset is needed for loading or inference. For the DOM benchmark, serve the policy with scripts/inference.py -p <dir> against the Isaac Lab evaluation server (simulations/evaluate.py).

Notes

DOM contains some corrupt/truncated videos; a local utils/datasets.py resilience patch (substitute a valid sample on any decode error) is needed to train on the full set, but not to load/eval this checkpoint.
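
The idea behind such a resilience patch can be sketched as a wrapper around a map-style dataset. This is an illustration of the "substitute a valid sample on any decode error" approach, not the actual utils/datasets.py patch. The class names, the retry limit, and the random choice of replacement index are all assumptions.

```python
import random


class ResilientDataset:
    """Wrap a map-style dataset and return another sample when one fails to decode."""

    def __init__(self, dataset, max_retries=10, seed=0):
        self.dataset = dataset
        self.max_retries = max_retries  # hypothetical limit, not from the card
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        # Out-of-range indices must still raise so that plain iteration terminates.
        if not 0 <= index < len(self.dataset):
            raise IndexError(index)
        for _ in range(self.max_retries + 1):
            try:
                return self.dataset[index]
            except Exception:  # broad on purpose: decoder error types vary
                index = self.rng.randrange(len(self.dataset))
        raise RuntimeError(f"no decodable sample after {self.max_retries} retries")


class ToyVideoDataset:
    """Stand-in for a video dataset in which some indices fail to decode."""

    def __init__(self, size, corrupt):
        self.size = size
        self.corrupt = set(corrupt)

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        if index in self.corrupt:
            raise ValueError(f"truncated video at index {index}")
        return {"index": index}


toy = ResilientDataset(ToyVideoDataset(size=8, corrupt={3}))
sample = toy[3]  # index 3 is corrupt, so a different, valid sample comes back
```

Substituting samples slightly changes the sampling distribution, so logging how often the fallback fires is a reasonable precaution when many files are corrupt.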