---
license: other
license_name: ntu-s-lab-license-1.0
license_link: https://github.com/hzxie/DynamicVLA/blob/master/LICENSE
tags:
- robotics
- vision-language-action
- vla
- dynamicvla
- flow-matching
- manipulation
datasets:
- hzxie/DOM
---
# DynamicVLA: DOM (full fine-tune checkpoint)
A [DynamicVLA](https://github.com/hzxie/DynamicVLA) policy trained on the **DOM** dataset
([hzxie/DOM](https://huggingface.co/datasets/hzxie/DOM)) for dynamic-object manipulation.
> ⚠️ **Mid-training checkpoint** (~epoch 18, train loss ≈ 0.0007–0.003). Self-contained and
> eval-ready (includes normalization buffers), but optimizer/scheduler state is **not** included,
> so training cannot be resumed with the original optimizer momentum from this file.
## Files in this repo
- `model.safetensors` + `config.json` (root): **latest** checkpoint (~epoch 18, a mid-epoch step
snapshot, refreshed as training proceeds).
- `epoch0005/`, `epoch0010/`: clean **epoch-milestone** checkpoints (saved at the end of those
epochs; load with `subfolder="epoch0005"` etc.). The folder name uses the internal
`epoch_idx`, so folder index N corresponds to the log's "Epoch N+1" (e.g. `epoch0010` = the completed "Epoch 11").
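
The folder-to-log offset is easy to get wrong, so here is a minimal sketch of the mapping. The helper names are hypothetical (they are not part of the DynamicVLA code base), and the four-digit zero padding is simply read off the folder names listed above.

```python
import re


def folder_to_log_epoch(folder: str) -> int:
    """Map a milestone folder such as 'epoch0010' to the epoch number shown in the log."""
    match = re.fullmatch(r"epoch(\d+)", folder)
    if match is None:
        raise ValueError(f"not a milestone folder name: {folder!r}")
    epoch_idx = int(match.group(1))  # internal index used in the folder name
    return epoch_idx + 1             # the training log prints "Epoch N+1"


def log_epoch_to_folder(log_epoch: int) -> str:
    """Inverse mapping; assumes the 4-digit zero padding seen in this repo."""
    return f"epoch{log_epoch - 1:04d}"


print(folder_to_log_epoch("epoch0010"))  # the completed "Epoch 11"
print(log_epoch_to_folder(6))            # folder holding the completed "Epoch 6"
```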
## Model
- **Architecture:** DynamicVLA = `SmolLM2-360M` VLM backbone (16 layers) + FastViT vision encoder
+ flow-matching action expert (cross-attention bridge, temporal-attention fusion).
- **Full fine-tune:** vision, text, and connector are **unfrozen** (`freeze_* = False`) → all
**430M parameters trainable** (the stock config freezes the backbone and trains only ~99M).
- Action chunk / horizon 20, 2 observation steps, delta actions, images padded to 384×384,
cameras `opst_cam` + `wrist_cam`.
## Training
- **Hardware:** 8× NVIDIA H200.
- **Effective global batch 1280** = 80/GPU × 8 GPUs × grad-accum 2 (matches the paper's
effective batch; the paper used 32× A100 × 40/GPU = 1280).
- AdamW, lr 1e-4, betas (0.9, 0.95), wd 1e-10, cosine schedule + 1000 warmup steps.
- This run does only the paper's **mid-training** stage on DOM (no COYO vision-language
pre-training, no real-robot post-training), from off-the-shelf SmolLM2-360M + FastVLM-0.5B init.
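
The batch arithmetic and the learning-rate schedule above can be checked with a small pure-Python sketch. This is not the training code: the warmup is assumed to be linear, the cosine is assumed to decay to zero, and `total_steps` is a hypothetical value, since the card does not state the run length.

```python
import math


def effective_batch(per_gpu: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return per_gpu * num_gpus * grad_accum


this_run = effective_batch(per_gpu=80, num_gpus=8, grad_accum=2)  # 8x H200
paper = effective_batch(per_gpu=40, num_gpus=32)                  # 32x A100


def lr_at(step: int, total_steps: int, base_lr: float = 1e-4,
          warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Linear warmup (assumed) followed by cosine decay to min_lr (assumed 0)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp once the schedule has finished
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(this_run, paper)
# total_steps=100_000 is a placeholder, not the length of this run.
for step in (0, 999, 50_500, 100_000):
    print(step, lr_at(step, total_steps=100_000))
```

Doubling the per-GPU batch and adding two gradient-accumulation steps is what lets 8 GPUs match the paper's 32-GPU effective batch.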
## Load / evaluate
```python
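# Assumes the DynamicVLA repo is importable (e.g. run from its root or add it to PYTHONPATH).
from policies.dynamicvla.modeling_dynamicvla import DynamicVLAPolicy

policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM")  # latest (~epoch 18)
# Epoch-milestone checkpoint instead:
# policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM", subfolder="epoch0010")
policy.eval().cuda()  # inference mode, moved to the default CUDA device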
```
`from_pretrained` restores the normalization buffers from `model.safetensors`, so no dataset is
needed to load/infer. For the DOM benchmark, serve with `scripts/inference.py -p <dir>` against the
Isaac Lab `simulations/evaluate.py` eval server.
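
To confirm that a downloaded `model.safetensors` really carries the normalization buffers, the file header can be inspected without loading any weights. The sketch below relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). The substring `"normalize"` used in the usage note is a guess at how the buffers are named, not something this card states.

```python
import json
import struct


def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        return json.loads(f.read(header_len).decode("utf-8"))


def find_tensors(path: str, needle: str) -> dict:
    """Map tensor name -> shape for every tensor whose name contains `needle`."""
    header = read_safetensors_header(path)
    return {
        name: info["shape"]
        for name, info in header.items()
        if name != "__metadata__" and needle in name  # skip the free-form metadata entry
    }
```

For example, `find_tensors("model.safetensors", "normalize")` would list candidate buffers if they follow that naming. An empty result only means the guessed substring is wrong, so check the full key list from `read_safetensors_header` before drawing conclusions.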
## Notes
- DOM contains some corrupt/truncated videos; a local `utils/datasets.py` resilience patch
(substitute a valid sample on any decode error) is needed to **train** on the full set, but not
to load/eval this checkpoint.
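
The idea behind such a patch can be sketched as a thin wrapper around a map-style dataset. This is an illustration only, not the actual `utils/datasets.py` change: the class name, the retry limit, and the "try the next index" fallback are all assumptions.

```python
class ResilientDataset:
    """Wrap a map-style dataset and substitute another sample when decoding fails."""

    def __init__(self, dataset, max_retries: int = 10):
        self.dataset = dataset
        self.max_retries = max_retries

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: int):
        n = len(self.dataset)
        last_error = None
        for attempt in range(self.max_retries + 1):
            candidate = (index + attempt) % n  # deterministic fallback: the next index
            try:
                return self.dataset[candidate]
            except Exception as err:  # broad on purpose; narrow to the decoder's error type
                last_error = err
        raise RuntimeError(f"no decodable sample near index {index}") from last_error
```

Substituting a neighbor slightly over-samples the clips next to corrupt ones. Logging each substitution makes it possible to check that the affected fraction stays small.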