---
license: other
license_name: ntu-s-lab-license-1.0
license_link: https://github.com/hzxie/DynamicVLA/blob/master/LICENSE
tags:
  - robotics
  - vision-language-action
  - vla
  - dynamicvla
  - flow-matching
  - manipulation
datasets:
  - hzxie/DOM
---

# DynamicVLA — DOM (full fine-tune checkpoint)

A [DynamicVLA](https://github.com/hzxie/DynamicVLA) policy trained on the **DOM** dataset
([hzxie/DOM](https://huggingface.co/datasets/hzxie/DOM)) for dynamic-object manipulation.

> ⚠️ **Mid-run checkpoint** (~epoch 18, train loss ≈ 0.0007–0.003; training was still in progress
> when it was saved). Self-contained and eval-ready (includes normalization buffers), but
> optimizer/scheduler state is **not** included, so a run resumed from this file starts with fresh
> optimizer momentum and scheduler state.

## Files in this repo
- `model.safetensors` + `config.json` (root): **latest** checkpoint (~epoch 18; a mid-epoch step
  snapshot, refreshed as training proceeds).
- `epoch0005/`, `epoch0010/`: clean **epoch-milestone** checkpoints, saved at the end of those
  epochs; load with `subfolder="epoch0005"` etc. Folder names use the internal `epoch_idx`, which
  is one less than the epoch number printed in the training log, so `epoch0010` is the completed
  "Epoch 11".
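
The folder-name offset is easy to get wrong when matching checkpoints to log lines. A minimal sketch of the mapping described above; the helper `log_epoch_from_folder` is hypothetical and not part of the DynamicVLA repo:

```python
import re


def log_epoch_from_folder(folder: str) -> int:
    """Map a milestone folder such as 'epoch0010' to the epoch number in the training log."""
    match = re.fullmatch(r"epoch(\d+)/?", folder)
    if match is None:
        raise ValueError(f"not a milestone folder name: {folder!r}")
    epoch_idx = int(match.group(1))  # internal index used in the folder name
    return epoch_idx + 1             # the log prints "Epoch N+1" for epoch_idx N


print(log_epoch_from_folder("epoch0010"))  # 11
```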

## Model

- **Architecture:** DynamicVLA = `SmolLM2-360M` VLM backbone (16 layers) + FastViT vision encoder
  + flow-matching action expert (cross-attention bridge, temporal-attention fusion).
- **Full fine-tune:** vision, text, and connector are **unfrozen** (`freeze_* = False`) → all
  **430M parameters trainable** (the stock config freezes the backbone and trains only ~99M).
- Action chunk / horizon 20, 2 observation steps, delta actions, images padded to 384×384,
  cameras `opst_cam` + `wrist_cam`.

## Training

- **Hardware:** 8× NVIDIA H200.
- **Effective global batch 1280** = 80/GPU × 8 GPUs × grad-accum 2 (matches the paper's
  effective batch; the paper used 32× A100 × 40/GPU = 1280).
- AdamW, lr 1e-4, betas (0.9, 0.95), weight decay 1e-10, cosine schedule with 1000 warmup steps.
- This run does only the paper's **mid-training** stage on DOM (no COYO vision-language
  pre-training, no real-robot post-training), from off-the-shelf SmolLM2-360M + FastVLM-0.5B init.
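
As a sanity check on the numbers above, the sketch below recomputes both effective batch sizes and illustrates a warmup-plus-cosine learning-rate curve. It is an illustration only: the linear warmup shape, the decay floor `min_lr=0.0`, and the `total_steps` argument are assumptions, since the card states only the peak learning rate, the cosine schedule, and the 1000 warmup steps.

```python
import math


def effective_batch(per_gpu: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return per_gpu * num_gpus * grad_accum


this_run = effective_batch(per_gpu=80, num_gpus=8, grad_accum=2)  # 8x H200
paper = effective_batch(per_gpu=40, num_gpus=32)                  # 32x A100
print(this_run, paper)  # 1280 1280


def lr_at(step: int, total_steps: int, peak_lr: float = 1e-4,
          warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Doubling the per-GPU batch and using two accumulation steps on 8 GPUs is what lets this run match the paper's 1280-sample optimizer step with a quarter of the devices.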

## Load / evaluate

```python
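# Assumes the DynamicVLA repo root is on PYTHONPATH so that `policies` is importable.
from policies.dynamicvla.modeling_dynamicvla import DynamicVLAPolicy

# Repo root: latest checkpoint (~epoch 18).
policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM")
# Alternative: an epoch-milestone checkpoint.
# policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM", subfolder="epoch0010")
policy.eval().cuda()  # switch to eval mode and move to the GPU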
```

`from_pretrained` restores the normalization buffers from `model.safetensors`, so no dataset is
needed to load/infer. For the DOM benchmark, serve with `scripts/inference.py -p <dir>` against the
Isaac Lab `simulations/evaluate.py` eval server.

## Notes
- DOM contains some corrupt/truncated videos; a local `utils/datasets.py` resilience patch
  (substitute a valid sample on any decode error) is needed to **train** on the full set, but not
  to load/eval this checkpoint.
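
The patch itself is not included in this repo. The sketch below only illustrates the general idea of substituting a valid sample on a decode error; the `ResilientDataset` wrapper, the `FakeVideoDataset` stand-in, and the choice of falling forward to the next index are all hypothetical and may differ from the actual `utils/datasets.py` change.

```python
class ResilientDataset:
    """Wrap an indexable dataset; on a decode error, fall forward to the next sample."""

    def __init__(self, dataset, max_retries: int = 10):
        self.dataset = dataset
        self.max_retries = max_retries

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        last_err = None
        for offset in range(self.max_retries + 1):
            candidate = (index + offset) % len(self.dataset)
            try:
                return self.dataset[candidate]
            except Exception as err:  # broad on purpose: decoders raise many error types
                last_err = err
        raise RuntimeError(f"no decodable sample near index {index}") from last_err


class FakeVideoDataset:
    """Stand-in dataset whose 'corrupt' indices raise on access (for the demo only)."""

    def __init__(self, size, corrupt):
        self.size = size
        self.corrupt = set(corrupt)

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        if index in self.corrupt:
            raise ValueError("truncated video")
        return {"index": index}


data = ResilientDataset(FakeVideoDataset(5, corrupt={1, 2}))
print(data[1])  # {'index': 3}
```

Substitution keeps batch shapes intact at the cost of slightly oversampling the neighbours of corrupt files, which is usually negligible when few samples are affected.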