Refresh root to latest checkpoint (~epoch 18, batch624988)

	---
	license: other
	license_name: ntu-s-lab-license-1.0
	license_link: https://github.com/hzxie/DynamicVLA/blob/master/LICENSE
	tags:
	- robotics
	- vision-language-action
	- vla
	- dynamicvla
	- flow-matching
	- manipulation
	datasets:
	- hzxie/DOM
	---

	# DynamicVLA — DOM (full fine-tune checkpoint)

	A [DynamicVLA](https://github.com/hzxie/DynamicVLA) policy trained on the DOM dataset
	([hzxie/DOM](https://huggingface.co/datasets/hzxie/DOM)) for dynamic-object manipulation.

	> ⚠️ Mid-training checkpoint (~epoch 18, train loss ≈ 0.0007–0.003). It is self-contained and
	> eval-ready (normalization buffers are included), but optimizer and scheduler state are not,
	> so training cannot be resumed from this file with the original optimizer momentum.

	## Files in this repo
	- `model.safetensors` + `config.json` (root) — latest checkpoint (~epoch 18, a mid-epoch step
	snapshot, refreshed as training proceeds).
	- `epoch0005/`, `epoch0010/` — clean epoch-milestone checkpoints (saved at the end of those
	epochs; load with `subfolder="epoch0005"` etc.). The folder name uses the internal
	`epoch_idx`, which the training log reports as "Epoch `epoch_idx` + 1" (e.g. `epoch0010` is the completed "Epoch 11").
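
The off-by-one between folder names and log epochs is easy to get wrong, so here is a minimal sketch of the conversion. The helper names are hypothetical (they are not part of the DynamicVLA code base), and the four-digit zero padding is inferred from the folder names listed above.

```python
import re


def log_epoch_from_folder(folder: str) -> int:
    """Map a folder such as "epoch0010" to the epoch number printed in the log."""
    match = re.fullmatch(r"epoch(\d{4})", folder)
    if match is None:
        raise ValueError(f"unexpected checkpoint folder name: {folder!r}")
    epoch_idx = int(match.group(1))  # internal index used in the folder name
    return epoch_idx + 1             # the log prints "Epoch epoch_idx + 1"


def folder_from_log_epoch(log_epoch: int) -> str:
    """Inverse mapping: the log's "Epoch 11" lives in folder "epoch0010"."""
    if log_epoch < 1:
        raise ValueError("log epochs start at 1")
    return f"epoch{log_epoch - 1:04d}"


print(log_epoch_from_folder("epoch0010"))  # 11
print(folder_from_log_epoch(6))            # epoch0005
```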

	## Model

	- Architecture: DynamicVLA = `SmolLM2-360M` VLM backbone (16 layers) + FastViT vision encoder
	+ flow-matching action expert (cross-attention bridge, temporal-attention fusion).
	- Full fine-tune: vision, text, and connector are unfrozen (`freeze_* = False`) → all
	430M parameters trainable (the stock config freezes the backbone and trains only ~99M).
	- Action chunk / horizon 20, 2 observation steps, delta actions, images padded to 384×384,
	cameras `opst_cam` + `wrist_cam`.

	## Training

	- Hardware: 8× NVIDIA H200.
	- Effective global batch 1280 = 80/GPU × 8 GPUs × grad-accum 2 (matches the paper's
	effective batch; the paper used 32× A100 × 40/GPU = 1280).
	- AdamW, lr 1e-4, betas (0.9, 0.95), wd 1e-10, cosine schedule + 1000 warmup steps.
	- This run does only the paper's mid-training stage on DOM (no COYO vision-language
	pre-training, no real-robot post-training), from off-the-shelf SmolLM2-360M + FastVLM-0.5B init.
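
The batch arithmetic and the learning-rate schedule above can be checked with a small pure-Python sketch. It assumes linear warmup and cosine decay to zero, which is one common reading of "cosine schedule + 1000 warmup steps"; the exact schedule shape and the total number of training steps are not stated here, so `total_steps` and the function names are hypothetical.

```python
import math


def effective_batch(per_gpu: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return per_gpu * num_gpus * grad_accum


def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-4,
               warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay (assumed shape, see lead-in)."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# This run: 80 per GPU x 8 GPUs x grad-accum 2; the paper: 40 per GPU x 32 GPUs.
print(effective_batch(80, 8, grad_accum=2))  # 1280
print(effective_batch(40, 32))               # 1280
```

With gradient accumulation, each optimizer step sees the same number of samples as the paper's setup, at the cost of two forward/backward passes per step.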

	## Load / evaluate

	```python
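	# Assumes the DynamicVLA repository root is on PYTHONPATH so that `policies` is importable.
	from policies.dynamicvla.modeling_dynamicvla import DynamicVLAPolicy
	policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM")  # root: latest (~epoch 18)
	# Alternative: load an epoch-milestone checkpoint from its subfolder.
	# policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM", subfolder="epoch0010")
	policy.eval().cuda()  # switch to inference mode and move the model to the GPU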
	```

	`from_pretrained` restores the normalization buffers from `model.safetensors`, so no dataset is
	needed to load the model or run inference. For the DOM benchmark, serve the policy with
	`scripts/inference.py -p <dir>` against the Isaac Lab `simulations/evaluate.py` eval server.
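
To obtain a local directory for `-p <dir>`, one option is to fetch only the checkpoint files with `huggingface_hub.snapshot_download`. This is a sketch, not a documented workflow of this repo: the helper names are hypothetical, it assumes each epoch folder holds the same two files as the root, and whether `scripts/inference.py` accepts the resulting directory as-is should be checked against that script.

```python
import os
from typing import List, Optional

REPO_ID = "mickeykang/dynamic-vla-DOM"


def checkpoint_patterns(subfolder: Optional[str] = None) -> List[str]:
    """File patterns for one checkpoint (root if subfolder is None)."""
    files = ["model.safetensors", "config.json"]
    if subfolder is None:
        return files
    # Assumption: milestone folders mirror the root layout.
    return [f"{subfolder}/{name}" for name in files]


def download_checkpoint(local_dir: str, subfolder: Optional[str] = None) -> str:
    """Download one checkpoint and return the directory that contains it."""
    # Imported lazily so checkpoint_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=checkpoint_patterns(subfolder),
        local_dir=local_dir,
    )
    return root if subfolder is None else os.path.join(root, subfolder)


if __name__ == "__main__":
    print(checkpoint_patterns("epoch0010"))
```

Calling `download_checkpoint("./ckpt", subfolder="epoch0010")` would then return a path that could be passed to the inference script, under the assumptions above.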

	## Notes
	- DOM contains some corrupt/truncated videos; a local `utils/datasets.py` resilience patch
	(substitute a valid sample on any decode error) is needed to train on the full set, but not
	to load/eval this checkpoint.
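
The "substitute a valid sample on any decode error" idea can be sketched as a thin wrapper around any indexable dataset. This is an illustration of the technique only, not the actual `utils/datasets.py` patch: the class name, the retry limit, and the choice of a random replacement index are all assumptions.

```python
import random


class ResilientDataset:
    """Wrap a dataset and replace samples that fail to load with another sample."""

    def __init__(self, dataset, max_retries: int = 10, seed: int = 0):
        self.dataset = dataset
        self.max_retries = max_retries
        self.rng = random.Random(seed)

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: int):
        # Out-of-range indices are a caller error, not a corrupt sample.
        if not 0 <= index < len(self.dataset):
            raise IndexError(index)
        candidate = index
        last_error = None
        for _ in range(self.max_retries + 1):
            try:
                return self.dataset[candidate]
            except Exception as err:  # e.g. a video decode failure
                last_error = err
                # Substitute a randomly chosen sample and try again.
                candidate = self.rng.randrange(len(self.dataset))
        raise RuntimeError(
            f"no valid sample found after {self.max_retries} retries"
        ) from last_error
```

Substituting samples slightly changes the sampling distribution, so logging how often the fallback triggers is a reasonable addition when training on the full set.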