DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
Paper: 2601.22153
DynamicVLA is a vision-language-action (VLA) model for dynamic object manipulation. It is designed for dynamic scenes that require fast perception, temporal anticipation, and continuous control.
The model is trained and evaluated with the official DynamicVLA codebase. For full setup, training, and benchmarking instructions, refer to the repository README; the short version for training and inference/evaluation follows.
From the PROJECT_ROOT/dynamic-vla directory, launch training with:

```bash
torchrun --nnodes=1 --nproc_per_node=8 --standalone run.py \
    -c configs/dynamicvla.yaml \
    -d hzxie/DOM
```
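Set --nproc_per_node to the number of GPUs on the node; -c selects the training config and -d, presumably, the dataset (here hzxie/DOM). If launching from Python is more convenient than the torchrun CLI, torch.distributed.run (the module torchrun wraps) accepts the same arguments. A minimal sketch of the equivalent single-node launch:

```python
# Equivalent launch from Python: torchrun is a thin CLI wrapper
# around torch.distributed.run.
from torch.distributed.run import main as torchrun_main

torchrun_main([
    "--nnodes=1", "--nproc_per_node=8", "--standalone",
    "run.py", "-c", "configs/dynamicvla.yaml", "-d", "hzxie/DOM",
])
```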
Evaluation runs as two processes: an evaluation server that drives the simulated scenes, and a policy inference process that serves actions from a trained checkpoint.

```bash
# 1. start the evaluation server
python3 simulations/evaluate.py \
    --scene_dir ../scenes \
    --output_dir ../output/evaluation \
    --env_cfg ../test-envs.txt \
    --enable_cameras --headless -n 20 --save

# 2. run policy inference
python3 scripts/inference.py \
    -p /path/to/vla-checkpoint \
    -r euler -d -s
```
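Conceptually, the server exposes the simulated scenes while the inference process closes the control loop, querying the checkpoint for an action at every step. The sketch below only illustrates that observe-infer-step pattern; every name in it (DummyEnv, DummyPolicy, control_loop, the 7-dim action) is a hypothetical stand-in, not the actual interface of simulations/evaluate.py or scripts/inference.py.

```python
import numpy as np

class DummyEnv:
    """Hypothetical stand-in for a simulated scene driven by simulations/evaluate.py."""
    def reset(self):
        return {"rgb": np.zeros((224, 224, 3), dtype=np.uint8)}

    def step(self, action):
        obs = {"rgb": np.zeros((224, 224, 3), dtype=np.uint8)}
        return obs, False  # (next observation, episode done)

class DummyPolicy:
    """Hypothetical stand-in for the VLA checkpoint loaded by scripts/inference.py."""
    def act(self, rgb, instruction):
        # A real policy would run fast perception and anticipation here.
        return np.zeros(7)  # e.g., end-effector deltas + gripper command

def control_loop(env, policy, instruction, max_steps=200):
    """Closed-loop evaluation: observe, infer an action, step the simulator."""
    obs = env.reset()
    for _ in range(max_steps):
        action = policy.act(obs["rgb"], instruction)
        obs, done = env.step(action)
        if done:
            break

control_loop(DummyEnv(), DummyPolicy(), "catch the rolling ball")
```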
Base model: HuggingFaceTB/SmolLM2-360M
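The base model listed above is a small language model. If you want to inspect it on its own, independent of the VLA checkpoint, it loads with the standard transformers API:

```python
# Load the SmolLM2-360M base model by itself (not the DynamicVLA policy)
# using the standard transformers API, e.g. to inspect the backbone.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-360M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("A robot arm tracking a rolling ball must", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```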