DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
Paper: 2601.22153
DynamicVLA is a vision-language-action (VLA) model for dynamic object manipulation. It is designed for dynamic scenes that require fast perception, temporal anticipation, and continuous control.
The model is trained and evaluated with the official DynamicVLA codebase. For full setup, training, and benchmarking instructions, refer to the repository README; the short version for training and inference/evaluation follows.
From the PROJECT_ROOT/dynamic-vla directory, launch training with:

```bash
torchrun --nnodes=1 --nproc_per_node=8 --standalone run.py \
    -c configs/dynamicvla.yaml \
    -d hzxie/DOM
```
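Set --nproc_per_node to the number of GPUs on the node; -c selects the training config and -d, presumably, the dataset (here hzxie/DOM). If launching from Python is more convenient than the torchrun CLI, torch.distributed.run (the module torchrun wraps) accepts the same arguments. A minimal sketch of the equivalent single-node launch:

```python
# Equivalent launch from Python: torchrun is a thin CLI wrapper
# around torch.distributed.run.
from torch.distributed.run import main as torchrun_main

torchrun_main([
    "--nnodes=1", "--nproc_per_node=8", "--standalone",
    "run.py", "-c", "configs/dynamicvla.yaml", "-d", "hzxie/DOM",
])
```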
Evaluation runs as two processes: an evaluation server that drives the simulated scenes, and a policy inference process that serves actions from a trained checkpoint.

```bash
# 1. start the evaluation server
python3 simulations/evaluate.py \
    --scene_dir ../scenes \
    --output_dir ../output/evaluation \
    --env_cfg ../test-envs.txt \
    --enable_cameras --headless -n 20 --save

# 2. run policy inference
python3 scripts/inference.py \
    -p /path/to/vla-checkpoint \
    -r euler -d -s
```
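Conceptually, the server exposes the simulated scenes while the inference process closes the control loop, querying the checkpoint for an action at every step. The sketch below only illustrates that observe-infer-step pattern; every name in it (DummyEnv, DummyPolicy, control_loop, the 7-dim action) is a hypothetical stand-in, not the actual interface of simulations/evaluate.py or scripts/inference.py.

```python
import numpy as np

class DummyEnv:
    """Hypothetical stand-in for a simulated scene driven by simulations/evaluate.py."""
    def reset(self):
        return {"rgb": np.zeros((224, 224, 3), dtype=np.uint8)}

    def step(self, action):
        obs = {"rgb": np.zeros((224, 224, 3), dtype=np.uint8)}
        return obs, False  # (next observation, episode done)

class DummyPolicy:
    """Hypothetical stand-in for the VLA checkpoint loaded by scripts/inference.py."""
    def act(self, rgb, instruction):
        # A real policy would run fast perception and anticipation here.
        return np.zeros(7)  # e.g., end-effector deltas + gripper command

def control_loop(env, policy, instruction, max_steps=200):
    """Closed-loop evaluation: observe, infer an action, step the simulator."""
    obs = env.reset()
    for _ in range(max_steps):
        action = policy.act(obs["rgb"], instruction)
        obs, done = env.step(action)
        if done:
            break

control_loop(DummyEnv(), DummyPolicy(), "catch the rolling ball")
```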
Base model: HuggingFaceTB/SmolLM2-360M
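The base model listed above is a small language model. If you want to inspect it on its own, independent of the VLA checkpoint, it loads with the standard transformers API:

```python
# Load the SmolLM2-360M base model by itself (not the DynamicVLA policy)
# using the standard transformers API, e.g. to inspect the backbone.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-360M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("A robot arm tracking a rolling ball must", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```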