# OpenVLA-OFT -- color_object Checkpoint

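This repository packages the `color_object` checkpoint of OpenVLA-OFT, the method introduced in *Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success*, together with the code used to fine-tune and evaluate it.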

Paper: https://arxiv.org/abs/2502.19645
Project: https://openvla-oft.github.io/

## Repository Structure

```
checkpoints/
  color_object/
    model-0000{1..4}-of-00004.safetensors   # merged LLM weights (step 50000)
    action_head--50000_checkpoint.pt        # MLP action head
    proprio_projector--50000_checkpoint.pt  # projects proprioceptive state into the LLM embedding space
    config.json / tokenizer* / ...          # model config and tokenizer files
    lora_adapter/
      adapter_model.safetensors             # LoRA adapter weights
      adapter_config.json
prismatic/         # model architecture, dataset, training code
vla-scripts/       # finetune.py, deploy.py, merge_lora_weights_and_save.py
experiments/       # eval scripts for LIBERO, ALOHA
slurm_scripts/     # SLURM finetune scripts for all conflict splits
finetune_color_object.sh   # exact script used to produce the checkpoint
finetune.md        # step-by-step fine-tuning guide
SETUP.md / LIBERO.md / ALOHA.md
```
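
Before loading the model, it can help to confirm that the download is complete. The following is a minimal sketch, not part of the repository: `EXPECTED_FILES` and `missing_checkpoint_files` are hypothetical names, and the file list is copied from the tree above (tokenizer files are omitted because their exact names are not listed).

```python
from pathlib import Path

# File names taken from the repository tree; tokenizer files are left out
# because the tree only lists them as "tokenizer*".
EXPECTED_FILES = [
    *(f"model-{i:05d}-of-00004.safetensors" for i in range(1, 5)),
    "action_head--50000_checkpoint.pt",
    "proprio_projector--50000_checkpoint.pt",
    "config.json",
    "lora_adapter/adapter_model.safetensors",
    "lora_adapter/adapter_config.json",
]


def missing_checkpoint_files(checkpoint_dir):
    """Return the expected files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_checkpoint_files("checkpoints/color_object")` returns an empty list when every listed file is present.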

## Quick Inference

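The snippet below loads the VLA, its processor, the action head, and the proprio projector, then predicts an action chunk for a single observation. See `finetune.md` for the full loading example.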

```python
from experiments.robot.libero.run_libero_eval import GenerateConfig
from experiments.robot.openvla_utils import (
    get_action_head,
    get_processor,
    get_proprio_projector,
    get_vla,
    get_vla_action,
)
from prismatic.vla.constants import NUM_ACTIONS_CHUNK, PROPRIO_DIM

# Inference settings for the color_object checkpoint
cfg = GenerateConfig(
    pretrained_checkpoint="checkpoints/color_object",
    use_l1_regression=True,  # continuous actions from the MLP action head
    use_film=False,
    num_images_in_input=2,
    use_proprio=True,
    center_crop=True,
    num_open_loop_steps=NUM_ACTIONS_CHUNK,
    unnorm_key="conflict_maniskill",  # key for the action un-normalization statistics
)

# Load the model and its auxiliary modules
vla = get_vla(cfg)
processor = get_processor(cfg)
action_head = get_action_head(cfg, llm_dim=vla.llm_dim)
proprio_projector = get_proprio_projector(cfg, llm_dim=vla.llm_dim, proprio_dim=PROPRIO_DIM)

# `observation` is not defined here: supply a dict with the camera images, the
# proprioceptive state, and a "task_description" string (see `finetune.md`).
actions = get_vla_action(
    cfg, vla, processor, observation, observation["task_description"],
    action_head, proprio_projector,
)
```
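
The config above sets `num_open_loop_steps=NUM_ACTIONS_CHUNK`, which suggests that every action in a predicted chunk is executed before the model is queried again. A minimal sketch of that control loop follows, assuming the policy returns a sequence of actions per query. `run_open_loop`, `query_policy`, and `step_env` are hypothetical names introduced for illustration and are not part of this repository; `query_policy` stands in for a wrapper around `get_vla_action`.

```python
from collections import deque


def run_open_loop(query_policy, step_env, observation, max_steps, num_open_loop_steps):
    """Execute action chunks open-loop, re-querying only when the queue is empty.

    query_policy(observation) -> sequence of actions (hypothetical policy wrapper)
    step_env(action) -> (next_observation, done) (hypothetical environment interface)
    """
    action_queue = deque()
    num_queries = 0
    for _ in range(max_steps):
        if not action_queue:
            chunk = query_policy(observation)
            num_queries += 1
            # Keep at most num_open_loop_steps actions from each predicted chunk.
            action_queue.extend(list(chunk)[:num_open_loop_steps])
        observation, done = step_env(action_queue.popleft())
        if done:
            break
    return observation, num_queries
```

Executing a full chunk per query reduces the number of forward passes per episode, at the cost of reacting to new observations only once per chunk.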

## Fine-tuning

See `finetune.md` for the complete fine-tuning guide.

## Citation

```bibtex
@article{kim2025openvlaoft,
  title   = {Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author  = {Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
  journal = {arXiv preprint arXiv:2502.19645},
  year    = {2025}
}
```