---
base_model: openvla/openvla-7b
library_name: peft
license: mit
tags:
- openvla
- vla
- robotics
- lora
- bridgedata-v2
datasets:
- bridge_orig
---

# OpenVLA-7B + BridgeData V2 LoRA adapter

LoRA adapter (rank 32) fine-tuned on top of [`openvla/openvla-7b`](https://huggingface.co/openvla/openvla-7b) on the **BridgeData V2** dataset (`bridge_orig` from the official Bridge V2 project website), following the standard LoRA fine-tune recipe in the [OpenVLA repo](https://github.com/openvla/openvla).

## Files

- `adapter_model.safetensors`: LoRA weights (~463 MB)
- `adapter_config.json`: PEFT config (`r=32`, `alpha=16`, `dropout=0.0`)
- `dataset_statistics.json`: bridge_orig action normalization stats (needed by `predict_action(unnorm_key="bridge_orig")`)

## Training setup

| Setting | Value |
|---|---|
| Base model | `openvla/openvla-7b` |
| Dataset | `bridge_orig` (BridgeData V2, project-website version) |
| LoRA rank | 32 |
| LoRA alpha | 16 |
| LoRA dropout | 0.0 |
| Target modules | all q/k/v/o + MLP projections + lm_head (PEFT auto-mapping) |
| Batch size | 16 per GPU |
| Grad accumulation | 1 |
| Effective batch | 16 × 8 GPUs = 128 |
| Learning rate | 5e-4 |
| Image augmentation | enabled (random resized crop, scale ≈ 0.9) |
| Hardware | 8× NVIDIA A100-SXM4-80GB |
| Steps | 195,000 gradient steps (≈ 2.5 × 10⁷ transitions) |
| Precision | bf16, FlashAttention-2 |

Training command (script: `vla-scripts/finetune.py`):

```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir \
  --dataset_name bridge_orig \
  --run_root_dir runs --adapter_tmp_dir adapter-tmp \
  --lora_rank 32 --batch_size 16 --grad_accumulation_steps 1 \
  --learning_rate 5e-4 --image_aug True \
  --save_steps 5000 --max_steps 200000
```

## Quick offline evaluation

On 98 frames sampled from the bridge_orig **val** split (3 episodes, open-loop teacher forcing, no simulator), the per-dimension mean absolute error (MAE) was:

| dim | dx | dy | dz | dRoll | dPitch | dYaw | gripper |
|---|---|---|---|---|---|---|---|
| MAE | 0.004 | 0.007 | 0.007 | 0.033 | 0.041 | 0.040 | 0.053 |

For context, bridge_orig action `q99` magnitudes are roughly `3e-2` for translation, `0.1–0.2` for rotation, and `{0,1}` for gripper. This is **single-step open-loop accuracy**, not closed-loop task success.

## Usage

```python
import json

import torch
from huggingface_hub import hf_hub_download
from peft import PeftModel
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
base = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).to("cuda")
vla = PeftModel.from_pretrained(base, "RalphFH/openvla-7b")

# Load action normalization statistics for predict_action.
# Set them on `base`, the underlying model whose predict_action reads norm_stats.
stats_path = hf_hub_download("RalphFH/openvla-7b", "dataset_statistics.json")
with open(stats_path) as f:
    base.norm_stats = json.load(f)

img = Image.open("some_observation.png").convert("RGB")
prompt = "In: What action should the robot take to pick up the carrot?\nOut:"
inputs = processor(prompt, img).to("cuda", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7-D: [dx, dy, dz, dRoll, dPitch, dYaw, gripper]
```

If you prefer to run without the adapter wrapper at inference, you can merge the LoRA weights into the base model first with `vla = vla.merge_and_unload()`.

## License

MIT (matches OpenVLA upstream).
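
## Appendix: computing per-dimension MAE

The per-dimension MAE reported under "Quick offline evaluation" is the mean absolute difference between predicted and ground-truth actions, taken separately for each of the 7 action dimensions. The snippet below is a minimal sketch of that computation, not the script used to produce the numbers above: the function name `per_dim_mae` and the toy action values are hypothetical, and it assumes actions are already unnormalized 7-D vectors in the order `[dx, dy, dz, dRoll, dPitch, dYaw, gripper]`.

```python
from typing import Dict, List, Sequence

# Action dimension order used by predict_action in the Usage section.
ACTION_DIMS: List[str] = ["dx", "dy", "dz", "dRoll", "dPitch", "dYaw", "gripper"]


def per_dim_mae(
    predicted: Sequence[Sequence[float]],
    ground_truth: Sequence[Sequence[float]],
) -> Dict[str, float]:
    """Mean absolute error per action dimension over a set of frames.

    Hypothetical helper for illustration; each row is one 7-D action.
    """
    if len(predicted) == 0 or len(predicted) != len(ground_truth):
        raise ValueError("need the same non-zero number of predicted and ground-truth frames")

    totals = [0.0] * len(ACTION_DIMS)
    for pred_row, true_row in zip(predicted, ground_truth):
        if len(pred_row) != len(ACTION_DIMS) or len(true_row) != len(ACTION_DIMS):
            raise ValueError("every action must have exactly 7 dimensions")
        for i, (p, t) in enumerate(zip(pred_row, true_row)):
            totals[i] += abs(p - t)

    n_frames = len(predicted)
    return {name: total / n_frames for name, total in zip(ACTION_DIMS, totals)}


# Toy data (made up for illustration, not from bridge_orig).
toy_pred = [
    [0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.03, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
toy_true = [
    [0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
toy_mae = per_dim_mae(toy_pred, toy_true)
for name in ACTION_DIMS:
    print(f"{name}: {toy_mae[name]:.3f}")
```

In a real evaluation, each predicted row would come from a `predict_action` call on one validation frame and each ground-truth row from the logged action for that frame. Because the dimensions have very different scales, it helps to read each MAE against the corresponding `q99` magnitude rather than comparing dimensions directly.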