---
base_model: openvla/openvla-7b
library_name: peft
license: mit
tags:
- openvla
- vla
- robotics
- lora
- bridgedata-v2
datasets:
- bridge_orig
---

# OpenVLA-7B + BridgeData V2 LoRA adapter

LoRA adapter (rank 32) fine-tuned on top of [`openvla/openvla-7b`](https://huggingface.co/openvla/openvla-7b)
on the **BridgeData V2** dataset (`bridge_orig` from the official Bridge V2 project website),
following the standard LoRA fine-tune recipe in the [OpenVLA repo](https://github.com/openvla/openvla).

## Files

- `adapter_model.safetensors`: LoRA weights (~463 MB)
- `adapter_config.json`: PEFT config (`r=32`, `alpha=16`, `dropout=0.0`)
- `dataset_statistics.json`: bridge_orig action normalization stats (needed by `predict_action(unnorm_key="bridge_orig")`)
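
For intuition about why the statistics file matters: the model emits actions in a normalized range, and `predict_action` rescales them using the per-dimension bounds stored under the chosen `unnorm_key`. The sketch below assumes the OpenVLA convention of mapping `[-1, 1]` back onto the `[q01, q99]` interval and passing masked-out dimensions (typically the gripper) through unchanged. The function name `unnormalize_action` and every number in `stats` are illustrative placeholders, not values read from this repo's `dataset_statistics.json`.

```python
def unnormalize_action(normalized, q01, q99, mask):
    """Map normalized actions in [-1, 1] back to dataset units.

    Assumed convention: dimensions with mask=True are rescaled onto
    [q01, q99]; dimensions with mask=False are returned unchanged.
    """
    out = []
    for a, lo, hi, m in zip(normalized, q01, q99, mask):
        out.append(0.5 * (a + 1.0) * (hi - lo) + lo if m else a)
    return out


# Made-up statistics for a 7-D action, for illustration only.
stats = {
    "q01": [-0.03, -0.03, -0.03, -0.1, -0.1, -0.1, 0.0],
    "q99": [0.03, 0.03, 0.03, 0.1, 0.1, 0.1, 1.0],
    "mask": [True, True, True, True, True, True, False],
}

normalized = [1.0, -1.0, 0.0, 0.5, -0.5, 0.0, 1.0]
action = unnormalize_action(normalized, stats["q01"], stats["q99"], stats["mask"])
print(action)
```

With a convention like this, statistics from a different dataset would silently rescale the same network output to different physical deltas, which is why the matching file is shipped with the adapter.
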

## Training setup

| Setting | Value |
|---|---|
| Base model | `openvla/openvla-7b` |
| Dataset | `bridge_orig` (BridgeData V2, project-website version) |
| LoRA rank | 32 |
| LoRA alpha | 16 |
| LoRA dropout | 0.0 |
| Target modules | all q/k/v/o + MLP projections + lm_head (PEFT auto-mapping) |
| Batch size | 16 per GPU |
| Grad accumulation | 1 |
| Effective batch | 16 × 8 GPUs = 128 |
| Learning rate | 5e-4 |
| Image augmentation | enabled (random resized crop, scale ≈ 0.9) |
| Hardware | 8× NVIDIA A100-SXM4-80GB |
| Steps | 195,000 gradient steps (≈ 2.5 × 10⁷ transitions) |
| Precision | bf16, FlashAttention-2 |
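
The effective batch size and the transition count in the table follow from the other rows. A quick check, using only the numbers listed above (the variable names are just for illustration):

```python
per_gpu_batch = 16
num_gpus = 8
grad_accumulation = 1
steps = 195_000

# Samples consumed per optimizer step across all GPUs.
effective_batch = per_gpu_batch * num_gpus * grad_accumulation

# Total training samples drawn over the run (repeats included).
transitions_seen = steps * effective_batch

print(effective_batch)   # 128
print(transitions_seen)  # 24960000, i.e. about 2.5e7
```
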

Training command (script: `vla-scripts/finetune.py`):

```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir <path-to-rlds-data> \
  --dataset_name bridge_orig \
  --run_root_dir runs --adapter_tmp_dir adapter-tmp \
  --lora_rank 32 --batch_size 16 --grad_accumulation_steps 1 \
  --learning_rate 5e-4 --image_aug True \
  --save_steps 5000 --max_steps 200000
```

## Quick offline evaluation

On 98 frames sampled from the bridge_orig **val** split (3 episodes, open-loop teacher forcing, no simulator), per-dimension MAE was:

| dim | dx | dy | dz | dRoll | dPitch | dYaw | gripper |
|---|---|---|---|---|---|---|---|
| MAE | 0.004 | 0.007 | 0.007 | 0.033 | 0.041 | 0.040 | 0.053 |

For context, bridge_orig action `q99` magnitudes are roughly `3e-2` for translation and `0.1–0.2` for rotation, and the gripper takes values in `{0,1}`. This is **single-step open-loop accuracy**, not closed-loop task success.
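
As a reminder of what the metric measures, per-dimension MAE averages the absolute difference between predicted and ground-truth actions over frames, separately for each action dimension. The sketch below is a minimal illustration on toy data. `per_dim_mae` is a hypothetical helper, not the script used to produce the table above.

```python
def per_dim_mae(predicted, target):
    """Mean absolute error per action dimension over a list of frames."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need equally sized, non-empty lists of frames")
    dims = len(predicted[0])
    totals = [0.0] * dims
    for p, t in zip(predicted, target):
        for d in range(dims):
            totals[d] += abs(p[d] - t[d])
    return [total / len(predicted) for total in totals]


# Two toy frames with 3-D actions (illustrative values, not real data).
predicted = [[0.5, 0.0, 1.0], [0.25, -0.5, 0.0]]
target = [[0.25, 0.0, 1.0], [0.25, 0.5, 1.0]]
mae = per_dim_mae(predicted, target)
print(mae)  # [0.125, 0.5, 0.5]
```
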

## Usage

```python
import json

import torch
from huggingface_hub import hf_hub_download
from peft import PeftModel
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
base = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).to("cuda")
vla = PeftModel.from_pretrained(base, "RalphFH/openvla-7b")

# Load action normalization statistics for predict_action.
# Assign them on `base` (the model predict_action actually runs on) rather
# than on the PeftModel wrapper, where the attribute would not be seen.
stats_path = hf_hub_download("RalphFH/openvla-7b", "dataset_statistics.json")
with open(stats_path) as f:
    base.norm_stats = json.load(f)

img = Image.open("some_observation.png").convert("RGB")
prompt = "In: What action should the robot take to pick up the carrot?\nOut:"
inputs = processor(prompt, img).to("cuda", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7-D: [dx, dy, dz, dRoll, dPitch, dYaw, gripper]
```

If you would rather not keep the LoRA layers separate at inference, call `vla = vla.merge_and_unload()` first. It folds the adapter weights into the base model and returns the merged model.
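
Conceptually, merging replaces each adapted weight `W` with `W + (alpha / r) * B @ A`, so the merged layer gives the same output as the base layer plus the scaled adapter path. The pure-Python sketch below checks this on a tiny made-up layer. It uses rank 2 with alpha 1, which gives the same 0.5 scaling as this adapter's `r=32`, `alpha=16`. The helper names are hypothetical, and PEFT's real implementation also handles details such as dtypes and removing the wrapper modules.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(w, x):
    """Matrix-vector product."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Fold the low-rank update into the base weight: W + (alpha/rank) * B @ A."""
    scaling = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scaling * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Toy layer: 3 inputs, 2 outputs, LoRA rank 2 (all values are made up).
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # rank x in_features
lora_b = [[2.0, 0.0], [0.0, 4.0]]            # out_features x rank
alpha, rank = 1.0, 2
x = [1.0, 1.0, 1.0]

# Unmerged path: base output plus scaled adapter output.
adapter_out = matvec(lora_b, matvec(lora_a, x))
unmerged = [b + (alpha / rank) * a for b, a in zip(matvec(w, x), adapter_out)]

# Merged path: a single matrix-vector product with the folded weight.
merged = matvec(merge_lora(w, lora_a, lora_b, alpha, rank), x)

print(unmerged, merged)  # [4.0, 5.0] [4.0, 5.0]
```
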

## License

MIT (matches OpenVLA upstream).