	---
	base_model: openvla/openvla-7b
	library_name: peft
	license: mit
	tags:
	- openvla
	- vla
	- robotics
	- lora
	- bridgedata-v2
	datasets:
	- bridge_orig
	---

	# OpenVLA-7B + BridgeData V2 LoRA adapter

	LoRA adapter (rank 32) fine-tuned on top of [`openvla/openvla-7b`](https://huggingface.co/openvla/openvla-7b)
	on the BridgeData V2 dataset (`bridge_orig` from the official Bridge V2 project website),
	following the standard LoRA fine-tune recipe in the [OpenVLA repo](https://github.com/openvla/openvla).

	## Files

	- `adapter_model.safetensors` — LoRA weights (~463 MB)
	- `adapter_config.json` — PEFT config (`r=32`, `alpha=16`, `dropout=0.0`)
	- `dataset_statistics.json` — bridge_orig action normalization stats (needed by `predict_action(unnorm_key="bridge_orig")`)
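
To illustrate why `dataset_statistics.json` is needed: the model predicts actions in a normalized `[-1, 1]` range, and `predict_action` maps them back to robot units using per-dimension statistics. The sketch below shows the quantile-based mapping as it appears to be implemented in the OpenVLA inference code (`q01`/`q99` bounds plus a mask for dimensions that are passed through unchanged, typically the gripper). The function name `unnormalize_action` and all numeric values are hypothetical, chosen for illustration, and are not read from the shipped file.

```python
def unnormalize_action(normalized, q01, q99, mask):
    """Map actions from [-1, 1] back to dataset units, dimension by dimension.

    Dimensions whose mask entry is False are returned unchanged.
    """
    return [
        0.5 * (a + 1.0) * (hi - lo) + lo if m else a
        for a, lo, hi, m in zip(normalized, q01, q99, mask)
    ]


# Illustrative values only (NOT the real bridge_orig statistics).
q01 = [-0.03, -0.03, -0.03, -0.1, -0.1, -0.1, 0.0]
q99 = [0.03, 0.03, 0.03, 0.1, 0.1, 0.1, 1.0]
mask = [True, True, True, True, True, True, False]

normalized = [0.0, 1.0, -1.0, 0.5, -0.5, 0.0, 1.0]
action = unnormalize_action(normalized, q01, q99, mask)
print(action)
```

A normalized value of `-1` lands on `q01`, `+1` lands on `q99`, and `0` lands on their midpoint, so loading the wrong statistics silently rescales every predicted action.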

	## Training setup

	| Setting | Value |
	|---|---|
	| Base model | `openvla/openvla-7b` |
	| Dataset | `bridge_orig` (BridgeData V2, project-website version) |
	| LoRA rank | 32 |
	| LoRA alpha | 16 |
	| LoRA dropout | 0.0 |
	| Target modules | all q/k/v/o + MLP projections + lm_head (PEFT auto-mapping) |
	| Batch size | 16 per GPU |
	| Grad accumulation | 1 |
	| Effective batch | 16 × 8 GPUs = 128 |
	| Learning rate | 5e-4 |
	| Image augmentation | enabled (random resized crop, scale ≈ 0.9) |
	| Hardware | 8× NVIDIA A100-SXM4-80GB |
	| Steps | 195,000 gradient steps (≈ 2.5 × 10⁷ transitions) |
	| Precision | bf16, FlashAttention-2 |
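
The derived numbers in the table follow from simple arithmetic, sketched below as a sanity check. The variable names are illustrative, and the LoRA scale assumes the standard `alpha / r` scaling (not the rank-stabilized variant).

```python
per_gpu_batch = 16
num_gpus = 8
grad_accumulation = 1
gradient_steps = 195_000

# Samples consumed per optimizer step across all GPUs.
effective_batch = per_gpu_batch * num_gpus * grad_accumulation

# Total transitions drawn during training (counts repeats across epochs).
transitions_seen = gradient_steps * effective_batch

lora_rank = 32
lora_alpha = 16
# Assumed standard LoRA scaling factor applied to the low-rank update.
lora_scale = lora_alpha / lora_rank

print(effective_batch)   # 128
print(transitions_seen)  # 24960000, i.e. about 2.5e7
print(lora_scale)        # 0.5
```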

	Training command (script: `vla-scripts/finetune.py`):

	```bash
	torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
	    --vla_path openvla/openvla-7b \
	    --data_root_dir <path-to-rlds-data> \
	    --dataset_name bridge_orig \
	    --run_root_dir runs --adapter_tmp_dir adapter-tmp \
	    --lora_rank 32 --batch_size 16 --grad_accumulation_steps 1 \
	    --learning_rate 5e-4 --image_aug True \
	    --save_steps 5000 --max_steps 200000
	```

	## Quick offline evaluation

	On 98 frames sampled from the bridge_orig val split (3 episodes, open-loop teacher-forcing — no simulator), per-dimension MAE was:

	| dim | dx | dy | dz | dRoll | dPitch | dYaw | gripper |
	|---|---|---|---|---|---|---|---|
	| MAE | 0.004 | 0.007 | 0.007 | 0.033 | 0.041 | 0.040 | 0.053 |

	For context, the `q99` action magnitudes in bridge_orig are roughly `3e-2` for translation and `0.1–0.2` for rotation, while the gripper action takes values in `{0,1}`. These numbers measure single-step open-loop accuracy, not closed-loop task success.
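
As a minimal sketch of how a per-dimension MAE of this kind can be computed, assuming predicted and ground-truth actions are available as equal-length lists of per-frame vectors: the helper name `per_dim_mae` and the toy numbers are hypothetical and are not the evaluation script or data behind the table above.

```python
def per_dim_mae(predicted, target):
    """Mean absolute error per action dimension over a list of frames."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need the same non-zero number of predicted and target actions")
    n_dims = len(target[0])
    totals = [0.0] * n_dims
    for pred, true in zip(predicted, target):
        for d in range(n_dims):
            totals[d] += abs(pred[d] - true[d])
    return [t / len(target) for t in totals]


# Toy two-frame, three-dimension example with made-up numbers.
predicted = [[0.010, 0.0, 1.0], [0.020, 0.1, 0.0]]
target = [[0.012, 0.0, 1.0], [0.016, 0.2, 1.0]]
mae = per_dim_mae(predicted, target)
print(mae)
```

Dividing each MAE by the matching `q99` magnitude gives a rough relative error. For example, the reported dx MAE of 0.004 against a translation `q99` of about `3e-2` is roughly 13% of the typical large action.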

	## Usage

	```python
	import json

	import torch
	from huggingface_hub import hf_hub_download
	from peft import PeftModel
	from PIL import Image
	from transformers import AutoModelForVision2Seq, AutoProcessor

	ADAPTER_REPO = "RalphFH/openvla-7b"

	processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
	base = AutoModelForVision2Seq.from_pretrained(
	    "openvla/openvla-7b",
	    torch_dtype=torch.bfloat16,
	    attn_implementation="flash_attention_2",  # requires the flash-attn package
	    trust_remote_code=True,
	).to("cuda")

	# Load the action normalization statistics used by predict_action.
	# They are set on the base model because predict_action runs on the
	# underlying model, not on the PeftModel wrapper.
	stats_path = hf_hub_download(ADAPTER_REPO, "dataset_statistics.json")
	with open(stats_path) as f:
	    base.norm_stats = json.load(f)

	# Attach the LoRA adapter on top of the base model.
	vla = PeftModel.from_pretrained(base, ADAPTER_REPO)

	img = Image.open("some_observation.png").convert("RGB")
	prompt = "In: What action should the robot take to pick up the carrot?\nOut:"
	inputs = processor(prompt, img).to("cuda", dtype=torch.bfloat16)
	action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
	print(action)  # 7-D: [dx, dy, dz, dRoll, dPitch, dYaw, gripper]
	```

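	To fold the LoRA weights into the base model and avoid the adapter overhead at inference, call `vla = vla.merge_and_unload()` after loading. Skip this step if you prefer to keep the adapter unmerged.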

	## License

	MIT (matches OpenVLA upstream).