---
base_model: openvla/openvla-7b
library_name: peft
license: mit
tags:
  - openvla
  - vla
  - robotics
  - lora
  - bridgedata-v2
datasets:
  - bridge_orig
---

# OpenVLA-7B + BridgeData V2 LoRA adapter

LoRA adapter (rank 32) fine-tuned on top of [`openvla/openvla-7b`](https://huggingface.co/openvla/openvla-7b)
on the **BridgeData V2** dataset (`bridge_orig` from the official Bridge V2 project website),
following the standard LoRA fine-tune recipe in the [OpenVLA repo](https://github.com/openvla/openvla).

## Files

- `adapter_model.safetensors` — LoRA weights (~463 MB)
- `adapter_config.json` — PEFT config (`r=32`, `alpha=16`, `dropout=0.0`)
- `dataset_statistics.json` — bridge_orig action normalization stats (needed by `predict_action(unnorm_key="bridge_orig")`)

## Training setup

| Setting | Value |
|---|---|
| Base model | `openvla/openvla-7b` |
| Dataset | `bridge_orig` (BridgeData V2, project-website version) |
| LoRA rank | 32 |
| LoRA alpha | 16 |
| LoRA dropout | 0.0 |
| Target modules | all q/k/v/o + MLP projections + lm_head (PEFT auto-mapping) |
| Batch size | 16 per GPU |
| Grad accumulation | 1 |
| Effective batch | 16 × 8 GPUs = 128 |
| Learning rate | 5e-4 |
| Image augmentation | enabled (random resized crop, scale ≈ 0.9) |
| Hardware | 8× NVIDIA A100-SXM4-80GB |
| Steps | 195,000 gradient steps (≈ 2.5 × 10⁷ transitions) |
| Precision | bf16, FlashAttention-2 |

Training command (script: `vla-scripts/finetune.py`):

```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir <path-to-rlds-data> \
  --dataset_name bridge_orig \
  --run_root_dir runs --adapter_tmp_dir adapter-tmp \
  --lora_rank 32 --batch_size 16 --grad_accumulation_steps 1 \
  --learning_rate 5e-4 --image_aug True \
  --save_steps 5000 --max_steps 200000
```
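
The batch and step figures in the training setup can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: the variable names are illustrative, and the `alpha / rank` scaling shown is the standard LoRA convention, not something read from the training logs.

```python
# Values copied from the training setup table and command above.
per_gpu_batch = 16
num_gpus = 8
grad_accumulation_steps = 1
gradient_steps = 195_000

# Samples consumed per optimizer step across all GPUs.
effective_batch = per_gpu_batch * num_gpus * grad_accumulation_steps

# Total training samples seen after the reported number of steps.
samples_seen = effective_batch * gradient_steps

# Standard LoRA scaling factor applied to the low-rank update: alpha / rank.
lora_rank, lora_alpha = 32, 16
lora_scale = lora_alpha / lora_rank

print(effective_batch)  # 128
print(samples_seen)     # 24960000, i.e. about 2.5e7
print(lora_scale)       # 0.5
```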

## Quick offline evaluation

On 98 frames sampled from the bridge_orig **val** split (3 episodes, open-loop teacher-forcing — no simulator), per-dimension MAE was:

| dim | dx | dy | dz | dRoll | dPitch | dYaw | gripper |
|---|---|---|---|---|---|---|---|
| MAE | 0.004 | 0.007 | 0.007 | 0.033 | 0.041 | 0.040 | 0.053 |

For context, the bridge_orig action `q99` magnitudes are roughly `3e-2` for translation and `0.1–0.2` for rotation, while the gripper action takes values in `{0,1}`. These numbers measure **single-step open-loop accuracy**, not closed-loop task success.
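
For readers who want to reproduce this kind of number, per-dimension MAE can be computed as in the sketch below. It is not the script used for the table above: `per_dim_mae` is a hypothetical helper, and the input values are made-up toy data rather than model predictions.

```python
from typing import List, Sequence

ACTION_DIMS = ["dx", "dy", "dz", "dRoll", "dPitch", "dYaw", "gripper"]


def per_dim_mae(
    predicted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> List[float]:
    """Mean absolute error per action dimension over a set of frames."""
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("predicted and target must be non-empty and equally long")
    totals = [0.0] * len(ACTION_DIMS)
    for pred_action, true_action in zip(predicted, target):
        for i in range(len(ACTION_DIMS)):
            totals[i] += abs(pred_action[i] - true_action[i])
    return [total / len(predicted) for total in totals]


# Toy data: two frames with 7-D actions (hypothetical values).
predicted = [[0.0] * 7, [1.0] * 7]
target = [[0.25] * 7, [0.5] * 7]

mae = per_dim_mae(predicted, target)
print(dict(zip(ACTION_DIMS, mae)))  # every dimension is 0.375 for these toy inputs
```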

## Usage

```python
import json

import torch
from huggingface_hub import hf_hub_download
from peft import PeftModel
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
base = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).to("cuda")

# Load the action normalization statistics and attach them to the base model:
# predict_action reads norm_stats from the underlying OpenVLA model, so an
# attribute set on the PeftModel wrapper would not be seen.
stats_path = hf_hub_download("RalphFH/openvla-7b", "dataset_statistics.json")
with open(stats_path) as f:
    base.norm_stats = json.load(f)

# Wrap the base model with the LoRA adapter.
vla = PeftModel.from_pretrained(base, "RalphFH/openvla-7b")

# Predict a single action from one RGB observation and a language instruction.
img = Image.open("some_observation.png").convert("RGB")
prompt = "In: What action should the robot take to pick up the carrot?\nOut:"
inputs = processor(prompt, img).to("cuda", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7-D: [dx, dy, dz, dRoll, dPitch, dYaw, gripper]
```
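
The `unnorm_key` argument selects which normalization statistics are used to map the model's normalized output back to robot units. The sketch below illustrates that mapping, assuming the q01/q99 linear rescaling with a per-dimension mask used in the upstream OpenVLA code. `unnormalize_action` is a hypothetical helper, and the statistics are made-up values, not the contents of `dataset_statistics.json`.

```python
from typing import List, Sequence


def unnormalize_action(
    normalized: Sequence[float],
    q01: Sequence[float],
    q99: Sequence[float],
    mask: Sequence[bool],
) -> List[float]:
    """Map actions from [-1, 1] back to [q01, q99]; unmasked dims pass through."""
    result = []
    for value, low, high, rescale in zip(normalized, q01, q99, mask):
        if rescale:
            result.append(0.5 * (value + 1.0) * (high - low) + low)
        else:
            result.append(value)
    return result


# Hypothetical statistics: translation, rotation, then an unscaled gripper dim.
q01 = [-0.03, -0.03, -0.03, -0.1, -0.1, -0.1, 0.0]
q99 = [0.03, 0.03, 0.03, 0.1, 0.1, 0.1, 1.0]
mask = [True] * 6 + [False]

normalized = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0, 1.0]
action = unnormalize_action(normalized, q01, q99, mask)
print(action)
```

With this mapping, a normalized value of `-1` lands on `q01`, `+1` lands on `q99`, and the masked-out gripper dimension is returned unchanged.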

If you prefer to fold the LoRA weights into the base model before inference (removing the adapter overhead at run time), call `vla = vla.merge_and_unload()` first.
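
Merging a LoRA adapter folds the low-rank update into each base weight, `W' = W + (alpha / r) * B @ A`, so the merged model produces the same outputs as base plus adapter. The pure-Python sketch below checks this on tiny made-up matrices. `matmul` and `merge_lora` are illustrative helpers, not PEFT functions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-2 adapter and alpha / rank = 0.5 (made-up values).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0], [0.0, 1.0]]  # r x in
lora_b = [[1.0, 0.0], [2.0, 1.0]]  # out x r
alpha, rank = 1.0, 2
x = [[1.0], [2.0]]  # input as a column vector

merged = merge_lora(weight, lora_a, lora_b, alpha, rank)
merged_out = matmul(merged, x)

# Unmerged path: base output plus the scaled adapter output.
base_out = matmul(weight, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
unmerged_out = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(base_out, adapter_out)]

print(merged_out, unmerged_out)  # both are [[3.5], [8.0]]
```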

## License

MIT (matches OpenVLA upstream).