upload LoRA-only checkpoints (steps 3500/4000/4500)


	---
	license: mit
	tags:
	- robotics
	- vision-language-action
	- lora
	- memoryvla
	base_model: openvla/openvla-7b-prismatic
	---

	# MemoryVLA — RealPushMultiT LoRA fine-tune

	LoRA-only checkpoints from a fine-tune of MemoryVLA (`siglip-224px+mx-bridge`,
	backbone `prism-dinosiglip-224px+7b`, initialised from
	`openvla/openvla-7b-prismatic` step-295000) on the `harrywang01/RealPushMultiT`
	dataset (240 demos / 341 077 timesteps).

	## Contents

	Each `step-NNNNNN-epoch-EE-loss=L.LLLL.pt` is a compact subset of the full
	training checkpoint, containing only the 40.83 M trainable parameters:

	- LoRA adapters
	  - LLaMA-2-7B (LLM backbone): r=8, α=16 on `q_proj`, `v_proj`
	  - SigLIP (vision): r=8, α=16 on fused `qkv`
	  - DiT action model: r=24, α=48 on attention `qkv` and perceiver
	    cross-attention `q`/`v`
	  - Cognitive memory bank retrieval cross-attn: r=24, α=48 on
	    `q_proj` / `k_proj` / `v_proj` (with `lora_cog_gate=True`)
	- `modules_to_save` (full small modules, trained outright)
	  - `action_model`: `x_embedder`, `t_embedder`, `z_embedder`, `final_layer`
	  - `cog_mem_bank`: `timestep_encoder`
	  - `per_mem_bank`: entire module
	  - `per_compr` (BottleneckSE): entire module
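
Because the step, epoch and loss are encoded in each filename, they can be read without opening the file. A minimal sketch, assuming the digit groups follow the `step-NNNNNN-epoch-EE-loss=L.LLLL.pt` pattern described above; the helper `parse_checkpoint_name` and the example filename are hypothetical and not taken from the repository:

```python
import re
from typing import NamedTuple


class CheckpointInfo(NamedTuple):
    step: int
    epoch: int
    loss: float


# Mirrors the step-NNNNNN-epoch-EE-loss=L.LLLL.pt naming scheme.
_PATTERN = re.compile(r"step-(\d+)-epoch-(\d+)-loss=(\d+\.\d+)\.pt$")


def parse_checkpoint_name(name: str) -> CheckpointInfo:
    """Extract (step, epoch, loss) from a checkpoint filename."""
    match = _PATTERN.search(name)
    if match is None:
        raise ValueError(f"unrecognised checkpoint name: {name!r}")
    step, epoch, loss = match.groups()
    return CheckpointInfo(int(step), int(epoch), float(loss))


# Hypothetical filename, for illustration only.
info = parse_checkpoint_name("step-003500-epoch-00-loss=0.1234.pt")
print(info.step, info.epoch, info.loss)
```

Sorting a directory listing by `parse_checkpoint_name(f).loss` is one way to pick the checkpoint with the lowest logged training loss.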

	Each file is ~163 MB (fp32). The full original checkpoint was ~33.5 GB; the
	frozen base weights (LLaMA + SigLIP + DINOv2 + projector + non-trainable
	linears) are not redistributed and must be loaded from
	`openvla/openvla-7b-prismatic`.
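
The quoted file size follows from the parameter count: 40.83 M fp32 parameters at 4 bytes each. A quick check, treating MB and GB as decimal units (an assumption; the card does not say which convention it uses):

```python
TRAINABLE_PARAMS = 40.83e6       # trainable parameter count from the card
BYTES_PER_FP32 = 4               # fp32 stores each value in 4 bytes
FULL_CHECKPOINT_BYTES = 33.5e9   # "~33.5 GB" from the card, assumed decimal GB

lora_bytes = TRAINABLE_PARAMS * BYTES_PER_FP32
reduction = FULL_CHECKPOINT_BYTES / lora_bytes

print(f"LoRA-only file: {lora_bytes / 1e6:.1f} MB")
print(f"size reduction: {reduction:.0f}x")
```

This gives about 163.3 MB per file, matching the figure above, and a reduction of roughly 205x relative to the full checkpoint.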

	File layout matches the training-time save format:

	```python
	import torch

	state = torch.load(path, map_location="cpu", weights_only=False)  # path: one of the .pt files
	# state == {"model": {"per_compr": {...}, "cog_mem_bank": {...}, ...}}
	```

	To merge back into a freshly built MemoryVLA, load the full base checkpoint
	first. Then, for each submodule, take its `state_dict()`, `update()` it with
	the matching keys from this file, and pass the result to `load_state_dict()`.
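
The merge step can be sketched as follows. This is an illustration under stated assumptions, not the project's own loader: it assumes each top-level key under `"model"` is an attribute name on the model object, and that the model was built with its LoRA wrappers so the saved keys already exist. The helper name `merge_lora_checkpoint` is hypothetical. It relies only on the standard `state_dict()` / `load_state_dict()` methods of `torch.nn.Module`.

```python
from typing import Any, Dict, Mapping


def merge_lora_checkpoint(
    model: Any, state: Mapping[str, Mapping[str, Mapping[str, Any]]]
) -> Dict[str, int]:
    """Overlay the saved tensors onto the matching submodules of `model`.

    Returns the number of keys updated per submodule.
    """
    updated: Dict[str, int] = {}
    for name, partial in state["model"].items():
        # Assumption: top-level keys ("per_compr", "cog_mem_bank", ...)
        # are attribute names on the model.
        submodule = getattr(model, name)
        full = submodule.state_dict()

        # Fail loudly if the file holds keys the model does not have,
        # e.g. because the LoRA layers were never attached.
        unknown = [key for key in partial if key not in full]
        if unknown:
            raise KeyError(f"{name}: keys missing from model: {unknown[:5]}")

        full.update(partial)             # overwrite only the trained entries
        submodule.load_state_dict(full)  # write the merged dict back
        updated[name] = len(partial)
    return updated
```

A call would look like `merge_lora_checkpoint(model, state)` after `state` has been read with `torch.load` and the base weights are already in place. Checking for unknown keys before updating avoids silently dropping adapter weights.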

	## Training

	- per_device_bs=12 × grad_accum=4 × 2 GPUs → global_bs=96
	- max_steps=60 000 (LR=3e-4, sqrt-scaled from 2e-4 @ bs=32; cosine decay after
	3 000 warmup steps)
	- save_interval=500
	- Instruction (constant per episode):
	*"Push the T-shaped block to visit three different target locations on the
	tabletop, without visiting the same target more than once"*
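
The batch-size and learning-rate figures above can be reproduced with a short sketch. The card states only "cosine decay after 3 000 warmup steps"; the linear warmup from zero and the decay to zero at `max_steps` are assumptions made here for illustration, and `lr_at` is a hypothetical helper.

```python
import math

PER_DEVICE_BS, GRAD_ACCUM, NUM_GPUS = 12, 4, 2
GLOBAL_BS = PER_DEVICE_BS * GRAD_ACCUM * NUM_GPUS  # 12 * 4 * 2 = 96

# Square-root scaling from the reference setting (2e-4 at bs=32).
# This evaluates to about 3.46e-4; the card lists LR=3e-4.
sqrt_scaled_lr = 2e-4 * math.sqrt(GLOBAL_BS / 32)

PEAK_LR = 3e-4
WARMUP_STEPS = 3_000
MAX_STEPS = 60_000


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup starting from zero.
        return PEAK_LR * step / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS), 1.0)
    # Assumption: cosine decay reaches zero at MAX_STEPS.
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (3_500, 4_000, 4_500):
    print(step, f"{lr_at(step):.3e}")
```

Under these assumptions the uploaded checkpoints (steps 3500 to 4500) fall just after warmup, where the learning rate is still close to its peak.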

	Hardware: 2× H100 80GB SXM5 (NVLink).