Wr3ck1Am's picture
upload LoRA-only checkpoints (steps 3500/4000/4500)
b2bc509 verified
metadata
license: mit
tags:
  - robotics
  - vision-language-action
  - lora
  - memoryvla
base_model: openvla/openvla-7b-prismatic

MemoryVLA — RealPushMultiT LoRA fine-tune

LoRA-only checkpoints from a fine-tune of MemoryVLA (siglip-224px+mx-bridge, backbone prism-dinosiglip-224px+7b, initialised from openvla/openvla-7b-prismatic step-295000) on the harrywang01/RealPushMultiT dataset (240 demos / 341 077 timesteps).

Contents

Each step-NNNNNN-epoch-EE-loss=L.LLLL.pt is a compact subset of the full training checkpoint, containing only the 40.83 M trainable parameters:

  • LoRA adapters
    • LLaMA-2-7B (LLM backbone): r=8, α=16 on q_proj, v_proj
    • SigLIP (vision): r=8, α=16 on fused qkv
    • DiT action model: r=24, α=48 on attention qkv and perceiver cross-attention q/v
    • Cognitive memory bank retrieval cross-attn: r=24, α=48 on q_proj / k_proj / v_proj (with lora_cog_gate=True)
  • modules_to_save (full small modules, trained outright)
    • action_model: x_embedder, t_embedder, z_embedder, final_layer
    • cog_mem_bank: timestep_encoder
    • per_mem_bank: entire module
    • per_compr (BottleneckSE): entire module

Each file is ~163 MB (fp32). The full original checkpoint was ~33.5 GB; the frozen base weights (LLaMA + SigLIP + DINOv2 + projector + non-trainable linears) are not redistributed and must be loaded from openvla/openvla-7b-prismatic.

File layout matches the training-time save format:

state = torch.load(path, map_location="cpu", weights_only=False)
# state == {"model": {"per_compr": {...}, "cog_mem_bank": {...}, ...}}

To merge back into a freshly built MemoryVLA, load the full base checkpoint first, then state_dict.update() each submodule with the matching keys from this file.

Training

  • per_device_bs=12 × grad_accum=4 × 2 GPUs → global_bs=96
  • max_steps=60 000 (LR=3e-4, sqrt-scaled from 2e-4 @ bs=32; cosine decay after 3 000 warmup steps)
  • save_interval=500
  • Instruction (constant per episode): "Push the T-shaped block to visit three different target locations on the tabletop, without visiting the same target more than once"

Hardware: 2× H100 80GB SXM5 (NVLink).