| --- |
| license: mit |
| tags: |
| - robotics |
| - vision-language-action |
| - lora |
| - memoryvla |
| base_model: openvla/openvla-7b-prismatic |
| --- |
| |
| # MemoryVLA — RealPushMultiT LoRA fine-tune |
|
|
| LoRA-only checkpoints from a fine-tune of MemoryVLA (`siglip-224px+mx-bridge`, |
| backbone `prism-dinosiglip-224px+7b`, initialised from |
| `openvla/openvla-7b-prismatic` step-295000) on the `harrywang01/RealPushMultiT` |
| dataset (240 demos / 341 077 timesteps). |
|
|
| ## Contents |
|
|
| Each `step-NNNNNN-epoch-EE-loss=L.LLLL.pt` is a compact subset of the full |
| training checkpoint, containing only the **40.83 M trainable parameters**: |
|
|
| - LoRA adapters |
| - **LLaMA-2-7B (LLM backbone)**: r=8, α=16 on `q_proj`, `v_proj` |
| - **SigLIP (vision)**: r=8, α=16 on fused `qkv` |
| - **DiT action model**: r=24, α=48 on attention `qkv` and perceiver |
| cross-attention `q`/`v` |
| - **Cognitive memory bank retrieval cross-attn**: r=24, α=48 on |
| `q_proj` / `k_proj` / `v_proj` (with `lora_cog_gate=True`) |
| - `modules_to_save` (full small modules, trained outright) |
| - `action_model`: `x_embedder`, `t_embedder`, `z_embedder`, `final_layer` |
| - `cog_mem_bank`: `timestep_encoder` |
| - `per_mem_bank`: entire module |
| - `per_compr` (BottleneckSE): entire module |
| |
| Each file is ~163 MB (fp32). The full original checkpoint was ~33.5 GB; the |
| frozen base weights (LLaMA + SigLIP + DINOv2 + projector + non-trainable |
| linears) are not redistributed and must be loaded from |
| `openvla/openvla-7b-prismatic`. |
|
|
| File layout matches the training-time save format: |
|
|
| ```python |
| state = torch.load(path, map_location="cpu", weights_only=False) |
| # state == {"model": {"per_compr": {...}, "cog_mem_bank": {...}, ...}} |
| ``` |
|
|
| To merge back into a freshly built MemoryVLA, load the full base checkpoint |
| first, then `state_dict.update()` each submodule with the matching keys from |
| this file. |
|
|
| ## Training |
|
|
| - per_device_bs=12 × grad_accum=4 × 2 GPUs → global_bs=96 |
| - max_steps=60 000 (LR=3e-4, sqrt-scaled from 2e-4 @ bs=32; cosine decay after |
| 3 000 warmup steps) |
| - save_interval=500 |
| - Instruction (constant per episode): |
| *"Push the T-shaped block to visit three different target locations on the |
| tabletop, without visiting the same target more than once"* |
|
|
| Hardware: 2× H100 80GB SXM5 (NVLink). |
|
|