upload LoRA-only checkpoints (steps 3500/4000/4500)


Files changed (4)

README.md +55 -117
step-003500-epoch-01-loss=0.1752.pt +3 -0
step-004000-epoch-01-loss=0.0904.pt +3 -0
step-004500-epoch-01-loss=0.1335.pt +3 -0

README.md CHANGED

@@ -1,124 +1,62 @@
-# MemoryVLA LoRA — realpushmultit-320 (4× RTX Pro 6000)
-LoRA fine-tune of OpenVLA-7B + MemoryVLA (DiT-L action head + CogMemBank) on
-[harrywang01/realpushmultit-320](https://huggingface.co/datasets/harrywang01/realpushmultit-320),
-a 320-episode real-robot multi-task push dataset.
-- Base: `openvla/openvla-7b-prismatic` (`step-295000-epoch-40-loss=0.2200.pt`)
-- Repo / code: `KuanchengWang/diffusion_policy`, branch `jinglin`
-- Entry: `train_memoryvla_realpushmultit.py`
-- W&B run: `williamcao-uc-san-diego/memoryvla_realpushmultit_lora/runs/ql4ervrw`
-## Recipe
-| | |
-|---|---|
-| Hardware | 4× NVIDIA RTX Pro 6000 Blackwell (96 GB, sm_120) |
-| Global batch | 4 GPU × per-device 64 × grad-accum 1 = **256** |
-| LR | 2e-4 peak, linear-warmup (500 steps) + cosine decay |
-| Optim | AdamW, weight_decay tied to recipe defaults |
-| Mixed precision | bf16 (FSDP, gradient checkpointing on) |
-| LoRA — LLaMA-2 | r=8, α=16, targets q_proj, v_proj |
-| LoRA — SigLIP | r=8, α=16, targets qkv (fused) — DINOv2 frozen |
-| LoRA — DiT-L self_attn | r=24, α=48, targets qkv (fused, via `_replace_linear`) |
-| LoRA — DiT-L per_attn | r=24, α=48, targets q, v (split) |
-| LoRA — CogMem cross | r=24, α=48, targets q, k, v |
-| LoRA — CogMem GateFusion | r=24, α=48 |
-| Trainable params | ~64 M (LoRA + modules_to_save) / 8.4 B (0.76 %) |
-| MemoryVLA config | DiT-L, future_action_window=15, group_size=16, mem_length=16, retrieval_layers=2, per_token_size=256, fusion=gate, consolidate=tome |
-| DiT diffusion | repeated_diffusion_steps=4 |
-## Run summary
-| Step | Epoch | Loss (train) | Wall-clock from start |
-|---|---|---|---|
-| 1000 | 0 | 0.1240 | 1h59m |
-| 2000 | 1 | 0.0893 | 3h33m |
-| 3000 | 1 | 0.0755 | 5h06m |
-| 4000 | 2 | 0.0635 | 6h39m |
-| 5000 | 3 | 0.0768 | 8h12m |
-| 6000 | 3 | 0.0703 | 9h45m |
-Training ended at step **6483** (loss 0.0468), short of the configured
-`MAX_STEPS=10000`. The MemoryVLA `base_strategy.run_vla_training` loop is
-written for the RLDS-style implicit-repeat dataloader OpenVLA was designed
-around (see the comment at `training/strategies/base_strategy.py:158`), but
-the realpushmultit zarr is a finite map-style `RealPushMultiTMemoryVLADataset`.
-At global batch 256 over 95 % of 435 260 timesteps ≈ 1615 steps/epoch, the
-loop exhausts the dataloader after ~4 epochs and exits cleanly via
-`return` from inside the iteration. Final ckpt is `step-006000`; the run
-log says "Training complete." (no error). To continue past one
-dataset-pass-worth of steps, wrap the dataloader with `itertools.cycle` or
-restructure the loop around an outer epoch loop.
-## Files
-```
-config.json                              — full run config (resolved CLI + defaults; base_vlm fixed to prism-dinosiglip-224px+7b)
-config.yaml                              — same, yaml flavor
-dataset_statistics.json                  — action mean/std over training split (REQUIRED for inference unnorm)
-run-metrics.jsonl                        — early run metadata
-memoryvla_realpushmultit_lora_bs64_v1.jsonl  — per-step train metrics
-checkpoints/
-  step-001000-epoch-00-loss=0.1240.pt    — 32 GB, merged: LoRA deltas folded into base weights, flat state_dict keys
-  step-002000-epoch-01-loss=0.0893.pt
-  step-003000-epoch-01-loss=0.0755.pt
-  step-004000-epoch-02-loss=0.0635.pt
-  step-005000-epoch-03-loss=0.0768.pt
-  step-006000-epoch-03-loss=0.0703.pt
-```
-## Loading
-Each ckpt has been **merged** — LoRA adapter weights (PEFT LLaMA + SigLIP,
-our LoRALinear on DiT-L qkv / CogMem cross / GateFusion, custom MHA-LoRA on
-DiT per_attn) are folded into the corresponding base weights with the
-scaling factor `α/r` applied, then the wrap keys (`base_layer.weight`,
-`lora_A`, `lora_B`, `base_model.model.` prefix) are dropped. The resulting
-state-dict matches a fresh, non-LoRA-wrapped MemoryVLA model 1-for-1, so
-`load_vla(...)` loads cleanly with `strict=True` and rollout / inference
-needs no extra code:
 ```python
-import sys, pathlib
-sys.path.insert(0, str(pathlib.Path("third_party/MemoryVLA").resolve()))
-from vla import load_vla
-vla = load_vla(
-    "checkpoints/step-006000-epoch-03-loss=0.0703.pt",
-    load_for_training=False,
-    action_model_type="DiT-L",
-    future_action_window_size=15,
-    past_action_window_size=0, action_dim=7,
-    mem_length=16, retrieval_layers=2, per_token_size=256,
-    fusion_type="gate", consolidate_type="tome",
-).to("cuda").to(torch.bfloat16).eval()
 ```
-To **resume training** from one of these, set `--is_resume True
---resume_step <step> --resume_epoch <epoch>`. `apply_memoryvla_lora` then
-wraps the model again with fresh (zero-initialised) adapters; the merged
-base carries the prior training's knowledge and new LoRA learns on top.
-The original unmerged ckpts are not preserved (the merge is exact and
-losslessly invertible only with the matching adapter shapes — see
-`scripts/merge_lora_ckpt.py` for the merge logic).
-## Reproduce
-```bash
-git clone git@github.com:KuanchengWang/diffusion_policy.git
-cd diffusion_policy && git checkout jinglin
-git submodule update --init --recursive   # both MemoryVLA + bundled diffusion_policy
-bash install_memoryvla_venv.sh            # then upgrade torch to 2.7.1+cu128 for Blackwell
-# Apply 6 sed patches to scripts/train_memoryvla_realpushmultit_a100x2.sh
-#   (see scripts/_launch_memoryvla_realpushmultit_rtx6000x4.sh for the wrapper)
-# Override: LORA_VISION_TARGETS='[qkv]' (SigLIP timm.Attention uses fused qkv,
-#   not split q_proj/v_proj)
-# Override: PER_DEVICE_BS=64 GRAD_ACCUM=1 (Blackwell 96 GB headroom)
-RUN_ID=memoryvla_realpushmultit_lora_bs64_v1 \
-MAX_STEPS=10000 SAVE_INTERVAL=1000 \
-LORA_VISION_TARGETS='[qkv]' \
-PER_DEVICE_BS=64 GRAD_ACCUM=1 \
-bash scripts/_launch_memoryvla_realpushmultit_rtx6000x4.sh
-```

+---
+license: mit
+tags:
+- robotics
+- vision-language-action
+- lora
+- memoryvla
+base_model: openvla/openvla-7b-prismatic
+---
+# MemoryVLA — RealPushMultiT LoRA fine-tune
+LoRA-only checkpoints from a fine-tune of MemoryVLA (`siglip-224px+mx-bridge`,
+backbone `prism-dinosiglip-224px+7b`, initialised from
+`openvla/openvla-7b-prismatic` step-295000) on the `harrywang01/RealPushMultiT`
+dataset (240 demos / 341 077 timesteps).
+## Contents
+Each `step-NNNNNN-epoch-EE-loss=L.LLLL.pt` is a compact subset of the full
+training checkpoint, containing only the **40.83 M trainable parameters**:
+- LoRA adapters
+  - **LLaMA-2-7B (LLM backbone)**: r=8, α=16 on `q_proj`, `v_proj`
+  - **SigLIP (vision)**: r=8, α=16 on fused `qkv`
+  - **DiT action model**: r=24, α=48 on attention `qkv` and perceiver
+    cross-attention `q`/`v`
+  - **Cognitive memory bank retrieval cross-attn**: r=24, α=48 on
+    `q_proj` / `k_proj` / `v_proj` (with `lora_cog_gate=True`)
+- `modules_to_save` (full small modules, trained outright)
+  - `action_model`: `x_embedder`, `t_embedder`, `z_embedder`, `final_layer`
+  - `cog_mem_bank`: `timestep_encoder`
+  - `per_mem_bank`: entire module
+  - `per_compr` (BottleneckSE): entire module
+Each file is ~163 MB (fp32). The full original checkpoint was ~33.5 GB; the
+frozen base weights (LLaMA + SigLIP + DINOv2 + projector + non-trainable
+linears) are not redistributed and must be loaded from
+`openvla/openvla-7b-prismatic`.
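The ~163 MB figure can be sanity-checked against the parameter count. The sketch below assumes every trainable parameter is stored as a 4-byte fp32 value and treats the remainder as serialization overhead; because 40.83 M is itself a rounded number, the overhead estimate is only approximate.

```python
# Rough size check: fp32 stores each parameter in 4 bytes.
trainable_params = 40.83e6     # rounded count quoted in the README
bytes_per_param = 4            # fp32
lfs_size_bytes = 163_469_523   # "size" field of the Git LFS pointers in this commit

tensor_bytes = trainable_params * bytes_per_param
overhead_bytes = lfs_size_bytes - tensor_bytes

print(f"tensor payload    ~ {tensor_bytes / 1e6:.2f} MB")
print(f"non-tensor bytes  ~ {overhead_bytes / 1e6:.2f} MB (keys, pickle metadata, rounding)")
```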
+File layout matches the training-time save format:
 ```python
+import torch
+# `path` is one of the step-*.pt files in this repo
+state = torch.load(path, map_location="cpu", weights_only=False)
+# state == {"model": {"per_compr": {...}, "cog_mem_bank": {...}, ...}}
 ```
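+To merge back into a freshly built MemoryVLA, load the full base checkpoint
+first. Then, for each top-level key in `state["model"]`, update that
+submodule's `state_dict()` with the matching keys from this file and load the
+result back with `load_state_dict()`. The adapter keys only match if the model
+has already been wrapped with the same LoRA configuration.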
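One way to sketch the merge step is to flatten the nested layout into dotted keys that a full-model `load_state_dict` can consume. This assumes the top-level names in `state["model"]` (such as `per_compr`) are also the attribute names of the corresponding submodules on the model, which the README implies but does not state. The helper name `flatten_trainable_state` and the inner key names in the toy example are hypothetical.

```python
from typing import Any, Dict


def flatten_trainable_state(state: Dict[str, Any]) -> Dict[str, Any]:
    """Turn {"model": {"sub": {"key": t}}} into {"sub.key": t}."""
    flat: Dict[str, Any] = {}
    for sub_name, sub_state in state["model"].items():
        for key, tensor in sub_state.items():
            flat[f"{sub_name}.{key}"] = tensor
    return flat


# Toy stand-in for a loaded checkpoint. Real values would be torch tensors,
# and the inner key names here are made up for illustration.
example = {
    "model": {
        "per_compr": {"fc.weight": 1.0},
        "cog_mem_bank": {"timestep_encoder.bias": 2.0},
    }
}
flat = flatten_trainable_state(example)
print(sorted(flat))
```

With a real, LoRA-wrapped model, calling `model.load_state_dict(flat, strict=False)` returns the missing and unexpected keys. Under the assumption above, the missing keys should be exactly the frozen base weights, and any unexpected key suggests a naming or wrapping mismatch worth checking before running inference.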
+## Training
+- per_device_bs=12 × grad_accum=4 × 2 GPUs → global_bs=96
+- max_steps=60 000 (LR=3e-4, sqrt-scaled from 2e-4 @ bs=32; cosine decay after
+  3 000 warmup steps)
+- save_interval=500
+- Instruction (constant per episode):
+  *"Push the T-shaped block to visit three different target locations on the
+  tabletop, without visiting the same target more than once"*
+Hardware: 2× H100 80GB SXM5 (NVLink).
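The batch-size and learning-rate figures above can be reproduced with a few lines of arithmetic. The sketch assumes "sqrt-scaled" means multiplying the reference LR by the square root of the batch-size ratio; the README does not spell out the exact rule.

```python
import math

# Values taken from the Training section of the README.
per_device_bs, grad_accum, num_gpus = 12, 4, 2
global_bs = per_device_bs * grad_accum * num_gpus

# Assumed rule: lr = ref_lr * sqrt(global_bs / ref_bs)
ref_lr, ref_bs = 2e-4, 32
scaled_lr = ref_lr * math.sqrt(global_bs / ref_bs)

print(global_bs)            # 96
print(f"{scaled_lr:.2e}")   # 3.46e-04
```

Under this rule the scaled value is about 3.46e-4, so the quoted LR=3e-4 looks like a rounded-down choice rather than the exact result; the source does not say which.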

step-003500-epoch-01-loss=0.1752.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d3bda0a2cb4a3faf905549b465fd2ff01f4f140936aeda41f5c46ff543e33e8a
+size 163469523

step-004000-epoch-01-loss=0.0904.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4ada1dc2e5901d06c77076193387bcabad03308d999b7bac7cf5020dcfb74b14
+size 163469523

step-004500-epoch-01-loss=0.1335.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0c2ac16100218b3eef4a13e28735630e77a33ad8eba038471631e8cf79ed33f3
+size 163469523
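The three pointer files above record the SHA-256 digest and byte size of each checkpoint, which can be used to verify a download. The following is a minimal sketch using only the standard library; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the pointer text is the one shown for `step-003500-epoch-01-loss=0.1752.pt`.

```python
import hashlib
import os


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict) -> bool:
    """Return True if the file at `path` has the pointer's size and digest."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so the file is never fully held in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:d3bda0a2cb4a3faf905549b465fd2ff01f4f140936aeda41f5c46ff543e33e8a\n"
    "size 163469523\n"
)
print(pointer["algo"], pointer["size"])
```

After downloading, `matches_pointer("step-003500-epoch-01-loss=0.1752.pt", pointer)` should return `True` for an intact file.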