lsnu committed 0b9a0fd (verified) · parent: 3bee1c7

Delete README.md

Files changed (1): README.md (deleted, −384 lines)

Deleted content of README.md:
# PERSIST-VLA-Tri fast build and test handoff for LCC

## 1. Purpose

This handoff is optimized for one thing: decide quickly whether the tri-branch persistence head is worth further work before spending time on Dobot data collection or high-fidelity simulation.

The immediate target is not SOTA. The target is to establish that the architecture is technically sound, trains stably, does not collapse the transfer branch, and stays in the same rough performance ballpark as the slim baselines and ablations used in this stage.

This plan deliberately avoids Isaac Sim, large VLA finetuning, and any heavyweight external benchmark until the current structure clears a small set of internal gates.
## 2. What is in scope right now

- Use the current reference repo only.
- Run the existing unit tests.
- Run a very short synthetic training loop.
- Run a held-out evaluation with a single script that emits the metrics you actually need to make a go/no-go decision.
- Run a tiny sweep over seeds and sequence horizon.
- Use oracle-based ablations to answer whether transfer is even worth modeling in the current proxy.

Do **not** start OpenVLA, OFT, Isaac, or DeformableRavens until the internal gates in Section 8 pass.
## 3. Current reality of the attached architecture

The current code base is usable as a research reference implementation. The important facts for the first iteration cycle are simple.

The model and tests run.

The synthetic environment already contains the three mechanics classes (`foliage`, `bag`, `cloth`) and the three persistence branches (`hold`, `transfer`, `release`).

The known weak point is not general instability. The weak point is branch behavior. In particular, `transfer` is the branch that matters scientifically, and it is also the one most likely to collapse unless the training signal is strong enough. `release` is still sparse, so the first week should be treated as a **hold-vs-transfer** debugging cycle, with `release` kept in the structure but not treated as a hard success criterion.
## 4. Minimal file layout on cluster

Use a directory layout like this so logs and scripts do not get mixed into the repo.

```text
$WORKDIR/
  persist_vla_reference/     # unzipped reference repo
  persist_vla_lcc_handoff/   # this handoff bundle
  runs/
    smoke/
    quick/
    seq/
    sweep/
  logs/
```
## 5. Minimal software requirements

For the current synthetic stack, the dependency surface is very small.

You need Python, PyTorch, and pytest.

You do **not** need Isaac Sim.

You do **not** need MuJoCo.

You do **not** need any Dobot SDK.

If your cluster environment already has a PyTorch module or conda env, use that. Do not spend time building a custom environment unless the default one fails.

A minimal check is:

```bash
python - <<'PY'
import torch, pytest
print('torch', torch.__version__)
print('pytest', pytest.__version__)
print('cuda', torch.cuda.is_available())
PY
```

If that works, move on.
## 6. Fastest possible build sequence

### 6.1. Copy files to the node or shared storage

Unzip the architecture repo into `persist_vla_reference/`.

Copy this handoff bundle next to it as `persist_vla_lcc_handoff/`.

### 6.2. First interactive smoke pass

Do the first run interactively, not through a long SLURM batch chain. The point is to fail fast.

```bash
cd $WORKDIR/persist_vla_reference
python -m pytest -q
python scripts/demo_inference.py
python $WORKDIR/persist_vla_lcc_handoff/quick_eval_tri.py \
  --repo $WORKDIR/persist_vla_reference \
  --device cpu \
  --steps 20 \
  --batch-size 8 \
  --eval-batch-size 32 \
  --hidden-dim 24 \
  --output-json $WORKDIR/runs/smoke/smoke_eval.json
```

Expected results:

- `pytest` should pass.
- `demo_inference.py` should print branch and phase information without error.
- `quick_eval_tri.py` should emit a JSON file with the training trace and held-out metrics.

If this fails, do not queue anything else.
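Once the smoke JSON exists, a quick sanity read takes a few lines of Python. This is only a sketch: the metric names below are the ones documented in Section 8, and the helper name `show_headline_metrics` is made up here; adjust the key lookups if `quick_eval_tri.py` nests its output differently.

```python
import json

# Sanity-read of the smoke-run JSON. Key names are the ones documented
# in Section 8 of this handoff; "MISSING" flags anything the script
# nests differently than this flat layout assumes.
HEADLINE_KEYS = ("oracle_transfer_state_fraction",
                 "oracle_approx_ratio_of_model_choice",
                 "mode_transfer_recall_vs_oracle")

def show_headline_metrics(metrics):
    for key in HEADLINE_KEYS:
        print(key, metrics.get(key, "MISSING"))

# Usage once the smoke run exists:
#   with open("runs/smoke/smoke_eval.json") as f:
#       show_headline_metrics(json.load(f))
show_headline_metrics({"oracle_transfer_state_fraction": 0.12})
```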
### 6.3. First GPU-backed quick run

Once the smoke pass works, switch to one GPU and run the smallest meaningful experiment.

```bash
python $WORKDIR/persist_vla_lcc_handoff/quick_eval_tri.py \
  --repo $WORKDIR/persist_vla_reference \
  --device cuda \
  --steps 80 \
  --batch-size 24 \
  --eval-batch-size 256 \
  --hidden-dim 48 \
  --sequence-horizon 1 \
  --output-json $WORKDIR/runs/quick/base_s80_h48_seed0.json
```

This is the default quick sanity run. It should finish fast enough that queue time matters more than compute time.
## 7. Exactly what to run in week 1

Run the following in order. Do not expand the matrix until the previous row is green.

### Stage A. Smoke and basic training

- Run one smoke pass.
- Run one quick GPU pass with `sequence_horizon=1`.
- Run one quick GPU pass with `sequence_horizon=3`.

Commands:

```bash
python $WORKDIR/persist_vla_lcc_handoff/quick_eval_tri.py \
  --repo $WORKDIR/persist_vla_reference \
  --device cuda \
  --steps 80 \
  --batch-size 24 \
  --eval-batch-size 256 \
  --hidden-dim 48 \
  --sequence-horizon 1 \
  --seed 0 \
  --output-json $WORKDIR/runs/quick/base_s80_seq1_seed0.json

python $WORKDIR/persist_vla_lcc_handoff/quick_eval_tri.py \
  --repo $WORKDIR/persist_vla_reference \
  --device cuda \
  --steps 80 \
  --batch-size 24 \
  --eval-batch-size 256 \
  --hidden-dim 48 \
  --sequence-horizon 3 \
  --seed 0 \
  --output-json $WORKDIR/runs/seq/base_s80_seq3_seed0.json
```
### Stage B. Small seed sweep

If Stage A works, run three seeds for each of the two settings above. That means six total jobs.

Use seeds `0 1 2`. Keep everything else fixed. Do **not** change more than one variable at a time.
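The six Stage B jobs can be generated mechanically so that nothing but seed and horizon varies. This sketch only builds and prints the command lines (flags and paths mirror Stage A; `$WORKDIR` is left for the shell to expand when the printed commands are run or queued):

```python
# Build the six Stage B command lines (3 seeds x 2 horizons).
# Everything except --sequence-horizon, --seed, and the output path is
# fixed, matching the Stage A settings.
BASE = ("python $WORKDIR/persist_vla_lcc_handoff/quick_eval_tri.py"
        " --repo $WORKDIR/persist_vla_reference"
        " --device cuda --steps 80 --batch-size 24"
        " --eval-batch-size 256 --hidden-dim 48")

jobs = []
for horizon in (1, 3):
    for seed in (0, 1, 2):
        out = f"$WORKDIR/runs/sweep/base_s80_seq{horizon}_seed{seed}.json"
        jobs.append(f"{BASE} --sequence-horizon {horizon}"
                    f" --seed {seed} --output-json {out}")

assert len(jobs) == 6  # one variable at a time, everything else fixed
for cmd in jobs:
    print(cmd)
```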
### Stage C. Slightly longer run only if Stage B is promising

If Stage B does not collapse, run one longer pass.

```bash
python $WORKDIR/persist_vla_lcc_handoff/quick_eval_tri.py \
  --repo $WORKDIR/persist_vla_reference \
  --device cuda \
  --steps 200 \
  --batch-size 32 \
  --eval-batch-size 512 \
  --hidden-dim 64 \
  --sequence-horizon 3 \
  --seed 0 \
  --output-json $WORKDIR/runs/seq/base_s200_seq3_h64_seed0.json
```

This is still a slim run. It is only meant to answer whether the transfer branch starts behaving better with a slightly stronger setting.
## 8. Metrics that matter and the pass/fail gates

The JSON emitted by `quick_eval_tri.py` contains more than you need. Focus on the following fields.

`oracle_transfer_state_fraction`

The fraction of held-out labels where the synthetic oracle says `transfer` is the right persistence mode. This is a property of the proxy task distribution, not of the trained model. If it is too low, your synthetic proxy is not exercising the contribution you care about.

`transfer_fraction_foliage`, `transfer_fraction_bag`, `transfer_fraction_cloth`

These should preserve the intended inductive bias. Bag and cloth should not look less transfer-heavy than foliage.

`oracle_transfer_branch_gain_transfer_states`

The oracle gain from allowing the transfer branch instead of forcing only hold/release, measured only on transfer-labeled held-out states. If this is near zero, the architectural contribution has no room to pay off in the current proxy, no matter how well you train.

`oracle_transfer_template_gain_transfer_states`

This isolates whether the explicit transfer template in the candidate library matters.

`oracle_approx_ratio_of_model_choice`

The teacher utility of the model-selected candidate divided by the oracle-best candidate utility (using means, not unstable per-sample ratios). This is the best single number for the current stage. It tells you whether the model is choosing candidate chunks close to oracle best.

`mode_transfer_recall_vs_oracle`

The recall of the model's selected transfer branch for the decoded mode chunk, relative to the oracle-best branch for that same chunk. This is the direct branch-collapse detector.

Use these gates.
### Gate 1. Proxy sanity

Pass if:

```text
oracle_transfer_state_fraction >= 0.10
transfer_fraction_bag > transfer_fraction_foliage
transfer_fraction_cloth > transfer_fraction_foliage
oracle_transfer_branch_gain_transfer_states >= 0.02
oracle_transfer_template_gain_transfer_states >= 0.01
```

Fail action: if Gate 1 fails, do not touch the model first. Fix the synthetic proxy or the oracle coefficients. The environment is not rewarding transfer strongly enough.
### Gate 2. Candidate selection quality

Pass if the median across the three-seed quick sweep satisfies:

```text
oracle_approx_ratio_of_model_choice >= 0.95
```

Interpretation: even if branch labels are imperfect, the model is still choosing candidate chunks close to oracle-best.
### Gate 3. Branch collapse check

Pass if the median across the three-seed quick sweep satisfies:

```text
mode_transfer_recall_vs_oracle > 0.10
```

This threshold is intentionally low for week 1. The goal is only to prove that transfer does not stay dead.

For the longer run (`steps=200`, `sequence_horizon=3`), the preferred target is:

```text
mode_transfer_recall_vs_oracle >= 0.25
```
### Gate 4. Stability

Pass if:

- `eval_total_loss` is not exploding.
- The training trace shows an early drop in `total`, and the world losses remain finite.
- `branch_prior` and `branch_choice` are not diverging upward late in training.
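The single-number gates above can be checked mechanically once the sweep JSONs exist. A minimal sketch, assuming the flat metric names documented in this section: `check_gates`, the `GATES` table, and the inline example numbers are all illustrative, and Gate 4 still needs a manual look at the training trace.

```python
from statistics import median

# Gate thresholds from Section 8. Gate 1 has more conditions than listed
# here; this sketch checks only the single-number gates. Gate 3 is a
# strict inequality in the text, so a per-gate strictness flag is stored.
GATES = {
    "oracle_transfer_state_fraction": (0.10, False),       # Gate 1 (partial)
    "oracle_approx_ratio_of_model_choice": (0.95, False),  # Gate 2
    "mode_transfer_recall_vs_oracle": (0.10, True),        # Gate 3, strict >
}

def check_gates(runs):
    """runs: list of per-seed metric dicts -> {metric: (median, passed)}."""
    out = {}
    for key, (floor, strict) in GATES.items():
        med = median(r[key] for r in runs)
        out[key] = (med, med > floor if strict else med >= floor)
    return out

# Illustrative stand-ins for three seed JSONs (not real results).
example = [
    {"oracle_transfer_state_fraction": 0.14,
     "oracle_approx_ratio_of_model_choice": 0.96,
     "mode_transfer_recall_vs_oracle": 0.12},
    {"oracle_transfer_state_fraction": 0.13,
     "oracle_approx_ratio_of_model_choice": 0.97,
     "mode_transfer_recall_vs_oracle": 0.05},
    {"oracle_transfer_state_fraction": 0.15,
     "oracle_approx_ratio_of_model_choice": 0.94,
     "mode_transfer_recall_vs_oracle": 0.20},
]
for key, (med, ok) in check_gates(example).items():
    print(f"{key}: median={med:.2f} pass={ok}")
```

In practice, replace `example` with the dicts loaded from the per-seed JSON files in `runs/quick/` and `runs/seq/`.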
## 9. What to do when a run fails

Use this order. Do not change everything at once.

### Failure type A. Transfer recall stays at zero

This is the most important failure mode.

1. Rerun with `sequence_horizon=3`. If transfer is temporal, horizon 1 is too weak.
2. Increase the transfer emphasis in `PersistVLAConfig.branch_loss_weights`. The current default is `(1.0, 4.0, 3.0)`. Move only the transfer weight first, for example to `(1.0, 6.0, 3.0)`.
3. Reduce `branch_prior_scale` slightly if the planner prior is overpowering branch-specific evidence. A first test is `0.15 -> 0.05`.
4. Oversample transfer-heavy states in the synthetic dataset. Do this in the sampler, not by post-hoc filtering during training.

Do **not** spend time on `release` yet.
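Steps 2 and 3 above are config-only changes. The dataclass below is a stand-in for the real `PersistVLAConfig` (which lives in the reference repo and may have a different constructor); it only illustrates moving the transfer weight first and touching `branch_prior_scale` as a separate second step.

```python
from dataclasses import dataclass

# Stand-in for the repo's PersistVLAConfig, with the defaults quoted in
# this section; the real class may differ.
@dataclass
class PersistVLAConfig:
    branch_loss_weights: tuple = (1.0, 4.0, 3.0)  # (hold, transfer, release)
    branch_prior_scale: float = 0.15

# Step 2: raise only the transfer weight.
cfg = PersistVLAConfig(branch_loss_weights=(1.0, 6.0, 3.0))

# Step 3, as a separate follow-up run only if step 2 is not enough:
cfg.branch_prior_scale = 0.05

print(cfg)
```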
### Failure type B. Oracle gains for transfer are too small

Do not blame the learner. This means the proxy itself is too weak.

Increase transfer-support dynamics for `bag` and `cloth` in the synthetic oracle before changing the network. You need the proxy to make transfer genuinely useful.
### Failure type C. Candidate choice ratio is weak but transfer recall is nonzero

This means the branch head may be waking up, but the candidate set is not good enough.

Before adding a larger backbone, improve the candidate library. The first low-cost changes are:

- increase `num_candidates` from 6 to 8,
- add one more transfer-flavored structured template,
- or make the existing transfer template less similar to the hold template.
### Failure type D. Release never appears

Ignore it for the first week.

Treat the system as a hold-vs-transfer architecture with a dormant release head. Only revisit release after transfer is healthy.
## 10. Recommended SLURM usage

Two ready-to-edit sbatch files are included in this handoff bundle.

Use `slurm_smoke.sbatch` first. Use `slurm_seed_sweep.sbatch` after the smoke run is green.

The scripts intentionally use environment variables instead of hard-coded cluster-specific paths.
## 11. When to move beyond the current repo

Move to the next stage only if all of the following are true.

- The proxy sanity gate passes.
- The candidate choice ratio is at or above 0.95 on the slim sweep.
- Transfer recall is consistently nonzero and improves with sequence horizon.
- The bag and cloth scenarios remain more transfer-heavy than foliage.

Only then should you spend time on the first external simulation benchmark.
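The four conditions above can be encoded as a single predicate over the summarized sweep. Every key name in this sketch is a hypothetical placeholder (the handoff does not fix a schema for the summarized sweep); it only shows the shape of the decision:

```python
# Hypothetical go/no-go helper for Section 11. The input is a summary of
# the seed sweep; each key below is a placeholder, not a field guaranteed
# by quick_eval_tri.py.
def ready_for_external_benchmark(s):
    return bool(
        s["gate1_proxy_sanity_passed"]                              # proxy sanity
        and s["oracle_approx_ratio_of_model_choice"] >= 0.95        # choice ratio
        and s["transfer_recall_seq1"] > 0.0                         # nonzero recall
        and s["transfer_recall_seq3"] > s["transfer_recall_seq1"]   # improves with horizon
        and s["transfer_fraction_bag"] > s["transfer_fraction_foliage"]
        and s["transfer_fraction_cloth"] > s["transfer_fraction_foliage"]
    )

summary = {  # placeholder numbers only
    "gate1_proxy_sanity_passed": True,
    "oracle_approx_ratio_of_model_choice": 0.96,
    "transfer_recall_seq1": 0.08,
    "transfer_recall_seq3": 0.21,
    "transfer_fraction_bag": 0.30,
    "transfer_fraction_cloth": 0.28,
    "transfer_fraction_foliage": 0.12,
}
print(ready_for_external_benchmark(summary))
```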
## 12. First external benchmark after this handoff

The first external benchmark should be a **bag-only** pilot, not all three environments. Use DeformableRavens bag tasks as the first port target.

For the first external baseline, use **ACT**, not OpenVLA-OFT. ACT is cheaper to stand up and is already a strong imitation-learning baseline for bimanual manipulation.

The rule is simple. Do not burn cluster time on OFT or any large VLA until the current structure has shown that transfer is real and learnable in the slim proxy.
## 13. Minimal checklist for the assigned developer

1. Unzip the repo and handoff bundle into shared storage.
2. Verify Python, PyTorch, and pytest.
3. Run `pytest -q`.
4. Run `scripts/demo_inference.py`.
5. Run the 20-step smoke eval.
6. Run the two 80-step quick jobs (`sequence_horizon=1` and `3`).
7. Run the three-seed sweep for both horizons.
8. Summarize the six JSON files into a small table with the metrics in Section 8.
9. Decide: proceed, fix proxy, or fix branch-collapse.
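Checklist step 8 can be scripted rather than done by hand. A minimal sketch, assuming each run JSON is a flat dict with the Section 8 keys (the row values below are placeholders, not results):

```python
# Render run summaries as a small markdown table for the week-1 report.
COLS = ("run",
        "oracle_approx_ratio_of_model_choice",
        "mode_transfer_recall_vs_oracle")

def to_markdown_table(rows):
    lines = ["| " + " | ".join(COLS) + " |",
             "|" + " --- |" * len(COLS)]
    for row in rows:
        lines.append("| " + " | ".join(str(row[c]) for c in COLS) + " |")
    return "\n".join(lines)

# Placeholder rows; in practice, load each of the six runs/*.json files
# and pull the Section 8 keys.
print(to_markdown_table([
    {"run": "seq1_seed0",
     "oracle_approx_ratio_of_model_choice": 0.96,
     "mode_transfer_recall_vs_oracle": 0.12},
    {"run": "seq3_seed0",
     "oracle_approx_ratio_of_model_choice": 0.97,
     "mode_transfer_recall_vs_oracle": 0.18},
]))
```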
## 14. Deliverables expected at the end of week 1

By the end of the first iteration cycle, the developer should hand back exactly three things.

1. A one-page table of runs and metrics.
2. A sentence-level diagnosis of the dominant failure mode (proxy too weak, branch collapse, or candidate set weak).
3. A concrete recommendation for the next change, with only one primary change proposed for week 2.