# Developer handoff: elastic-occlusion bimanual VLA on 1×L40S
This document is the working handoff for rebuilding the current repo into a credible research system for bimanual reveal/retrieve under elastic occlusion. It supersedes the narrower short-sprint handoff in `handoff/instructions4.md`. The short-sprint document is still useful as a proxy-benchmark checklist, but it is not enough for the next stage.
The project goal is not to invent a new general-purpose trunk. The goal is to attach a small, structured adapter to a strong public bimanual trunk, preserve general-task competence, and create measurable gains on tasks that look like the future real benchmark:
1. foliage reveal/retrieve (push leaves aside, keep them aside, then retrieve a hidden target),
2. bag opening/retrieve (open a compliant container enough for the other arm to see and retrieve),
3. folded-clothes suitcase retrieval (slight lift/separate, preserve fold structure, retrieve a hidden object).
The right short-term success condition is:
- general public tasks: `trunk + adapter` should be in the same ballpark as `trunk alone`,
- reveal/retrieve-like tasks: `trunk + adapter` should beat `trunk alone` and other generic baselines.
The adapter is where the novelty should live. The trunk should stay as standard and defensible as possible.
---
## 1. What the current repo actually shows
### 1.1 Core architecture in the repo
The current codebase contains three relevant policy families in `VLAarchtests/code/reveal_vla_bimanual/models/policy.py`:
- `BackboneOnlyPolicy`
- `InteractionBimanualPolicy`
- `ElasticRevealBimanualPolicy`
The latest elastic path is the relevant one for this project. It is a monolithic policy composed of:
- a frozen VL backbone wrapper (`models/backbones.py`),
- dual observation memory (`models/observation_memory.py`),
- an interaction / elastic-occlusion state head (`models/reveal_head.py`),
- a coordinated chunk decoder with task-routed proposal modes (`models/action_decoder.py`),
- an elastic-occlusion rollout model (`models/world_model.py`),
- a cascade planner with structured feasibility logic (`models/planner.py`).
This is the part worth preserving conceptually. The important fields in the current elastic state head already match the real tasks unusually well:
- visibility / target confidence,
- access corridor / insertion corridor,
- persistence / release-collapse,
- reocclusion,
- disturbance / damage,
- fold preservation / top-layer stability / lift-too-much risk.
Those signals are directly relevant to the future foliage, bag, and clothes tasks.
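As a concrete sketch of how those fields group together, the following dataclass is illustrative only (field names and the crude feasibility check are assumptions, not the repo's actual schema):

```python
from dataclasses import dataclass

# Hypothetical grouping of the elastic-occlusion state fields listed above.
# Names and thresholds are illustrative, not the repo's actual schema.
@dataclass
class RevealState:
    visibility: float          # target confidence, 0..1
    access_corridor: float     # access / insertion corridor quality, 0..1
    persistence: float         # does the reveal hold after release, 0..1
    reocclusion: float         # risk that occluders collapse back, 0..1
    disturbance: float         # damage / unwanted scene motion, 0..1
    fold_preservation: float   # cloth-specific: fold structure intact, 0..1

    def retrieve_ready(self, thresh: float = 0.5) -> bool:
        """Crude feasibility check: enough access and persistence, low reocclusion."""
        return (self.access_corridor > thresh
                and self.persistence > thresh
                and self.reocclusion < 1.0 - thresh)
```

The point of writing the state down this way is that every downstream component (gate, reranker, transition model) can consume the same small summary.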
### 1.2 What the current repo does **not** show
The repo does **not** currently show that the latest full architecture is a strong general bimanual policy. It also does **not** show that the heavy memory + world-model stack is helping.
The most important current findings from the repo are:
- In the proxy sprint summary, the base model is below random and below oracle on its own candidate set.
- Disabling memory improves the proxy mean over the base model.
- The planner matters.
- The best proxy result comes from task-routed checkpoint routing, not from a single unified learned model.
- The non-zero RLBench result in the “dual_push_nonzero” line is not the kind of fair architecture win needed for a paper claim. It is a retrieval/retargeting positive control, not a clean full-policy benchmark result.
- The local general-task anchor results are not yet strong enough to treat the current custom trunk path as a valid base.
### 1.3 What the existing tests are good for
The current tests are mostly of three kinds:
1. **Contract / plumbing tests**
These verify shapes, token paths, geometry propagation, dataset fields, shortlist plumbing, RVT wrapper output shapes, etc. They are useful and should stay.
2. **Directional proxy tests**
These verify that scripted “good” reveal actions beat obviously bad ones in the procedural proxy benchmark. These are useful because they validate that the proxy metrics are at least pointed in the correct direction.
3. **Evidence-free competence surrogates**
Several tests only prove that a feature toggles or produces different tensors (for example memory and geometry tests). They do not prove the feature helps task performance.
The current test suite is therefore necessary, but not sufficient. It validates software correctness and some proxy metric sanity. It does not validate benchmark strength.
### 1.4 Repo findings that should drive the redesign
Treat the following as the main empirical lessons from the current repo:
- **Keep**: explicit reveal-state prediction.
- **Keep**: task-aware macro proposals.
- **Keep**: feasibility gating for retrieve-like actions.
- **Question**: dual memory (current evidence is weak to negative).
- **Question**: heavy token-level world model (too expensive and under-justified).
- **Question**: local custom RVT path as the main scientific trunk (currently too fragile).
- **Do not claim**: that the current non-zero RLBench result proves the architecture works.
---
## 2. Research claim to target
Do **not** try to claim a new general VLA or a new general bimanual architecture.
The claim should be:
> A structured adapter for foundation bimanual policies that improves reveal/retrieve under elastic occlusion by predicting reveal-state variables (visibility, access, persistence, reocclusion, disturbance, fold preservation), generating task-routed reveal macros, and enforcing retrieve feasibility before execution.
This claim is much cleaner, and much closer to what the repo already hints at.
That claim is only defensible if all of the following are true:
1. the base trunk is strong and reproduced fairly,
2. the adapter causes little or no regression on public general tasks,
3. the adapter gives a real gain on public or proxy tasks that stress reveal/retrieve,
4. the gain cannot be explained away by trivial checkpoint routing alone.
---
## 3. Target system after refactor
The target architecture should be **smaller** than the current monolithic one.
### 3.1 Trunk
Use a strong public bimanual trunk with a faithful evaluation path. In order of preference:
1. **3D FlowMatch Actor (3DFA)**, if code/checkpoints are practical to evaluate fairly.
2. **Official PerAct2 / RVT-style stack**, if 3DFA is not practical.
3. **Official AnyBimanual** as a transfer baseline and possibly as the starting trunk if its code path is the most stable locally.
Do not continue making CLIP the scientific center of the project. The trunk should be imported as a stable base, not reinvented.
### 3.2 Adapter
The adapter should sit **above** the trunk and should be trainable with the trunk frozen. It should contain exactly four core pieces:
1. **Reveal-state head**
Predict scalar and low-resolution field variables for:
- visibility,
- access corridor / insertion corridor,
- persistence / support stability,
- reocclusion,
- disturbance,
- task-specific metrics (bag mouth, foliage opening, cloth fold preservation, top-layer stability).
2. **Task-routed proposal prior**
Generate a small number of macro proposal modes appropriate for the task family. Keep the current proposal vocabulary idea, but do not let it become a separate checkpoint-routing story. The task routing should be internal to one model.
3. **Retrieve-feasibility gate**
Before choosing retrieve or insert-like modes, require predicted access, persistence/support, and reocclusion to satisfy thresholds or a learned gating classifier. This is one of the strongest, most defensible pieces of structure in the current repo.
4. **Lightweight reveal-transition model**
A small transition model over reveal-state variables only. Do **not** keep the full token-heavy spatial rollout model as the default. Predict the next reveal-state summary (and optionally a tiny field map), not the entire scene token stack.
### 3.3 Optional memory
Make memory optional and minimal. The default should be either:
- no memory, or
- a very short reveal-state cache / exponential filter over a few recent steps.
Do not keep the current dual selective memory as a default dependency until it proves value on benchmark success.
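The "very short reveal-state cache" option can be as small as an exponential filter; this sketch assumes a plain vector interface and a default smoothing factor, neither of which is the repo's API:

```python
class RevealStateCache:
    """Minimal exponential filter over recent reveal-state summaries.
    A sketch of the 'very short reveal-state cache' option; alpha and the
    list-of-floats interface are assumptions, not repo API."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha   # weight on the newest observation
        self.state = None    # filtered reveal-state summary

    def update(self, summary: list[float]) -> list[float]:
        if self.state is None:
            self.state = list(summary)
        else:
            self.state = [self.alpha * s + (1 - self.alpha) * p
                          for s, p in zip(summary, self.state)]
        return self.state
```

Something this small is easy to ablate cleanly against the no-memory default, which is exactly the comparison Section 8 calls for.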
### 3.4 No-op / fallback path
This is critical.
The adapter must have a true **no-op** mode:
- on tasks outside the reveal/retrieve family, or
- when the adapter is uncertain,
the system should fall back to the trunk’s default action distribution or trunk shortlist.
This is the cleanest way to preserve general-task performance.
---
## 4. Concrete code changes
The fastest path is not to patch the current monolith forever. Refactor it into a stable trunk interface plus a narrow adapter package.
### 4.1 `models/backbones.py`
#### Changes required
- Replace the current “backbone wrapper does everything” mentality with a narrow `TrunkInterface`.
- Standardize outputs:
- latent tokens,
- optional trunk action distribution or trunk candidate set,
- any geometry features the adapter is allowed to use.
- Remove the assumption that CLIP is the main path.
- Keep the current CLIP path only as a development/debug baseline.
- Treat the current RVT wrapper as provisional until it matches an official evaluation path.
- Add an explicit `NoOpAdapterCompatibleTrunkOutput` schema so the adapter can be bypassed without shape hacks.
#### Why
The current wrapper mixes too much custom logic into the backbone path. That makes it hard to tell whether failures are due to the trunk, geometry handling, or the adapter.
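A minimal sketch of what the narrow interface could look like (class and attribute names are hypothetical; the real schema should be derived from whichever official trunk is chosen):

```python
from typing import Any, Optional, Protocol

# Hypothetical narrow trunk output; attribute names are illustrative.
class TrunkOutput:
    def __init__(self, tokens: Any, action: Any = None,
                 candidates: Any = None, geometry: Any = None):
        self.tokens = tokens          # latent tokens from the trunk
        self.action = action          # optional trunk action distribution / mean
        self.candidates = candidates  # optional trunk candidate set
        self.geometry = geometry      # geometry features the adapter may use

# The adapter only ever sees this protocol, never trunk internals.
class TrunkInterface(Protocol):
    def encode(self, observation: dict[str, Any]) -> TrunkOutput: ...
```

Keeping the interface this narrow is what makes the `NoOpAdapterCompatibleTrunkOutput` bypass trivial: the adapter either consumes a `TrunkOutput` or passes it through untouched.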
### 4.2 `models/policy.py`
#### Changes required
Split the current policy into:
- `FoundationTrunkPolicy`
- `ElasticOcclusionAdapter`
- `AdapterWrappedPolicy`
The wrapped policy should support three modes:
- `adapter_off`
- `adapter_noop`
- `adapter_active`
The execution contract should be:
1. get trunk tokens and trunk action / trunk candidates,
2. if adapter inactive or low confidence, return trunk action,
3. otherwise rank a small candidate set using the adapter and return the selected chunk.
#### Why
This makes no-regression testing possible. Right now the current monolithic policy hides whether the trunk is still intact.
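The three-mode execution contract can be sketched as follows; the class name, callable signatures, and confidence threshold are assumptions for illustration, not the repo's actual API:

```python
# Sketch of the adapter_off / adapter_noop / adapter_active contract.
class AdapterWrappedPolicy:
    def __init__(self, trunk_act, adapter_rank=None, mode="adapter_off",
                 confidence_threshold=0.5):
        self.trunk_act = trunk_act        # callable: obs -> (action, candidates)
        self.adapter_rank = adapter_rank  # callable: (obs, candidates) -> (action, confidence)
        self.mode = mode
        self.confidence_threshold = confidence_threshold

    def act(self, obs):
        action, candidates = self.trunk_act(obs)
        if self.mode in ("adapter_off", "adapter_noop") or self.adapter_rank is None:
            return action  # trunk path untouched: this is the no-op guarantee
        ranked, conf = self.adapter_rank(obs, candidates)
        if conf < self.confidence_threshold:
            return action  # low adapter confidence: fall back to the trunk
        return ranked
```

Note that the trunk is queried first in every mode, so `adapter_noop` is structurally guaranteed to return the trunk action rather than approximating it.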
### 4.3 `models/reveal_head.py`
#### Changes required
Keep the best part of the repo, but simplify and formalize it.
- Split outputs into:
- task-agnostic reveal variables,
- task-specific metrics,
- optional low-res spatial fields.
- Add masks so task-specific losses only apply when valid.
- Preserve the cloth-specific metrics. They are one of the best differentiators for the future suitcase benchmark.
- Add explicit calibration support (for example confidence outputs or logits) so the state head can be evaluated independently of policy success.
#### Why
The reveal-state head is likely the publishable core. It needs cleaner interfaces and evaluation, not more entanglement.
### 4.4 `models/action_decoder.py`
#### Changes required
Keep the current task proposal vocabulary concept, but tighten it:
- candidate 0 must always be the trunk/base action,
- proposal candidates must stay near the trunk action initially,
- proposal mode families should be internal to one model, not external checkpoint routing,
- add a generic fallback mode family for non-target tasks,
- keep explicit mode names for analysis and paper figures.
Current task families to preserve and clean up:
- foliage: `widen_gap`, `maintain_gap`, `insert_actor`, `retrieve`, etc.
- bag: `widen_mouth`, `maintain_mouth`, `probe_inside`, `insert_actor`, `retrieve`
- cloth: `lift_edge`, `separate_layer`, `stabilize_fold`, `maintain_lift`, `insert_actor`, `retrieve`
#### Why
The proposal vocabulary is useful. The current best proxy result already suggests task specialization matters. But the specialization must become a principled internal prior, not a checkpoint-routing workaround.
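Written out as data, the vocabulary with its candidate-0 invariant might look like this (the mode strings mirror the lists above; the dict layout and helper are an illustrative assumption):

```python
# Per-family proposal vocabularies; candidate 0 is always the trunk action.
PROPOSAL_MODES = {
    "generic": ["trunk_action"],  # fallback family for non-target tasks
    "foliage": ["trunk_action", "widen_gap", "maintain_gap",
                "insert_actor", "retrieve"],
    "bag":     ["trunk_action", "widen_mouth", "maintain_mouth",
                "probe_inside", "insert_actor", "retrieve"],
    "cloth":   ["trunk_action", "lift_edge", "separate_layer",
                "stabilize_fold", "maintain_lift", "insert_actor", "retrieve"],
}

def candidate_modes(task_family: str) -> list[str]:
    """Unknown families fall back to the generic vocabulary."""
    modes = PROPOSAL_MODES.get(task_family, PROPOSAL_MODES["generic"])
    assert modes[0] == "trunk_action"  # candidate-0 invariant
    return modes
```

Keeping the mode names explicit like this is also what makes the per-family usage statistics in Section 10 cheap to log.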
### 4.5 `models/planner.py`
#### Changes required
Refactor the planner into two explicit parts:
1. **hard/soft feasibility gate**
2. **residual reranker**
The gate should use reveal-state variables only. The reranker can use the lightweight transition model and proposal logits.
Also add:
- a clean `identity` planning mode,
- a clean `trunk_only` selection mode,
- an `adapter_confidence` score,
- diagnostics for every rejected retrieve-like candidate.
#### Why
The current planner appears to be one of the few useful parts of the architecture. It needs to be isolated and made measurable.
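The gate/reranker split can be sketched in a few lines; the threshold values, field names, and the `retrieve_like` flag are assumptions, not the repo's planner API:

```python
# Sketch of the two-part planner: a hard gate over reveal-state variables,
# then a rerank over the surviving candidates.
def feasibility_gate(state: dict, retrieve_like: bool) -> bool:
    """Retrieve-like candidates must clear access/persistence/reocclusion."""
    if not retrieve_like:
        return True
    return (state["access"] >= 0.5
            and state["persistence"] >= 0.5
            and state["reocclusion"] <= 0.5)

def select_candidate(candidates, states, scores) -> int:
    """Rerank gated survivors by score; fall back to candidate 0 if none pass."""
    survivors = [i for i, (c, s) in enumerate(zip(candidates, states))
                 if feasibility_gate(s, c.get("retrieve_like", False))]
    if not survivors:
        return 0  # identity / trunk_only fallback
    return max(survivors, key=lambda i: scores[i])
```

The diagnostics requirement above then amounts to logging every index the gate removes and why, which this structure makes trivial.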
### 4.6 `models/world_model.py`
#### Changes required
Do not keep the current full token-heavy elastic rollout model as the default research path.
Replace it with a much smaller transition model over:
- scalar reveal-state summaries,
- optionally one or two low-res fields (for example access map and support map),
- action macro / candidate metadata.
The transition model should predict:
- next visibility,
- next access corridor,
- next persistence / support,
- next reocclusion,
- next disturbance / fold metrics.
Only reintroduce a heavier spatial model if the lightweight model clearly helps.
#### Why
The current rollout model is too expensive and too under-validated for a single-L40S research loop.
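To make the intended scale concrete, here is a stand-in for the lightweight transition model as a single linear map over scalar reveal-state summaries plus a one-hot macro id. The dimensions, identity-style init, and the use of a plain linear map are assumptions; in practice this would be a small two-layer MLP:

```python
import math

class RevealTransition:
    """Toy-scale transition model: next reveal-state from (state, macro)."""

    def __init__(self, n_state: int, n_macros: int):
        self.n_state = n_state
        self.n_in = n_state + n_macros
        # identity-ish init: next state starts as a copy of the current state
        self.w = [[1.0 if i == j else 0.0 for j in range(self.n_in)]
                  for i in range(self.n_state)]

    def predict(self, state: list[float], macro: int) -> list[float]:
        onehot = [1.0 if k == macro else 0.0
                  for k in range(self.n_in - self.n_state)]
        x = state + onehot
        raw = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]
        # squash each predicted variable back into (0, 1)
        return [1.0 / (1.0 + math.exp(-4.0 * (r - 0.5))) for r in raw]
```

A model of this size makes short-horizon rollouts essentially free on one L40S, which is the whole argument against the token-heavy version.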
### 4.7 `models/observation_memory.py`
#### Changes required
Default behavior should be:
- disabled, or
- replaced by a tiny reveal-state cache.
If the current dual memory stays in the repo, mark it experimental. Either wire the suppression margin logic properly or remove it. Right now it looks half-finished and the current proxy evidence is not favorable.
#### Why
Memory is currently a likely liability, not a likely differentiator.
### 4.8 `train/losses.py`
#### Changes required
Reweight the training objective around what is actually learnable and measurable.
Required losses:
- action BC / trajectory loss from the trunk policy path,
- **candidate ranking loss** against oracle utility within the same candidate set,
- proposal mode classification / assignment,
- reveal-state regression/classification,
- retrieve-feasibility gate loss,
- lightweight transition-model loss,
- **no-regression distillation** from the trunk on general tasks,
- optional calibration loss for reveal-state confidence.
Losses to demote or remove unless justified by results:
- large generic memory losses,
- large token-level world-model reconstruction losses.
#### Why
The repo already points to the correct training target: close the gap to the oracle chooser on the candidate set. That is much better than adding more latent machinery.
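The task-specific masking requirement can be sketched directly; the metric key names, the squared-error choice, and the flat dict interface are illustrative assumptions:

```python
# Sketch of task-specific loss masking: bag metrics contribute zero loss on
# foliage/cloth samples, and so on. Key names are hypothetical.
TASK_KEYS = {
    "foliage": ["foliage_opening"],
    "bag": ["bag_mouth"],
    "cloth": ["fold_preservation", "top_layer_stability"],
}
SHARED_KEYS = ["visibility", "access", "persistence", "reocclusion", "disturbance"]

def masked_state_loss(pred: dict, target: dict, task_family: str) -> float:
    """Squared error over shared keys plus this family's keys only."""
    keys = SHARED_KEYS + TASK_KEYS.get(task_family, [])
    return sum((pred[k] - target[k]) ** 2
               for k in keys if k in pred and k in target)
```

The same masking structure is what `test_task_specific_loss_masking.py` (Section 9.2) should exercise.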
### 4.9 `train/trainer.py`
#### Changes required
Add explicit training regimes:
- `trunk_only_eval`
- `adapter_noop_eval`
- `adapter_train_frozen_trunk`
- `adapter_finetune_light`
- `general_distillation_only`
- `proxy_rank_only`
Freeze the trunk by default. Any trunk finetuning should be delayed until the adapter proves itself.
Also add a single switch that controls whether evaluation is:
- trunk only,
- adapter no-op,
- adapter active,
- adapter active with planner off,
- adapter active with gate off.
#### Why
The current trainer still reflects an architecture-search phase. The next phase needs controlled, fair comparisons.
### 4.10 Dataset / teacher generation code
Code for proposal alignment and proxy data generation already exists. Reuse it, but narrow it.

Required changes:
- generate oracle labels and candidate utilities for proxy tasks,
- export reveal-state supervision targets explicitly,
- export candidate-mode assignments,
- export task metadata separately from free-form language,
- ensure every sample can be evaluated in:
- trunk-only mode,
- no-op mode,
- adapter mode.
Do not let text strings be the only task family signal. Explicit task metadata must be available.
---
## 5. What to keep, what to remove, what to treat as provisional
### Keep
- explicit reveal-state variables,
- task-routed macro proposal vocabulary,
- retrieve-feasibility gate,
- geometry-aware observation path,
- existing proxy scripted sanity tests,
- candidate-ranking supervision.
### Remove from the default path
- heavy dual memory as a required component,
- full token-heavy rollout model,
- any claim based on checkpoint routing alone,
- any claim based on the retargeted demo positive control.
### Treat as provisional
- custom RVT wrapper,
- local RLBench general benchmark path until official baseline reproduction is clean,
- memory-related gains unless they appear in a proper task-success benchmark.
---
## 6. Benchmark strategy
The benchmark plan should be staged. Do not jump straight to a full RLBench sweep.
### Phase 0. Baseline reproduction
Goal: prove that the evaluation path is real.
Required outcome:
- at least one official public trunk reproduces a known strong score on a small anchor subset,
- one anchor task should match a public or repo-validated release closely enough to trust the pipeline.
If this fails, stop and fix evaluation before touching the adapter further.
### Phase 1. General-task anchor set
Use a small public anchor set that is broad enough to catch regressions, but small enough to run repeatedly on one L40S.
Recommended anchor tasks:
- coordinated push box,
- coordinated lift ball,
- dual push buttons,
- handover item,
- lift tray.
These are not the target application tasks. They are regression sentries.
Acceptance criterion:
- `adapter_noop` should be essentially identical to `trunk_only`,
- `adapter_active` should remain in the same ballpark as `trunk_only`,
- any loss on the anchor mean must be small and explainable.
If the trunk itself is weak on the chosen anchor set, replace the trunk. Do not proceed with a weak base.
### Phase 2. Existing proxy benchmark (internal shaping only)
Use the existing proxy suite as an architecture-shaping instrument, not as the main paper result.
Preserve the narrow stress slices from the existing handoff:
- nominal,
- high reocclusion,
- camera perturbation.
Preserve the task slices:
- foliage,
- bag,
- cloth.
Keep the simple baselines:
- random,
- candidate 0,
- oracle chooser,
- scripted good/bad actions.
What to measure beyond success:
- reveal-state prediction correlation with proxy ground truth,
- ranking correlation with oracle utility,
- gate precision/recall for unsafe retrieve attempts,
- effect of proposal families by task,
- reocclusion after reveal,
- fold-preservation metrics on cloth slices.
### Phase 3. Public target-like tasks
This is the most important new benchmark stage.
The future real benchmark does not exist yet, so approximate it with public tasks that stress:
- containment opening,
- hidden-object access,
- cluttered retrieval,
- partial reveal before retrieve,
- disturbance control.
Use a small public target-like subset first. Candidate tasks to prioritize:
- open drawer,
- put item in drawer / retrieve-like container interactions,
- take shoes out of box,
- shell game,
- pick up notebook,
- straighten rope.
The exact final subset can change if some tasks prove unstable, but the principle should stay the same: these tasks should be more target-like than the anchor set.
### Phase 4. Deformable / garment benchmarks
For the clothes/suitcase direction, add a public deformable benchmark as soon as the infrastructure is stable.
Priority order:
1. GarmentLab (if practical to run),
2. GarmentPile or similar garment-clutter retrieval benchmarks,
3. other public deformable-manipulation tasks only if they are easy to integrate.
This stage matters because the suitcase task is probably the strongest future novelty angle.
### Phase 5. Broader robustness benchmark
Only after phases 0–4 succeed, consider a broader dual-arm benchmark such as RoboTwin 2.0 or a wider RLBench/PerAct2 sweep.
Do not do this early. It is expensive and not yet the right bottleneck.
---
## 7. Baselines that must be included
At minimum, every meaningful experiment should compare against:
1. **the same trunk alone**
This is the most important baseline.
2. **the same trunk with adapter disabled / no-op**
This isolates whether the wrapper is already damaging performance.
3. **PerAct2**
Use official or faithful public numbers / code path.
4. **AnyBimanual**
Important because the repo already references it and because transfer from strong unimanual data is relevant.
5. **3DFA**, if evaluation is practical
This is the strongest public benchmark baseline for bimanual PerAct2-style tasks and should be the aspirational reference.
Optional if practical:
- CoFreeVLA (useful because it is also a structured auxiliary head on top of a VLA),
- ActiveVLA (conceptually relevant for active perception),
- task-specific academic comparisons in writing (Vision in Action, bag SOI model, garment retrieval papers), even if not reproduced in code.
---
## 8. Required ablations
The current repo already shows that “big architecture blob vs baseline” is not informative enough. The next paper-worthy evidence must isolate the actual source of gain.
Run the following ablations in order.
### General-task ablations
1. `trunk_only`
2. `trunk + adapter_noop`
3. `trunk + adapter_active (gate only)`
4. `trunk + adapter_active (gate + reveal-state head)`
5. `trunk + adapter_active (gate + reveal-state + proposal prior)`
6. `trunk + adapter_active (gate + reveal-state + proposal prior + lightweight transition model)`
7. optional: `+ short reveal cache`
Interpretation target:
- general tasks should not fall apart as structure is added,
- if they do, the adapter is not sufficiently no-op-safe.
### Target-like ablations
1. full adapter
2. no gate
3. no proposal prior
4. no task conditioning
5. no lightweight transition model
6. no geometry
7. no depth
8. no cloth-specific metrics (for the cloth slice only)
9. checkpoint routing only (to prove that routing alone is not the full story)
Interpretation target:
- gate should matter,
- proposal prior should matter,
- cloth-specific metrics should matter on cloth-like slices,
- routing alone should not account for the final gain.
### Memory ablations
Do these late, not early:
- no memory,
- short reveal cache,
- current dual memory.
If dual memory does not clearly beat no memory on actual task success, drop it.
---
## 9. Tests to add or rewrite
The current suite is decent for plumbing. It now needs benchmark-faithfulness tests and ablation-protecting tests.
### 9.1 Keep the current useful tests
Keep and maintain the existing tests that verify:
- proxy scripted benchmark directionality,
- geometry path activation under camera perturbation,
- dataset geometry fields,
- proposal shortlist plumbing,
- task metadata override behavior,
- candidate ranking loss behavior.
### 9.2 Add the following tests
#### `test_trunk_noop_equivalence.py`
With adapter disabled or in strict no-op mode, verify that:
- action mean / candidate set match the trunk path exactly (or within tight tolerance),
- no planner or routing side effects change outputs.
This is the single most important new test.
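The comparison at the heart of this test can be sketched as a helper (the name, tolerance, and list-of-action-vectors shape are assumptions; the real test would run the actual `trunk_only` and `adapter_noop` modes on one frozen batch):

```python
# Hypothetical helper for the no-op equivalence check.
def assert_noop_equivalent(trunk_actions, noop_actions, tol=1e-6):
    """Compare per-step action vectors from trunk_only vs adapter_noop."""
    assert len(trunk_actions) == len(noop_actions)
    for step, (a, b) in enumerate(zip(trunk_actions, noop_actions)):
        diff = max(abs(x - y) for x, y in zip(a, b))
        assert diff <= tol, f"no-op mode diverged at step {step}: max diff {diff}"
```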
#### `test_trunk_interface_official_eval_parity.py`
For one selected official trunk and one frozen batch, verify that:
- preprocessing,
- camera handling,
- token layout,
- action decoding,
match the official implementation path closely enough to trust the wrapper.
This should be an integration test, not just a shape test.
#### `test_adapter_gate_blocks_unsafe_retrieve.py`
Build explicit synthetic reveal states where retrieve should and should not be allowed. The current planner already contains similar logic; formalize it into a direct unit test.
#### `test_reveal_state_metric_calibration.py`
For proxy env rollouts with known labels, verify that predicted reveal-state metrics correlate with the simulator labels and are not collapsed.
#### `test_candidate_ranking_matches_oracle.py`
Given a batch with oracle candidate utilities from the proxy env, verify that training reduces the gap between the model ranker and the oracle chooser.
This should be a real learned ranking test, not just a toy-array loss test.
#### `test_task_specific_loss_masking.py`
Verify that foliage metrics are not trained on bag/cloth tasks, bag metrics are not trained on foliage/cloth tasks, etc.
#### `test_cloth_specific_metrics_affect_selection.py`
For cloth-like proxy cases, verify that fold-preservation / lift-too-much risk can change candidate selection even when nominal reachability is similar.
#### `test_general_eval_protocol_is_identical.py`
Ensure that `trunk_only`, `adapter_noop`, and `adapter_active` all use the same observation stack, same action horizon, same task subset, and same evaluation step budget.
This prevents accidental unfairness.
### 9.3 Promote some current tests from “unit” to “benchmark guardrails”
The following should become part of the required CI / pre-run checklist:
- geometry path smoke test,
- dataset geometry/history test,
- no-op equivalence test,
- benchmark protocol identity test.
---
## 10. Metrics that matter
Do not rely on success alone.
### General-task metrics
- task success,
- return (if available),
- variance across seeds,
- regression relative to trunk.
### Target-like metrics
- success,
- visibility gain,
- access / insertion corridor gain,
- persistence / support gain,
- reocclusion after reveal,
- disturbance / damage,
- fold preservation (cloth-like slice),
- unsafe retrieve rate,
- oracle gap on candidate ranking.
### Calibration / diagnostics
- correlation of predicted reveal metrics with simulator ground truth,
- gate precision / recall,
- candidate shortlist recall of oracle candidate,
- proposal mode usage by task,
- fallback rate to trunk.
The fallback rate matters. If the adapter almost never activates, then the system may preserve general performance but not meaningfully help target tasks. If it always activates and hurts general tasks, it is not safe enough.
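Two of the diagnostics above can be pinned down with one-line definitions; the exact formulas here are assumptions, offered so the metrics are logged consistently across runs:

```python
# Illustrative definitions for two diagnostics: oracle gap and fallback rate.
def oracle_gap(chosen_utility: list[float], oracle_utility: list[float]) -> float:
    """Mean per-episode shortfall of the model's chosen candidate vs the
    oracle chooser on the same candidate set (0 means oracle-matching)."""
    return sum(o - c for c, o in zip(chosen_utility, oracle_utility)) / len(chosen_utility)

def fallback_rate(decisions: list[str]) -> float:
    """Fraction of steps where the adapter deferred to the trunk."""
    return sum(d == "trunk" for d in decisions) / len(decisions)
```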
---
## 11. Acceptance gates
These gates should determine whether to continue, simplify, or stop.
### Gate A. Trunk validity
Pass only if an official or faithful trunk path is clearly non-trivial on the anchor set.
If this fails, stop. Do not spend effort on the adapter yet.
### Gate B. No-op safety
Pass only if `adapter_noop` is effectively identical to `trunk_only`.
If this fails, stop and fix the wrapper.
### Gate C. General-task parity
Pass only if `adapter_active` stays in the same ballpark as `trunk_only` on the anchor set. A small drop may be acceptable, but not a collapse.
Use a simple rule for the first pass:
- mean absolute drop on the anchor set should be very small,
- no single anchor task should collapse catastrophically.
If the adapter is helping target-like tasks but causing a broad general-task collapse, the architecture is not ready.
### Gate D. Target-like gain
Pass only if the full adapter clearly beats:
- trunk alone,
- adapter no-op,
- random,
- candidate 0,
- and ideally narrows the oracle gap.
This is where the architecture starts to become scientifically interesting.
### Gate E. Non-trivial novelty
Pass only if the gain is not explained almost entirely by checkpoint routing or trivial task labels. The final model should be a single structured adapter, not a routing script disguised as a model.
---
## 12. Recommended training strategy on 1×L40S
The compute constraint implies one principle: **do not retrain the trunk repeatedly**.
### Use this strategy
1. Choose one strong trunk.
2. Freeze it.
3. Build the adapter around it.
4. Run many cheap adapter experiments.
5. Only consider light trunk finetuning after the adapter is already useful.
### Practical guidelines
- mixed precision everywhere practical,
- gradient checkpointing if needed,
- keep candidate counts modest,
- keep rollout horizon short,
- keep the transition model lightweight,
- train on a narrow but representative task set,
- log every candidate-level diagnostic needed for offline analysis.
### What not to do
- do not repeatedly launch full-scale trunk retraining,
- do not run full benchmark sweeps before anchor parity is established,
- do not expand the world model before the lightweight version proves value,
- do not hide regressions behind different seeds, different demos, or different eval protocols.
---
## 13. Minimal execution order
Follow this order. Do not reorder it casually.
### Step 1. Freeze the current repo as a historical branch
Keep it for reference, but stop treating it as the final architecture.
### Step 2. Build a clean trunk interface
Get one official trunk path working and reproducible.
### Step 3. Implement adapter no-op mode
This must pass no-op equivalence tests before any learning claims are made.
### Step 4. Port only the strong ideas
Port:
- reveal-state head,
- task-routed macro proposal prior,
- retrieve-feasibility gate.
Do **not** port the full heavy memory/world-model stack by default.
### Step 5. Add a lightweight transition model
Only over reveal-state summaries.
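The contract matters more than the architecture here: the transition model rolls forward low-dimensional reveal-state summaries, never raw observations. The sketch below uses a hand-written linear update with clipping as a placeholder for a small learned MLP; the dimension names and the `transition` helper are assumptions for illustration.

```python
# Sketch of the lightweight transition model's contract. It maps
# (reveal-state summary, macro embedding) -> next reveal-state summary,
# with entries kept in [0, 1]. The linear map stands in for a tiny MLP.

REVEAL_DIM = 4  # e.g. [opening_visibility, occluder_tension,
                #       target_exposure, graspability] -- assumed names

def transition(summary, macro_embedding, weights, bias):
    """One-step rollout: next = clip(W @ [summary; macro] + b, 0, 1)."""
    x = summary + macro_embedding  # concatenate state and macro features
    nxt = []
    for row, b in zip(weights, bias):
        v = sum(w * xi for w, xi in zip(row, x)) + b
        nxt.append(min(1.0, max(0.0, v)))  # reveal-state entries live in [0, 1]
    return nxt

# Identity weights over the summary half, zero macro: state is unchanged.
identity = [[1.0 if i == j else 0.0 for j in range(2 * REVEAL_DIM)]
            for i in range(REVEAL_DIM)]
s0 = [0.2, 0.5, 0.1, 0.3]
macro = [0.0] * REVEAL_DIM
s1 = transition(s0, macro, identity, [0.0] * REVEAL_DIM)
assert s1 == s0

# A large bias saturates every entry at 1.0, exercising the clipping.
s_hi = transition(s0, macro, identity, [2.0] * REVEAL_DIM)
assert s_hi == [1.0] * REVEAL_DIM
```

Because rollouts touch only a handful of floats per step, reranking a modest candidate set over a short horizon stays cheap on a single L40S.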
### Step 6. Train adapter-only on proxy supervision and ranking
Focus on oracle-gap reduction and reveal-state prediction quality.
### Step 7. Run anchor parity benchmark
If parity fails, stop and simplify.
### Step 8. Run target-like public subset and existing proxy suite
If gains appear only on the internal proxy and nowhere else, the architecture is still too benchmark-shaped.
### Step 9. Add garment/deformable benchmark
This is the most likely path to a strong suitcase/clothes result.
### Step 10. Prepare the real-world data plan only after sim evidence is strong
The real teleop benchmark should come after a strong sim go/no-go decision, not before.
---
## 14. What “novel enough” should mean here
The novelty should be modest and crisp. It does not need to be a giant new architecture.
A reasonable novelty claim is:
- a foundation-policy-compatible structured adapter,
- explicit reveal-state variables for elastic occlusion,
- task-routed reveal macros,
- retrieve-feasibility gating,
- lightweight reveal-state rollout / reranking.
This is a good paper if:
- the base trunk is respected,
- the adapter is small,
- the gains are real on the target-like tasks,
- the general-task regression is small,
- the ablations isolate the contribution cleanly.
This is **not** a good paper if the final story is:
- “we replaced the trunk,”
- “we added many modules and one of them helped a bit,”
- “we route to a better checkpoint for each task,”
- “we get non-zero on one RLBench branch because demo retrieval rescued it.”
---
## 15. Proposed paper positioning (for later)
If the system works, position it against two groups of prior work.
### General bimanual policy baselines
- PerAct2,
- AnyBimanual,
- 3D FlowMatch Actor,
- optionally CoFreeVLA as an “auxiliary structured head” comparator.
### Target-task conceptual neighbors
- active bag reveal/retrieve from demonstrations,
- active perception for manipulation under occlusion,
- bag-specific SOI latent-dynamics models,
- occlusion-aware hidden-object retrieval in clutter,
- garment clutter retrieval / garment manipulation benchmarks.
The paper should say: generic bimanual foundation policies are good at general dual-arm manipulation, but they lack explicit reveal-state structure for elastic occlusion tasks. The adapter adds that structure while preserving general capability.
---
## 16. Deliverables expected from the developer
The handoff is not complete until the following exist.
### Code deliverables
- clean trunk interface,
- adapter package,
- no-op path,
- lightweight transition model,
- benchmark scripts for anchor, proxy, and target-like subsets,
- required new tests,
- config files for all reported experiments.
### Experimental deliverables
- trunk-only anchor benchmark report,
- adapter-noop parity report,
- full ablation report,
- target-like benchmark report,
- cloth/deformable benchmark report,
- candidate ranking / oracle gap diagnostics,
- reveal-state calibration plots.
### Reporting format
Every report should include:
- exact checkpoint,
- exact demos,
- exact seeds,
- exact task subset,
- exact eval protocol,
- whether the adapter was off / noop / active,
- whether planner/gate/transition model were enabled,
- per-task scores and mean.
No undocumented “special” branches should be used for headline results.
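One way to enforce that every report carries these fields is to make the record a typed structure rather than free-form markdown front matter. The field names below are suggestions, not an existing schema in the repo.

```python
from dataclasses import dataclass, field, asdict

# Sketch of a per-report metadata record covering the required fields.
# Field names are proposals; adapt them to whatever the repo standardizes on.

@dataclass
class EvalReport:
    checkpoint: str            # exact checkpoint path or hash
    demos: str                 # exact demo set identifier
    seeds: list                # exact seeds
    task_subset: list          # exact task subset
    eval_protocol: str         # exact eval protocol name/version
    adapter_mode: str          # "off" | "noop" | "active"
    planner_enabled: bool
    gate_enabled: bool
    transition_model_enabled: bool
    per_task_scores: dict = field(default_factory=dict)

    @property
    def mean_score(self):
        s = self.per_task_scores
        return sum(s.values()) / len(s) if s else 0.0

r = EvalReport(
    checkpoint="ckpt_0123.pt", demos="demo_set_v1", seeds=[0, 1, 2],
    task_subset=["dual_push_buttons"], eval_protocol="anchor_v1",
    adapter_mode="noop", planner_enabled=False, gate_enabled=False,
    transition_model_enabled=False,
    per_task_scores={"dual_push_buttons": 0.7},
)
assert r.adapter_mode in {"off", "noop", "active"}
```

`asdict(r)` serializes straight to JSON/YAML, so the same record can be dumped next to each report directory and diffed across runs.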
---
## 17. Immediate next actions
1. Pick the trunk to standardize around.
2. Build and validate the no-op wrapper.
3. Strip the adapter down to:
- reveal-state head,
- proposal prior,
- retrieve gate.
4. Replace the heavy world model with a lightweight reveal-state transition model.
5. Run anchor parity.
6. Run proxy ranking and target-like subset.
7. Decide whether memory is dropped permanently.
8. Add garment benchmark integration.
That is the shortest path from the current repo to a defensible paper candidate.
---
## 18. Appendix: repo evidence that motivated this handoff
Relevant repo locations to inspect while implementing:
- Main model stack:
- `VLAarchtests/code/reveal_vla_bimanual/models/policy.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/backbones.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/reveal_head.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/action_decoder.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/planner.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/observation_memory.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/world_model.py`
- Training / losses:
- `VLAarchtests/code/reveal_vla_bimanual/train/losses.py`
- `VLAarchtests/code/reveal_vla_bimanual/train/trainer.py`
- `VLAarchtests/code/reveal_vla_bimanual/train/build_aligned_proposal_dataset.py`
- Existing tests worth keeping:
- `VLAarchtests/tests/test_proxy_scripted_bench.py`
- `VLAarchtests/tests/test_geometry_matters_under_camera_perturbation.py`
- `VLAarchtests/tests/test_memory_matters_under_high_reocclusion.py`
- `VLAarchtests/tests/test_rlbench_dataset_rgbd_geometry.py`
- `VLAarchtests/tests/test_candidate_ranking_loss.py`
- `VLAarchtests/tests/test_rvt_backbone_forward.py`
- Existing reports that matter:
- `VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary.md`
- `VLAarchtests/artifacts/reports/task_routed_proxy_v1/summary.md`
- `reports/true_baseline_compare_subset3_v1/...`
- `reports/general_task_anchor_20260330_dual_push_buttons/...`
- `reports/dual_push_nonzero_branch_20260330/...`
- `reports/dual_push_full_arch_hybrid_20260331/...`
Use those reports as a diagnosis of what is weak, not as proof that the current architecture is already ready.
---
## 19. External references to keep in mind
General bimanual baselines and nearby work:
- PerAct2 benchmark and baselines: https://arxiv.org/abs/2407.00278
- AnyBimanual: https://bimanual.github.io/
- 3D FlowMatch Actor (3DFA): https://arxiv.org/abs/2508.11002
- CoFreeVLA: https://arxiv.org/abs/2601.21712
- ActiveVLA: https://arxiv.org/abs/2601.08325
Target-task conceptual neighbors:
- Vision in Action (active bag reveal/retrieve from human demonstrations): https://arxiv.org/html/2506.15666v1
- Bimanual Deformable Bag Manipulation with SOI neural dynamics: https://arxiv.org/abs/2401.11432
- Occlusion-Aware Search for Object Retrieval in Clutter: https://ieeexplore.ieee.org/document/9197067
- GarmentPile++ / cluttered garment retrieval: https://arxiv.org/abs/2603.04158
- RoboTwin 2.0 benchmark: https://arxiv.org/abs/2506.18088
Add the exact GarmentLab citation separately if that benchmark is included in the final experimental plan.
---
## Final instruction to the implementer
Do not try to rescue the current architecture by adding even more structure. The repo already revealed the answer: the good idea is narrow. Keep the structured reveal-state adapter, keep the retrieve gate, keep task-aware proposals, and force the whole design to prove two things cleanly:
1. it does not break a strong trunk on general bimanual tasks,
2. it improves reveal/retrieve under elastic occlusion.
If both are true, the project is in good shape. If either is false, simplify further rather than expanding again.