
Developer handoff: structured bimanual reveal-and-retrieve under elastic occlusion

Repo target: lsnu/VLAarchtests (current main, latest post-fix state). This handoff is written against the current elastic_reveal stack, not the older intermediate variants.

1. Project introduction

This project is a structured bimanual policy stack for reveal-and-retrieve tasks under partial observability and deformable or elastic occlusion. The eventual real-world targets are three Dobot X-trainer environments. The first is dense live foliage with hidden fake snails, where one arm must create and maintain a canopy gap while the other arm retrieves the target safely. The second is bag opening and retrieval, where one arm must open and hold the bag mouth while the other arm retrieves the target item. The third is suitcase or folded-cloth retrieval, where one arm must slightly lift and stabilize clothing layers while the other arm retrieves a hidden item without destroying the fold structure.

The current repo already contains the right broad decomposition for this task family. It has a multi-view visual backbone, RGB-D support, an explicit reveal state head, observation memory, a compact world model, a coordinated bimanual action decoder, and a planner. The problem is not the structural idea. The problem is that several important pieces are only partially wired, too compact, or only validated on teacher-shaped proxy data. The current code is a good scaffold. It is not yet strong enough to justify “beats SOTA” claims on either public benchmarks or the three target task families.

The current public evidence should be read narrowly. The most credible positive result in the repo is that RGB-D helps on the proxy benchmark. The planner, world model, and role-symmetry components are not yet validated strongly enough to claim they are the source of the gains. The RLBench / PerAct2 integration is also still mostly a launch and plumbing layer, not a mature benchmark suite.

This handoff therefore has one purpose. Keep the structured reveal-and-retrieve idea, but harden the architecture and evaluation until there is a realistic chance of beating strong bimanual baselines on the three target environments.

2. Current repo status (what exists, what is missing)

The current core files are:

code/reveal_vla_bimanual/models/backbones.py
code/reveal_vla_bimanual/models/multiview_fusion.py
code/reveal_vla_bimanual/models/observation_memory.py
code/reveal_vla_bimanual/models/reveal_head.py
code/reveal_vla_bimanual/models/world_model.py
code/reveal_vla_bimanual/models/action_decoder.py
code/reveal_vla_bimanual/models/planner.py
code/reveal_vla_bimanual/models/policy.py
code/reveal_vla_bimanual/train/losses.py
code/reveal_vla_bimanual/sim_reveal/dataset.py
code/reveal_vla_bimanual/sim_reveal/procedural_envs.py
code/reveal_vla_bimanual/eval/run_reveal_benchmark.py
code/reveal_vla_bimanual/eval/run_rlbench_rollout_eval.py
code/reveal_vla_bimanual/eval/run_peract2_task_sweep.py

The current proxy benchmark already uses the correct three abstract task types (foliage, bag, cloth). That is good. The current dataset code also has explicit no-leak assertions, which is also good.

The current weaknesses are specific and fixable.

First, the geometry path is only partially wired. The backbone produces depth_tokens, geometry_tokens, and camera_tokens, but the policy only forwards RGB, depth, and camera tokens into fusion. The explicit geometry_tokens are dropped before fusion. In addition, camera geometry is incomplete. The current depth adapter encodes intrinsics and camera translation, but not an equally explicit camera rotation representation. For three-camera reveal tasks this is a real omission.

Second, memory is too pooled and too global. The current memory path reduces scene history to pooled tokens before write decisions and bank updates, which makes it a novelty-gated summary memory rather than a spatial occlusion memory. A summary memory cannot represent “hold the opening”, “the target is still probably behind this flap”, or “reveal progress will collapse if the revealer arm releases now”.

Third, the world model is too compact. It is useful as a scaffold, but not as the state-transition core for elastic foliage, bag apertures, or layered cloth. It currently rolls a compact hidden state rather than a spatial field state. That makes it too weak for counterfactual planning over opening persistence, reocclusion, and safe actor insertion.

Fourth, the planner is not trained on hard enough candidates. The current proxy data generation uses the teacher chunk and mostly Gaussian perturbations around it. That is enough to test ranking near a teacher, but not enough to teach the planner the actual failure modes that matter in these tasks (premature retrieval, releasing the opening, over-disturbing the scene, lifting the wrong cloth edge, etc.).

Fifth, the state head is still too generic. It predicts a useful set of reveal-related fields, but it does not yet expose the right task-specific latent variables for foliage, bag, and folded cloth. Those tasks are not the same. They share the same reveal-and-retrieve pattern, but they do not share the same dominant failure modes.

Sixth, the test suite is mostly contract-level. Those tests are useful, but they do not yet prove that the structured components work behaviorally. The RLBench side is similar. The launch smoke is only a plumbing check. The actual rollout evaluator exists, but it needs to become the main public benchmark path.

3. The main design decision

Do not collapse this into a generic monolithic VLA. That is not the likely win condition for these tasks.

The highest-probability path is a stronger visual backbone plus an explicit structured reveal-and-retrieve stack. The reason is simple. Your target tasks are asymmetric, partially observable, persistence-sensitive, and reocclusion-sensitive. One arm often has to create and maintain a temporary affordance that only exists because of that arm’s continued state. Generic end-to-end BC can sometimes imitate the behavior, but these tasks strongly reward explicit representations of opening quality, hold persistence, target belief, reocclusion risk, and actor feasibility.

The structured architecture should stay. It should just become spatial, task-aware, and evaluated honestly.

4. Mandatory code changes

4.1 Fix and strengthen the geometry path

Files to change:

models/backbones.py
models/multiview_fusion.py
models/policy.py
tests/test_rgbd_forward_contract.py (extend)
Add new tests: tests/test_geometry_tokens_propagate.py, tests/test_camera_rotation_geometry.py

Exact changes:

In models/policy.py, update the image encoding path so that geometry_tokens are passed from backbone.encode_images(..., return_aux=True) into the fusion module. Right now the policy forwards rgb_tokens, depth_tokens, and camera_tokens, but not geometry_tokens. This should be corrected first because it is an actual information-drop bug.

In models/multiview_fusion.py, update the fusion interface to accept explicit geometry_tokens. The geometry attention path should fuse from a real concatenation or gated combination of [depth_tokens, geometry_tokens, camera_tokens], rather than synthesizing “geometry” only from the surviving depth and camera paths. Keep the existing gated cross-attention pattern, but make the geometry path explicit and inspectable.

In models/backbones.py, upgrade DepthPatchAdapter so that geometry features include camera orientation. Use a 6D rotation representation or a normalized quaternion plus translation. Also add per-patch viewing ray directions derived from intrinsics and camera pose. The three target environments all rely on view geometry and persistent multi-view correspondence. The current translation-only pose treatment is too weak.
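As a reference point, the intended geometry features can be sketched framework-free. This is a minimal numpy illustration, not the repo's DepthPatchAdapter API; the function names, the patch grid, and the pinhole intrinsics layout are assumptions:

```python
import numpy as np

def rotation_6d(R):
    """6D rotation representation: the first two columns of the rotation
    matrix, which regress more stably than quaternions or Euler angles."""
    return np.concatenate([R[:, 0], R[:, 1]])  # shape (6,)

def patch_ray_directions(K, R, patch_grid=(8, 8), image_size=(224, 224)):
    """World-frame viewing-ray direction for each patch center, derived from
    pinhole intrinsics K and camera rotation R."""
    H, W = image_size
    gh, gw = patch_grid
    ys = (np.arange(gh) + 0.5) * H / gh        # patch-center pixel rows
    xs = (np.arange(gw) + 0.5) * W / gw        # patch-center pixel cols
    u, v = np.meshgrid(xs, ys)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (gh, gw, 3) homogeneous pixels
    cam_rays = pix @ np.linalg.inv(K).T                # unproject to camera-frame rays
    world_rays = cam_rays @ R.T                        # rotate into the world frame
    return world_rays / np.linalg.norm(world_rays, axis=-1, keepdims=True)
```

With per-patch rays available as tokens, fusion can reason about view geometry directly instead of re-deriving it from raw intrinsics, which is exactly what the fake-gap-from-one-viewpoint failure mode needs.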

Add config flags that actually do something. The current use_camera_geometry style config needs to gate a real path, not just exist as a dormant option. Add separate switches for use_depth_tokens, use_geometry_tokens, and use_camera_pose_tokens so ablations are clean.

Why this matters: the foliage and bag tasks are especially sensitive to camera geometry because small apparent gaps can be fake from one viewpoint and usable from another. The actor feasibility estimate should depend on geometry, not just appearance.

4.2 Replace pooled novelty memory with spatial reveal memory

Files to change:

models/observation_memory.py
models/policy.py
models/reveal_head.py
Add new tests: tests/test_spatial_memory_occlusion_persistence.py, tests/test_memory_slot_write_gating.py, tests/test_reocclusion_memory_regression.py

Exact changes:

Keep the current memory modules as a fallback baseline, but add a new default path that stores low-resolution spatial memory instead of only pooled history summaries. The simplest realistic version is a two-branch memory:

  1. scene memory: a small bank of view-conditioned or canonicalized spatial tokens for persistent geometry and support structure;
  2. belief memory: a spatial target-belief / reveal-state memory that carries uncertainty explicitly.

The memory does not need to be large. An 8×8 or 12×12 field token grid per view (or a shared canonical field) is enough. The key requirement is that the write gate becomes spatial or slot-wise, not global only. The model must be able to update “the mouth is open here” without overwriting “the target is probably still here”.
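The slot-wise gating requirement can be illustrated with a minimal numpy sketch. The gate parameterization and shapes here are assumptions; the real gate should be learned rather than a hand-set bias:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_write(memory, observation, gate_bias=-2.0):
    """Per-cell gated write over (H, W, D) field-token grids. Cells whose
    observation diverges from the stored memory get a high write gate;
    unchanged cells keep their stored belief instead of being overwritten
    by a global update."""
    novelty = np.linalg.norm(observation - memory, axis=-1, keepdims=True)  # (H, W, 1)
    gate = sigmoid(novelty + gate_bias)            # spatial gate, not a scalar
    return (1.0 - gate) * memory + gate * observation
```

The point of the sketch is the shape of the gate: it is a field, so the model can update “the mouth is open here” without touching “the target is probably still here”.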

Add explicit channels or latent heads for:

  • newly revealed regions
  • still-visible regions
  • reoccluded regions
  • persistent hold or opening quality
  • target belief uncertainty

The world model and planner should consume this spatial memory directly. Do not average it away before planning.

Why this matters: a reveal-and-retrieve policy that forgets where the useful opening was, or where the hidden object probably still is, will look competent in one-step imitation and fail in multi-step retrieval.

4.3 Replace the compact world model with a spatial rollout model

Files to change:

models/world_model.py
models/policy.py
train/losses.py
Add new tests: tests/test_world_model_null_rollout.py, tests/test_world_model_identity_rollout.py, tests/test_world_model_field_consistency.py, tests/test_world_model_task_adapter.py

Exact changes:

Keep the current compact GRU world model only as an ablation. The default model should become a spatial latent rollout over field tokens or low-resolution maps. A realistic implementation is a ConvGRU or a token-wise recurrent transformer over a low-resolution field state. The world-model state should contain at least:

  • target belief field
  • visibility or reveal field
  • actor feasibility / corridor field
  • opening quality or hold quality field
  • persistence field
  • disturbance / damage risk field
  • reocclusion risk field
  • support stability field

Add task conditioning directly into the world model. A learned task embedding (foliage, bag, cloth) should modulate the transition. The dynamics are not the same and should not be forced into one unstructured transition model.
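One way to realize the task-conditioned spatial transition is FiLM-style modulation of the field update. A hedged numpy sketch standing in for a ConvGRU-like cell (all names, shapes, and the gating form are assumptions, not the repo's world_model.py interface):

```python
import numpy as np

def film(x, task_emb, W_gamma, W_beta):
    """FiLM-style conditioning: the task embedding produces a per-channel
    scale and shift applied to the transition features."""
    gamma = task_emb @ W_gamma            # (D,) per-channel scale
    beta = task_emb @ W_beta              # (D,) per-channel shift
    return x * (1.0 + gamma) + beta

def spatial_transition(state, action_feat, task_emb, params):
    """One rollout step over an (H, W, D) field state, with a per-cell
    update gate so dynamics stay local."""
    W_gamma, W_beta, W_z = params
    pre = state + action_feat                      # broadcast action features onto the field
    mod = film(pre, task_emb, W_gamma, W_beta)     # task-dependent dynamics
    z = 1.0 / (1.0 + np.exp(-(mod @ W_z)))         # per-cell update gate
    return (1.0 - z) * state + z * np.tanh(mod)
```

Because foliage, bag, and cloth select different scales and shifts, the same transition weights produce task-specific dynamics without three separate world models.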

Retain explicit ablation modes inside models/world_model.py:

  • identity_rollout
  • null_rollout
  • compact_rollout (the current baseline)
  • spatial_rollout (new default)

These ablations must be real and deterministic. The world-model ablation confusion in the current repo shows why this needs to be explicit and unit-tested.
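A minimal sketch of the deterministic dispatch the ablation tests should rely on. The mode names mirror the list above; `learned_step` is a placeholder for the compact or spatial transition:

```python
import numpy as np

def rollout_step(state, mode, learned_step=None):
    """Dispatch for the world-model ablation modes. null and identity must
    be exact so the unit tests can assert them deterministically."""
    if mode == "null_rollout":
        return np.zeros_like(state)        # no information carried forward
    if mode == "identity_rollout":
        return state.copy()                # state frozen across steps
    if mode in ("compact_rollout", "spatial_rollout"):
        return learned_step(state)         # learned transition variants
    raise ValueError(f"unknown rollout mode: {mode}")
```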

Why this matters: the planner can only beat a simple decoder if its counterfactual rollouts capture persistence and collapse. Without a spatial world model, the “maintain opening while actor advances” pattern will be under-modeled.

4.4 Make the reveal head task-aware

Files to change:

models/reveal_head.py
train/losses.py
sim_reveal/dataset.py
sim_reveal/procedural_envs.py
Add new tests: tests/test_task_conditioned_head_shapes.py, tests/test_task_metric_monotonicity.py

Exact changes:

Add a task embedding to the reveal head. Keep the shared trunk, but use task-specific adapters or low-rank heads for the final outputs. The head should still produce common fields, but each task must also expose the state variables that actually matter.
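The low-rank adapter idea can be sketched in a few lines. This is an illustrative numpy version, not the reveal_head.py API; the shapes and the adapter rank are assumptions:

```python
import numpy as np

def task_head(trunk_feat, task_id, W_shared, adapters):
    """Shared output projection plus a per-task low-rank delta: the effective
    weight is W_shared + A @ B, so tasks share the trunk while specializing
    the final reveal-state outputs."""
    A, B = adapters[task_id]                 # A: (D, r), B: (r, K), rank r small
    return trunk_feat @ (W_shared + A @ B)   # (K,) task-adapted outputs
```

The low-rank form keeps the per-task parameter cost small, which matters when only proxy data exists for each task family.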

For foliage, add:

  • gap width or reveal corridor width
  • canopy strain / damage risk
  • occluder return tendency (reocclusion after release)
  • target visibility confidence under flexible occluders

For bag, add:

  • mouth aperture width or area
  • rim endpoint or rim grasp quality
  • hold quality
  • rim slip risk
  • insertable actor corridor

For cloth or suitcase, add:

  • layer separation quality
  • fold-preservation score
  • insertion corridor
  • top-layer stability
  • “lift too much” risk

The current generic fields (actor_feasibility_field, persistence_field, risk_field, uncertainty_field, reocclusion) are useful, but they are not enough. The planner needs the task-specific variables because the right action for bag opening is not the right action for layered cloth.

4.5 Replace Gaussian candidate noise with semantic macro candidates plus continuous refinement

Files to change:

models/action_decoder.py
models/planner.py
sim_reveal/dataset.py
sim_reveal/procedural_envs.py
Add new tests: tests/test_candidate_macro_coverage.py, tests/test_planner_reocclusion_gating.py, tests/test_proposal_semantic_diversity.py

Exact changes:

Keep the current proposal mechanism as a fallback. The default candidate set should become a set of semantic macro modes, each refined by continuous deltas.

The candidate vocabulary should be task-aware.

For foliage:

  • sweep_left
  • sweep_right
  • pin_canopy
  • widen_gap
  • maintain_gap
  • insert_actor
  • retrieve

For bag:

  • pin_left_rim
  • pin_right_rim
  • widen_mouth
  • maintain_mouth
  • probe_inside
  • insert_actor
  • retrieve

For cloth:

  • lift_edge
  • separate_layer
  • stabilize_fold
  • maintain_lift
  • insert_actor
  • retrieve

Represent these as discrete proposal tokens or a macro head in action_decoder.py, then produce continuous chunk deltas conditioned on the chosen macro. The planner should shortlist across macro families first and refine within each family second. That prevents “all candidates are tiny perturbations around the same wrong idea”.
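A sketch of macro-first proposal generation, shown for the foliage vocabulary. The weight shapes, refinement counts, and noise scale are illustrative assumptions, not the action_decoder.py interface:

```python
import numpy as np

FOLIAGE_MACROS = ["sweep_left", "sweep_right", "pin_canopy", "widen_gap",
                  "maintain_gap", "insert_actor", "retrieve"]

def propose_candidates(ctx, macros, W_logit, W_delta, n_refine=3,
                       noise=0.05, rng=None):
    """Macro-first proposals: every macro family contributes candidates, and
    continuous deltas refine within each family. This guarantees semantic
    coverage instead of Gaussian jitter around one mode."""
    if rng is None:
        rng = np.random.default_rng(0)
    logits = ctx @ W_logit                       # (num_macros,) macro scores
    candidates = []
    for m, macro in enumerate(macros):
        base = ctx @ W_delta[m]                  # macro-conditioned chunk delta
        for _ in range(n_refine):
            candidates.append(
                (macro, logits[m], base + noise * rng.normal(size=base.shape)))
    return candidates
```

Shortlisting then happens across macro families first and within families second, which is what prevents “all candidates are tiny perturbations around the same wrong idea”.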

In models/planner.py, add hard feasibility gates before utility aggregation. Do not let the planner prefer “retrieve now” if actor feasibility, hold quality, or support stability are below threshold. Use worst-step or CVaR-style penalties for reocclusion and collapse, rather than only mean penalties. These tasks fail on bad tails, not just on averages.
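The gate-then-tail-penalize scoring can be sketched as follows. Field names, thresholds, and the risk weight are assumptions; the real gates should come from the reveal head's predicted fields:

```python
import numpy as np

def plan_utility(rollout, gates, cvar_alpha=0.2):
    """Gated, tail-sensitive candidate scoring: candidates failing any hard
    feasibility gate are rejected outright, and per-step reocclusion risk is
    penalized by its worst-tail (CVaR-style) value rather than its mean."""
    for key, threshold in gates.items():
        if rollout[key] < threshold:
            return -np.inf                             # hard gate: never rank this plan
    risk = np.sort(rollout["reocclusion_risk"])[::-1]  # per-step risk, worst first
    k = max(1, int(np.ceil(cvar_alpha * len(risk))))
    tail_risk = risk[:k].mean()                        # mean of the worst alpha fraction
    return rollout["reward"] - 5.0 * tail_risk
```

Two rollouts with identical mean risk but different risk spikes now score differently, which is the point: these tasks fail on bad tails, not on averages.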

Why this matters: the current planner is too dependent on easy local ranking. Real reveal-and-retrieve requires semantically different plans, not just slightly different noise vectors.

4.6 Change the loss stack to supervise what actually matters

Files to change:

train/losses.py
train/trainer.py (if needed for logging)
Add new tests: tests/test_candidate_ranking_loss.py, tests/test_phase_labels_not_action_only.py, tests/test_planner_gradient_flow.py

Exact changes:

Reduce dependence on heuristic phase labels inferred from the current action chunk. That heuristic is acceptable for early bootstrapping, but it should not remain the main source of phase supervision. Prefer simulator-side phase or subgoal labels where available. If those are not reliable, phase should be a weak auxiliary, not a strong driver.

Add pairwise or listwise ranking loss over candidate action chunks using actual rollout utility labels. These labels should come from simulated outcomes, not just from “teacher is first, noise is worse”.
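A minimal pairwise hinge version of the candidate ranking loss, written in plain Python for clarity; a listwise or NDCG-based variant would train on the same rollout utility labels:

```python
def pairwise_ranking_loss(scores, utilities, margin=0.1):
    """Pairwise hinge loss over candidate chunks: for every pair where rollout
    utility says candidate i beats candidate j, the planner's score for i
    must exceed j's by at least the margin."""
    loss, pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if utilities[i] > utilities[j]:
                loss += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return loss / max(pairs, 1)
```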

Add consistency losses:

  • predicted opening quality should correlate with rollout persistence
  • predicted reocclusion should correlate with actual collapse after release
  • predicted uncertainty should be calibrated against outcome uncertainty or visibility error

Lower the relative weight of pure behavior cloning once ranking and rollout supervision are reliable. This project should not stay as BC-with-many-auxiliaries.

5. Mandatory data-generation changes

Files to change:

sim_reveal/dataset.py
sim_reveal/procedural_envs.py
Add new tests: tests/test_dataset_hard_negative_presence.py, tests/test_no_leak_with_new_labels.py, tests/test_teacher_audit.py

Exact changes:

The dataset generation path must stop relying on teacher-plus-Gaussian-noise as the dominant source of planner candidates. Keep the teacher as one source, but add hard negative families that reflect actual task failures.

Required negative families for all three tasks:

  1. premature retrieve: actor attempts retrieval before corridor and hold quality are sufficient;
  2. reveal-with-release: revealer creates an opening but fails to maintain it;
  3. over-disturbance: revealer opens aggressively but causes collapse or damage risk;
  4. wrong-side or wrong-edge reveal: the opening is created in a useless place;
  5. delayed actor entry: revealer holds too long and wastes time or destabilizes the scene;
  6. actor path through weak corridor: actor enters where access exists visually but not safely.

Required task-specific negative families:

For foliage:

  • swipe that increases visibility briefly but induces immediate reocclusion;
  • push direction that hides the target from the actor side;
  • gap on the wrong side of the target.

For bag:

  • one-rim lift that slips instead of widening the mouth;
  • opening wide enough visually but not stable enough for actor insertion;
  • actor reaches through the fabric instead of through the aperture.

For cloth:

  • lift too high and destroy fold structure;
  • lift the wrong layer;
  • retrieve path that drags clothing and unfolds the stack.

The dataset should record candidate-level rollout outcomes for every candidate chunk:

  • success
  • reveal achieved
  • visibility AUC
  • hold persistence
  • reocclusion rate
  • disturbance cost
  • fold-preservation (cloth)
  • mouth aperture / hold quality (bag)
  • damage proxy / gap width (foliage)

This candidate-level outcome table should be the source of planner labels.

Also add a teacher audit report. The current teacher is a useful bootstrap, but it is not enough to assume it is good. The audit should compare the teacher against reveal-only, retrieve-only, no-hold, and random policy baselines on the current proxy suite.

6. Small but mandatory engineering cleanups

These changes do not change model quality directly, but they reduce evaluation ambiguity and future regressions.

In tests/conftest.py, remove the hardcoded /workspace/VLAarchtests/code/reveal_vla_bimanual path. Replace it with a path derived from Path(__file__).resolve() so tests run anywhere.

In eval/run_rlbench_rollout_eval.py, preserve richer episode traces. Save chosen macro mode, planner scores, confidence, predicted reocclusion, path recoveries, noop fallbacks, and whether support-mode conditioning was enabled.

In eval/run_reveal_benchmark.py, stop using only the default 24 episodes for serious comparisons. Keep 24 as a smoke benchmark, but add a “serious” mode at 100 or 200 episodes per proxy.

In eval/run_reveal_benchmark.py, explicitly report chunk_commit_steps and do not leave the main reveal benchmark at a commit horizon of zero by default. These tasks are not purely one-step reactive.

In the eval reporting utilities, add bootstrap confidence intervals and paired-seed comparisons. The differences you care about are often a few percentage points. Unpaired noisy comparisons are not enough.
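A minimal implementation of the paired-seed bootstrap comparison, assuming per-seed success arrays from two methods evaluated on identical episode seeds:

```python
import numpy as np

def paired_bootstrap_ci(a, b, n_boot=10000, alpha=0.05, seed=0):
    """Paired-seed bootstrap CI for the mean difference between two methods
    evaluated on the same episode seeds. If the interval excludes zero, the
    gap is unlikely to be evaluation noise."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(a) - np.asarray(b)          # per-seed paired differences
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    means = diff[idx].mean(axis=1)                # resampled mean differences
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return diff.mean(), (lo, hi)
```

Pairing on seeds removes the shared episode-difficulty variance, which is what makes few-percentage-point gaps detectable at realistic episode counts.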

7. Exact new tests to verify the implementation

The current repo has contract tests. Keep them. Add the following behavioral tests.

7.1 Geometry and fusion tests

tests/test_geometry_tokens_propagate.py

Construct a tiny batch with fixed RGB and depth. Modify only camera rotation. Verify that:

  1. geometry_tokens change,
  2. the fused scene representation changes when geometry is enabled,
  3. the fused scene representation does not change when geometry is disabled.

tests/test_camera_rotation_geometry.py

Use two cameras with identical translation and different rotation. Verify that the policy representation is rotation-sensitive after the geometry fix. This should fail on the current code and pass after the change.

7.2 Spatial memory tests

tests/test_spatial_memory_occlusion_persistence.py

Use a scripted proxy sequence where the target is briefly visible, then fully occluded, then visible again. Verify that belief memory retains a localized target belief during occlusion and sharpens it after reappearance. This should test both persistence and uncertainty.

tests/test_memory_slot_write_gating.py

Feed a scene where only the opening region changes. Verify that only a minority of memory slots or cells update. This prevents global overwriting.

tests/test_reocclusion_memory_regression.py

Create a scripted “open then release” sequence. Verify that memory tracks reocclusion and that predicted hold quality declines.

7.3 World-model tests

tests/test_world_model_null_rollout.py

Assert that null_rollout returns an exact or near-exact identity state and does not apply unintended updates.

tests/test_world_model_identity_rollout.py

Assert that identity_rollout preserves state across steps while leaving logging fields consistent.

tests/test_world_model_field_consistency.py

Roll out one deterministic proxy step and compare predicted next-step fields against simulator privileged fields. Enforce MAE thresholds per field, not only a single scalar.

tests/test_world_model_task_adapter.py

Use the same initial field state with different task embeddings. Verify that transitions differ in a consistent way. This catches dead task-conditioning code paths.

7.4 Candidate and planner tests

tests/test_candidate_macro_coverage.py

Verify that the proposal generator returns at least one candidate from each required macro family when requested.

tests/test_planner_reocclusion_gating.py

Create a scripted case where one candidate retrieves immediately but causes opening collapse, and another candidate maintains the opening first. Verify that the planner picks the maintain-first plan.

tests/test_proposal_semantic_diversity.py

Do not measure diversity only by vector distance. Also verify macro-family diversity and rollout outcome diversity.

7.5 Task-head tests

tests/test_task_conditioned_head_shapes.py

Verify output presence and shapes for all common fields and all task-specific fields.

tests/test_task_metric_monotonicity.py

Use small synthetic perturbations:

  • increase aperture in bag: opening_quality should increase;
  • increase canopy gap in foliage: actor_feasibility should increase;
  • over-lift cloth: fold_preservation should decrease.

These are not full scientific tests, but they catch dead or miswired heads quickly.
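The shared probe behind these checks is tiny. A hedged sketch, where `metric_fn` and `perturb` stand in for the real head output and a scripted simulator perturbation:

```python
def check_monotone(metric_fn, state, perturb, increasing=True):
    """Minimal monotonicity probe: apply a controlled perturbation to one
    state variable and check that the predicted metric moves the expected
    direction. Returns True when the head responds correctly."""
    before, after = metric_fn(state), metric_fn(perturb(state))
    return (after > before) if increasing else (after < before)
```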

7.6 Dataset and leakage tests

tests/test_dataset_hard_negative_presence.py

Sample dataset items and verify that candidate sets contain hard negative families, not just teacher-centered noise.

tests/test_no_leak_with_new_labels.py

Extend the no-leak assertions to cover all new task-specific labels and maps. The proxy dataset must keep using rendered observations only on the input side.

tests/test_teacher_audit.py

Require the teacher to beat random, retrieve-only, and reveal-only on the proxy metrics. If the teacher itself is weak, the whole planner training signal is questionable.

7.7 Scripted proxy behavior suite

Add a new deterministic behavioral test suite, for example under tests/test_proxy_scripted_bench.py.

This suite should include 10 to 20 deterministic seeds per task with hand-designed initial states. The expected winner should be obvious.

Required scripted cases:

  • bag: maintain_mouth should beat an immediate retrieve on hold persistence and success;
  • foliage: pin_canopy should beat random_swipe on reocclusion and visibility AUC;
  • cloth: stabilize_fold should beat lift_high on fold-preservation and success.

The full model does not need to be perfect on these, but the planner should select the intended candidate at least 80 percent of the time.

8. Exact benchmark plan to estimate performance

Separate the benchmarks into two layers. The first layer verifies that the implementation behaves correctly. The second estimates real performance against baselines.

8.1 Layer A: implementation-verification benchmarks

These are not publication benchmarks. They are gates.

Run the full unit and integration suite after every architecture milestone:

PYTHONPATH=code/reveal_vla_bimanual pytest tests -q

After the new behavioral tests are added, require all of the following before moving on:

  • all geometry propagation tests pass;
  • the scripted proxy suite passes;
  • world-model null and identity ablations pass exactly;
  • candidate macro coverage passes;
  • no-leak assertions pass with new task fields.

Then run a deterministic proxy smoke benchmark on fixed seeds (for example 10 per task) to catch obvious regressions:

cd code/reveal_vla_bimanual
python -m eval.run_reveal_benchmark \
  --model full=/abs/path/checkpoint.pt \
  --episodes 10 \
  --proxies foliage bag cloth \
  --chunk-commit-steps 4 \
  --output-root /abs/path/reports/reveal_smoke

This benchmark is only for regression detection. It is not a performance claim.

8.2 Layer B: strengthened proxy benchmark (main task-aligned benchmark now)

This should become the main internal benchmark until real teleop data exists.

Use the existing foliage, bag, and cloth proxies, but strengthen them and evaluate seriously:

  • at least 100 deterministic seeds per proxy for final comparisons;
  • paired-seed evaluation across all ablations;
  • chunk commit horizons of at least 4, and also report a 0/2/4 sweep once;
  • no teacher involvement during evaluation.

Run the base benchmark:

cd code/reveal_vla_bimanual
python -m eval.run_reveal_benchmark \
  --model full=/abs/path/checkpoint.pt \
  --episodes 100 \
  --proxies foliage bag cloth \
  --chunk-commit-steps 4 \
  --output-root /abs/path/reports/reveal_full

Run required paired ablations from the same checkpoint family or retrained checkpoints:

  • no geometry tokens
  • no spatial memory
  • compact world model instead of spatial
  • no planner
  • planner with Gaussian candidates only
  • no task-conditioned head
  • no support-mode conditioning

The proxy benchmark must report at least these metrics:

  • retrieve success
  • reveal success
  • target visibility AUC
  • actor-feasibility AUC
  • hold persistence
  • reocclusion rate
  • disturbance cost
  • planner top-1 on candidate rollouts
  • world-model next-step MAE
  • uncertainty calibration
  • candidate ranking NDCG

Add task-specific metrics:

  • foliage: gap width, damage proxy, release-collapse rate
  • bag: aperture width or area, rim slip rate, insertion success
  • cloth: fold-preservation score, layer separation quality, drag-induced disturbance

Acceptance gate for continuing toward public baseline comparison:

  • the full model should beat the current repo’s RGB-D baseline on mean proxy success and on at least two of the three proxies;
  • planner-on should beat planner-off on at least two of the three proxies and on hard-negative candidate ranking;
  • spatial world model should beat compact and null rollouts on persistence and reocclusion prediction;
  • task-conditioned head should beat generic head on at least one task-specific metric per target task.

8.3 Layer C: RLBench / PerAct2 bimanual rollout benchmark

The repo already has the right hook for this. Use run_rlbench_rollout_eval.py and run_peract2_task_sweep.py as the main public benchmark entry points. Do not treat run_peract2_launch_smoke.py as evaluation. It is only a launch check.

Run the full existing PerAct2 13-task split from sim_rlbench/task_splits.py::PERACT2_BIMANUAL_TASKS:

cd code/reveal_vla_bimanual
python -m eval.run_peract2_task_sweep \
  --checkpoint /abs/path/checkpoint.pt \
  --output-root /abs/path/reports/peract2_13 \
  --episodes-per-task 25 \
  --episode-length 20 \
  --resolution 224 \
  --chunk-commit-steps 4 \
  --allow-unsupervised-planning \
  --headless

Also run direct single-task evaluations when debugging:

cd code/reveal_vla_bimanual
python -m eval.run_rlbench_rollout_eval \
  --checkpoint /abs/path/checkpoint.pt \
  --output-dir /abs/path/reports/rlbench_debug \
  --tasks RightOpenDrawer \
  --episodes-per-task 25 \
  --episode-length 20 \
  --resolution 224 \
  --plan \
  --chunk-commit-steps 4 \
  --allow-unsupervised-planning \
  --headless

This benchmark is not a direct match to the three target tasks, but it is the main public bimanual sanity check. It measures whether the structured modifications hurt or help general bimanual competence.

Required comparisons on this benchmark:

  • current repo best checkpoint
  • full improved model
  • no-planner ablation
  • compact world model ablation
  • no geometry ablation
  • no task-conditioning ablation

If external baseline code is available, evaluate against:

  • PerAct2
  • InterACT
  • VoxAct-B
  • AnyBimanual

If compute allows, also compare against foundation-scale baselines as a separate category:

  • TwinVLA
  • RDT-1B

Fairness requirements:

  • same camera setup if possible (front plus both wrists);
  • same resolution;
  • same episode length and reset policy;
  • same task list;
  • same number of evaluation episodes;
  • report whether baselines use extra large-scale pretraining.

This benchmark should report:

  • per-task success
  • mean success
  • mean return
  • path recoveries
  • noop fallbacks
  • plan-on vs plan-off
  • per-episode planner traces for error analysis

8.4 Layer D: deformable-manipulation public benchmarks

You do not yet have custom teleop data, so the closest public matches for bag and cloth should be used now.

Recommended benchmarks:

  • DeformableRavens
  • SoftGym cloth tasks
  • DaXBench cloth tasks

The exact subset should be chosen based on available tasks, but the mapping is straightforward. Bag-like opening and insertion tasks are the closest public proxy for the bag environment. Cloth lifting, separation, and manipulation tasks are the closest public proxy for the suitcase environment. There is no equally good public foliage benchmark, so the strengthened foliage proxy will remain the main stand-in until custom data exists.

Required evaluation protocol:

  • same observation modalities across methods;
  • same action horizon where possible;
  • same random seeds;
  • same episode budgets;
  • report both success and task-specific deformation metrics.

Add at least these extra metrics on the deformable benchmarks:

  • opening quality or aperture quality
  • hold persistence under actor motion
  • reocclusion or collapse rate
  • disturbance cost
  • fold-preservation or structural-preservation score

8.5 Layer E: optional exploratory / active-perception benchmark

If EFM-10 or BAP code and data are actually available when implementation starts, add them. That benchmark is conceptually close to your task family because it measures exploratory plus focused manipulation under occlusion. Do not block the project on it if code is not readily usable.

8.6 Layer F: optional broad generalization benchmark

If time allows, add RoboTwin 2.0 as a general bimanual breadth check. It is not a direct target-task match, but it is useful for checking whether the structured reveal-and-retrieve bias damages general bimanual transfer.

9. Baseline strategy

There are two baseline groups and they should not be mixed carelessly.

The first group is matched-data or matched-setting baselines. These are the most useful for fair engineering comparison. Use PerAct2, InterACT, VoxAct-B, and AnyBimanual if code is available in a compatible evaluation setting.

The second group is foundation-scale baselines. These are useful, but they are not apples-to-apples unless you disclose the pretraining and model scale difference clearly. Use TwinVLA and RDT-1B in this category if compute allows.

Do not declare victory because the improved model beats the current repo checkpoint. That is a necessary condition, not the target claim.

10. Acceptance criteria for “ready to collect real data”

Do not move into expensive teleop collection until all of the following are true.

First, the geometry and spatial memory tests pass and stay green for multiple checkpoints.

Second, the strengthened proxy benchmark shows that the full model beats the current repo baseline convincingly. The minimum bar should be improvement in overall proxy success plus improvement on at least two of the three task types.

Third, planner-on must beat planner-off on hard-negative ranking and on task success. If the planner does not beat the decoder baseline, then the explicit planning stack is not yet earning its complexity.

Fourth, the spatial world model must beat compact and null baselines on persistence and reocclusion prediction. If it does not, the planning story is still too weak.

Fifth, the improved model should at least match strong public baselines on the RLBench / PerAct2 suite, and ideally exceed them on the tasks most related to opening, holding, uncovering, and coordinated retrieval. If it is significantly behind there, the architecture is still too immature.
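These five criteria can be encoded as an explicit gate so that the go/no-go decision is mechanical rather than argued checkpoint by checkpoint. A minimal sketch, assuming a hypothetical `results` dict produced by the eval harness (none of these keys exist in the repo yet):

```python
def ready_for_real_data(results: dict) -> tuple[bool, list[str]]:
    """Check the five acceptance criteria; return (go, failed gate names)."""
    gates = {
        # 1. geometry and spatial memory tests green across checkpoints
        "geometry_tests_green": results.get("geometry_tests_green", False),
        # 2. beats repo baseline on proxy success and >= 2 of 3 task types
        "proxy_beats_baseline": results.get("proxy_beats_baseline", False)
            and results.get("task_types_improved", 0) >= 2,
        # 3. planner-on beats planner-off on ranking and task success
        "planner_on_beats_off": results.get("planner_on_beats_off", False),
        # 4. spatial world model beats compact and null baselines
        "world_model_beats_null": results.get("world_model_beats_null", False),
        # 5. at least matches strong public baselines on RLBench / PerAct2
        "matches_public_baselines": results.get("matches_public_baselines", False),
    }
    failed = [name for name, ok in gates.items() if not ok]
    return (not failed, failed)
```

Defaulting every gate to `False` is deliberate: a criterion that was never measured should block real-data collection, not pass silently.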

11. Recommended implementation order

Phase 1 should fix information flow and evaluation trustworthiness. Implement geometry propagation, camera orientation encoding, and path cleanup in tests/conftest.py. Then add the new geometry tests and rerun the current proxy benchmark.
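For the camera orientation encoding in Phase 1, one well-known option is the continuous 6D rotation representation (the first two columns of the rotation matrix), which avoids the discontinuities of Euler angles and quaternions. Whether DepthPatchAdapter should use 6D specifically is a design assumption here, not something the repo has decided; the helper below is an illustrative sketch:

```python
def rotation_to_6d(R: list[list[float]]) -> list[float]:
    """Flatten the first two columns of a 3x3 rotation matrix into a 6-vector.

    The third column is recoverable as the cross product of the first two,
    so no information is lost, and the mapping is continuous in the rotation,
    which tends to train better than Euler or quaternion inputs.
    """
    return [R[row][col] for col in (0, 1) for row in (0, 1, 2)]

# Identity orientation encodes to the first two basis vectors.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(rotation_to_6d(identity))  # [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```

The resulting 6-vector would be concatenated with the camera translation and projected alongside each view's depth patch tokens, so that the fusion module can distinguish identical depth maps seen from differently oriented cameras.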

Phase 2 should add task-aware semantic candidates and hard-negative data generation. This is the fastest path to making the planner meaningful without yet rewriting the full memory and world model stack.

Phase 3 should add task-conditioned reveal outputs and the strengthened proxy metrics. At this stage the proxy benchmark should start reflecting the real task failure modes.

Phase 4 should replace pooled memory and compact rollout with the new spatial memory and spatial world model. This is the biggest change and should only happen after the eval harness can tell whether it helped.

Phase 5 should run the full internal ablation suite, then RLBench / PerAct2, then deformable public benchmarks, and only then decide whether the architecture is strong enough to justify real-data collection.

12. What to avoid

Do not treat launch smoke as performance evaluation.

Do not keep teacher-centered Gaussian candidates as the main planner supervision source.

Do not remove task structure in favor of a generic monolithic BC model unless the structured architecture clearly fails. Nothing in the current repo proves that.

Do not use only mean success. These tasks need persistence, reocclusion, and structural-preservation metrics.

Do not claim that the current planner or the current world model is validated. Neither is, yet.

13. Minimal first patch set (the first pull request)

If only one implementation sprint is possible before deeper refactors, the first pull request should contain exactly this:

  1. fix geometry_tokens propagation from backbone to fusion to policy output;
  2. add camera rotation encoding in DepthPatchAdapter;
  3. add tests/test_geometry_tokens_propagate.py and tests/test_camera_rotation_geometry.py;
  4. replace hardcoded path logic in tests/conftest.py;
  5. extend run_reveal_benchmark.py reporting to save chunk_commit_steps, bootstrap confidence intervals, and paired-seed summaries;
  6. add semantic macro candidates in action_decoder.py without yet deleting the Gaussian fallback;
  7. add hard negative candidate generation in sim_reveal/procedural_envs.py;
  8. add the deterministic scripted proxy benchmark suite.

This first patch set will not make the model SOTA. It will make the repo trustworthy enough to support the larger refactor.
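Item 5 of the patch set can lean on two small self-contained helpers. This is a sketch of the reporting shape only, not the actual run_reveal_benchmark.py code; the function names and the per-seed dict layout are assumptions:

```python
import random
import statistics

def bootstrap_ci(successes: list[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for mean episode success."""
    rng = random.Random(seed)  # fixed seed so reports are reproducible
    means = sorted(
        statistics.fmean(rng.choices(successes, k=len(successes)))
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def paired_seed_summary(model_a: dict[int, float],
                        model_b: dict[int, float]) -> dict:
    """Per-seed paired differences (A minus B) over the shared seeds."""
    shared = sorted(model_a.keys() & model_b.keys())
    diffs = [model_a[s] - model_b[s] for s in shared]
    return {
        "seeds": shared,
        "mean_diff": statistics.fmean(diffs),
        "wins": sum(d > 0 for d in diffs),  # seeds where A beat B outright
    }
```

Pairing by seed matters here: with episode budgets this small, unpaired mean comparisons can flip sign on reruns, while per-seed win counts and a bootstrap interval make it visible when an "improvement" is within noise.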

14. Reference links

Repo root:
https://huggingface.co/lsnu/VLAarchtests/tree/main

Core files:
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/backbones.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/multiview_fusion.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/observation_memory.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/reveal_head.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/world_model.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/action_decoder.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/planner.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/policy.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/train/losses.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/sim_reveal/dataset.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/sim_reveal/procedural_envs.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/eval/run_reveal_benchmark.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/eval/run_rlbench_rollout_eval.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/eval/run_peract2_task_sweep.py

Public benchmark / baseline references to align against:
PerAct2 / RLBench2 bimanual benchmark: https://bimanual.github.io/
InterACT: https://dannyran123.github.io/interact/
VoxAct-B: https://voxact-b.github.io/
AnyBimanual: https://anybimanual.github.io/
TwinVLA: https://twinvla.github.io/
RDT-1B: https://rdt-robotics.github.io/rdt-robotics/
DeformableRavens: https://deformableravens.github.io/
SoftGym: https://sites.google.com/view/softgym/home
DaXBench: https://daxbench.github.io/
EFM / BAP: https://efmanipulation.github.io/
RoboTwin 2.0: https://robotwin-platform.github.io/

15. Final recommendation

The architecture should be pursued, but only in a narrower and more explicit form: task-structured bimanual reveal-and-retrieve under elastic occlusion. The current repo is close enough to that idea to be worth continuing. The most important next step is not collecting real data yet. It is making the geometry path real, making the planner learn from hard failure cases, and making the world model spatial enough that “maintain the opening while the other arm retrieves” is something the system can actually predict rather than merely imitate.