# Developer handoff: elastic-occlusion bimanual VLA on 1×L40S
This document is the working handoff for rebuilding the current repo into a credible research system for bimanual reveal/retrieve under elastic occlusion. It supersedes the narrower short-sprint handoff in `handoff/instructions4.md`. The short-sprint document is still useful as a proxy-benchmark checklist, but it is not enough for the next stage.
The project goal is not to invent a new general-purpose trunk. The goal is to attach a small, structured adapter to a strong public bimanual trunk, preserve general-task competence, and create measurable gains on tasks that look like the future real benchmark:
1. foliage reveal/retrieve (push leaves aside, keep them aside, then retrieve a hidden target),
2. bag opening/retrieve (open a compliant container enough for the other arm to see and retrieve),
3. folded-clothes suitcase retrieval (slight lift/separate, preserve fold structure, retrieve a hidden object).
The right short-term success condition is:
- general public tasks: `trunk + adapter` should be in the same ballpark as `trunk alone`,
- reveal/retrieve-like tasks: `trunk + adapter` should beat `trunk alone` and other generic baselines.
The adapter is where the novelty should live. The trunk should stay as standard and defensible as possible.
---
## 1. What the current repo actually shows
### 1.1 Core architecture in the repo
The current codebase contains three relevant policy families in `VLAarchtests/code/reveal_vla_bimanual/models/policy.py`:
- `BackboneOnlyPolicy`
- `InteractionBimanualPolicy`
- `ElasticRevealBimanualPolicy`
The latest elastic path is the relevant one for this project. It is a monolithic policy composed of:
- a frozen VL backbone wrapper (`models/backbones.py`),
- dual observation memory (`models/observation_memory.py`),
- an interaction / elastic-occlusion state head (`models/reveal_head.py`),
- a coordinated chunk decoder with task-routed proposal modes (`models/action_decoder.py`),
- an elastic-occlusion rollout model (`models/world_model.py`),
- a cascade planner with structured feasibility logic (`models/planner.py`).
This is the part worth preserving conceptually. The important fields in the current elastic state head already match the real tasks unusually well:
- visibility / target confidence,
- access corridor / insertion corridor,
- persistence / release-collapse,
- reocclusion,
- disturbance / damage,
- fold preservation / top-layer stability / lift-too-much risk.
Those signals are directly relevant to the future foliage, bag, and clothes tasks.
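As a concrete sketch of how those fields group together, the following dataclass is illustrative only (field names and the crude feasibility check are assumptions, not the repo's actual schema):

```python
from dataclasses import dataclass

# Hypothetical grouping of the elastic-occlusion state fields listed above.
# Names and thresholds are illustrative, not the repo's actual schema.
@dataclass
class RevealState:
    visibility: float          # target confidence, 0..1
    access_corridor: float     # access / insertion corridor quality, 0..1
    persistence: float         # does the reveal hold after release, 0..1
    reocclusion: float         # risk that occluders collapse back, 0..1
    disturbance: float         # damage / unwanted scene motion, 0..1
    fold_preservation: float   # cloth-specific: fold structure intact, 0..1

    def retrieve_ready(self, thresh: float = 0.5) -> bool:
        """Crude feasibility check: enough access and persistence, low reocclusion."""
        return (self.access_corridor > thresh
                and self.persistence > thresh
                and self.reocclusion < 1.0 - thresh)
```

The point of writing the state down this way is that every downstream component (gate, reranker, transition model) can consume the same small summary.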
### 1.2 What the current repo does **not** show
The repo does **not** currently show that the latest full architecture is a strong general bimanual policy. It also does **not** show that the heavy memory + world-model stack is helping.
The most important current findings from the repo are:
- In the proxy sprint summary, the base model is below random and below oracle on its own candidate set.
- Disabling memory improves the proxy mean over the base model.
- The planner matters.
- The best proxy result comes from task-routed checkpoint routing, not from a single unified learned model.
- The non-zero RLBench result in the “dual_push_nonzero” line is not the kind of fair architecture win needed for a paper claim. It is a retrieval/retargeting positive control, not a clean full-policy benchmark result.
- The local general-task anchor results are not yet strong enough to treat the current custom trunk path as a valid base.
### 1.3 What the existing tests are good for
The current tests are mostly of three kinds:
1. **Contract / plumbing tests**
These verify shapes, token paths, geometry propagation, dataset fields, shortlist plumbing, RVT wrapper output shapes, etc. They are useful and should stay.
2. **Directional proxy tests**
These verify that scripted “good” reveal actions beat obviously bad ones in the procedural proxy benchmark. These are useful because they validate that the proxy metrics are at least pointed in the correct direction.
3. **Evidence-free competence surrogates**
Several tests only prove that a feature toggles or produces different tensors (for example memory and geometry tests). They do not prove the feature helps task performance.
The current test suite is therefore necessary, but not sufficient. It validates software correctness and some proxy metric sanity. It does not validate benchmark strength.
### 1.4 Repo findings that should drive the redesign
Treat the following as the main empirical lessons from the current repo:
- **Keep**: explicit reveal-state prediction.
- **Keep**: task-aware macro proposals.
- **Keep**: feasibility gating for retrieve-like actions.
- **Question**: dual memory (current evidence is weak to negative).
- **Question**: heavy token-level world model (too expensive and under-justified).
- **Question**: local custom RVT path as the main scientific trunk (currently too fragile).
- **Do not claim**: that the current non-zero RLBench result proves the architecture works.
---
## 2. Research claim to target
Do **not** try to claim a new general VLA or a new general bimanual architecture.
The claim should be:
> A structured adapter for foundation bimanual policies that improves reveal/retrieve under elastic occlusion by predicting reveal-state variables (visibility, access, persistence, reocclusion, disturbance, fold preservation), generating task-routed reveal macros, and enforcing retrieve feasibility before execution.
This claim is much cleaner, and much closer to what the repo already hints at.
That claim is only defensible if all of the following are true:
1. the base trunk is strong and reproduced fairly,
2. the adapter causes little or no regression on public general tasks,
3. the adapter gives a real gain on public or proxy tasks that stress reveal/retrieve,
4. the gain cannot be explained away by trivial checkpoint routing alone.
---
## 3. Target system after refactor
The target architecture should be **smaller** than the current monolithic one.
### 3.1 Trunk
Use a strong public bimanual trunk with a faithful evaluation path. In order of preference:
1. **3D FlowMatch Actor (3DFA)**, if code/checkpoints are practical to evaluate fairly.
2. **Official PerAct2 / RVT-style stack**, if 3DFA is not practical.
3. **Official AnyBimanual** as a transfer baseline and possibly as the starting trunk if its code path is the most stable locally.
Do not continue making CLIP the scientific center of the project. The trunk should be imported as a stable base, not reinvented.
### 3.2 Adapter
The adapter should sit **above** the trunk and should be trainable with the trunk frozen. It should contain exactly four core pieces:
1. **Reveal-state head**
Predict scalar and low-resolution field variables for:
- visibility,
- access corridor / insertion corridor,
- persistence / support stability,
- reocclusion,
- disturbance,
- task-specific metrics (bag mouth, foliage opening, cloth fold preservation, top-layer stability).
2. **Task-routed proposal prior**
Generate a small number of macro proposal modes appropriate for the task family. Keep the current proposal vocabulary idea, but do not let it become a separate checkpoint-routing story. The task routing should be internal to one model.
3. **Retrieve-feasibility gate**
Before choosing retrieve or insert-like modes, require predicted access, persistence/support, and reocclusion to satisfy thresholds or a learned gating classifier. This is one of the strongest, most defensible pieces of structure in the current repo.
4. **Lightweight reveal-transition model**
A small transition model over reveal-state variables only. Do **not** keep the full token-heavy spatial rollout model as the default. Predict the next reveal-state summary (and optionally a tiny field map), not the entire scene token stack.
### 3.3 Optional memory
Make memory optional and minimal. The default should be either:
- no memory, or
- a very short reveal-state cache / exponential filter over a few recent steps.
Do not keep the current dual selective memory as a default dependency until it proves value on benchmark success.
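The "very short reveal-state cache" option can be as small as an exponential filter; this sketch assumes a plain vector interface and a default smoothing factor, neither of which is the repo's API:

```python
class RevealStateCache:
    """Minimal exponential filter over recent reveal-state summaries.
    A sketch of the 'very short reveal-state cache' option; alpha and the
    list-of-floats interface are assumptions, not repo API."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha   # weight on the newest observation
        self.state = None    # filtered reveal-state summary

    def update(self, summary: list[float]) -> list[float]:
        if self.state is None:
            self.state = list(summary)
        else:
            self.state = [self.alpha * s + (1 - self.alpha) * p
                          for s, p in zip(summary, self.state)]
        return self.state
```

Something this small is easy to ablate cleanly against the no-memory default, which is exactly the comparison Section 8 calls for.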
### 3.4 No-op / fallback path
This is critical.
The adapter must have a true **no-op** mode:
- on tasks outside the reveal/retrieve family, or
- when the adapter is uncertain,
the system should fall back to the trunk’s default action distribution or trunk shortlist.
This is the cleanest way to preserve general-task performance.
---
## 4. Concrete code changes
The fastest path is not to patch the current monolith forever. Refactor it into a stable trunk interface plus a narrow adapter package.
### 4.1 `models/backbones.py`
#### Changes required
- Replace the current “backbone wrapper does everything” mentality with a narrow `TrunkInterface`.
- Standardize outputs:
- latent tokens,
- optional trunk action distribution or trunk candidate set,
- any geometry features the adapter is allowed to use.
- Remove the assumption that CLIP is the main path.
- Keep the current CLIP path only as a development/debug baseline.
- Treat the current RVT wrapper as provisional until it matches an official evaluation path.
- Add an explicit `NoOpAdapterCompatibleTrunkOutput` schema so the adapter can be bypassed without shape hacks.
#### Why
The current wrapper mixes too much custom logic into the backbone path. That makes it hard to tell whether failures are due to the trunk, geometry handling, or the adapter.
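A minimal sketch of what the narrow interface could look like (class and attribute names are hypothetical; the real schema should be derived from whichever official trunk is chosen):

```python
from typing import Any, Optional, Protocol

# Hypothetical narrow trunk output; attribute names are illustrative.
class TrunkOutput:
    def __init__(self, tokens: Any, action: Any = None,
                 candidates: Any = None, geometry: Any = None):
        self.tokens = tokens          # latent tokens from the trunk
        self.action = action          # optional trunk action distribution / mean
        self.candidates = candidates  # optional trunk candidate set
        self.geometry = geometry      # geometry features the adapter may use

# The adapter only ever sees this protocol, never trunk internals.
class TrunkInterface(Protocol):
    def encode(self, observation: dict[str, Any]) -> TrunkOutput: ...
```

Keeping the interface this narrow is what makes the `NoOpAdapterCompatibleTrunkOutput` bypass trivial: the adapter either consumes a `TrunkOutput` or passes it through untouched.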
### 4.2 `models/policy.py`
#### Changes required
Split the current policy into:
- `FoundationTrunkPolicy`
- `ElasticOcclusionAdapter`
- `AdapterWrappedPolicy`
The wrapped policy should support three modes:
- `adapter_off`
- `adapter_noop`
- `adapter_active`
The execution contract should be:
1. get trunk tokens and trunk action / trunk candidates,
2. if adapter inactive or low confidence, return trunk action,
3. otherwise rank a small candidate set using the adapter and return the selected chunk.
#### Why
This makes no-regression testing possible. Right now the current monolithic policy hides whether the trunk is still intact.
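The three-mode execution contract can be sketched as follows; the class name, callable signatures, and confidence threshold are assumptions for illustration, not the repo's actual API:

```python
# Sketch of the adapter_off / adapter_noop / adapter_active contract.
class AdapterWrappedPolicy:
    def __init__(self, trunk_act, adapter_rank=None, mode="adapter_off",
                 confidence_threshold=0.5):
        self.trunk_act = trunk_act        # callable: obs -> (action, candidates)
        self.adapter_rank = adapter_rank  # callable: (obs, candidates) -> (action, confidence)
        self.mode = mode
        self.confidence_threshold = confidence_threshold

    def act(self, obs):
        action, candidates = self.trunk_act(obs)
        if self.mode in ("adapter_off", "adapter_noop") or self.adapter_rank is None:
            return action  # trunk path untouched: this is the no-op guarantee
        ranked, conf = self.adapter_rank(obs, candidates)
        if conf < self.confidence_threshold:
            return action  # low adapter confidence: fall back to the trunk
        return ranked
```

Note that the trunk is queried first in every mode, so `adapter_noop` is structurally guaranteed to return the trunk action rather than approximating it.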
### 4.3 `models/reveal_head.py`
#### Changes required
Keep the best part of the repo, but simplify and formalize it.
- Split outputs into:
- task-agnostic reveal variables,
- task-specific metrics,
- optional low-res spatial fields.
- Add masks so task-specific losses only apply when valid.
- Preserve the cloth-specific metrics. They are one of the best differentiators for the future suitcase benchmark.
- Add explicit calibration support (for example confidence outputs or logits) so the state head can be evaluated independently of policy success.
#### Why
The reveal-state head is likely the publishable core. It needs cleaner interfaces and evaluation, not more entanglement.
### 4.4 `models/action_decoder.py`
#### Changes required
Keep the current task proposal vocabulary concept, but tighten it:
- candidate 0 must always be the trunk/base action,
- proposal candidates must stay near the trunk action initially,
- proposal mode families should be internal to one model, not external checkpoint routing,
- add a generic fallback mode family for non-target tasks,
- keep explicit mode names for analysis and paper figures.
Current task families to preserve and clean up:
- foliage: `widen_gap`, `maintain_gap`, `insert_actor`, `retrieve`, etc.
- bag: `widen_mouth`, `maintain_mouth`, `probe_inside`, `insert_actor`, `retrieve`
- cloth: `lift_edge`, `separate_layer`, `stabilize_fold`, `maintain_lift`, `insert_actor`, `retrieve`
#### Why
The proposal vocabulary is useful. The current best proxy result already suggests task specialization matters. But the specialization must become a principled internal prior, not a checkpoint-routing workaround.
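Written out as data, the vocabulary with its candidate-0 invariant might look like this (the mode strings mirror the lists above; the dict layout and helper are an illustrative assumption):

```python
# Per-family proposal vocabularies; candidate 0 is always the trunk action.
PROPOSAL_MODES = {
    "generic": ["trunk_action"],  # fallback family for non-target tasks
    "foliage": ["trunk_action", "widen_gap", "maintain_gap",
                "insert_actor", "retrieve"],
    "bag":     ["trunk_action", "widen_mouth", "maintain_mouth",
                "probe_inside", "insert_actor", "retrieve"],
    "cloth":   ["trunk_action", "lift_edge", "separate_layer",
                "stabilize_fold", "maintain_lift", "insert_actor", "retrieve"],
}

def candidate_modes(task_family: str) -> list[str]:
    """Unknown families fall back to the generic vocabulary."""
    modes = PROPOSAL_MODES.get(task_family, PROPOSAL_MODES["generic"])
    assert modes[0] == "trunk_action"  # candidate-0 invariant
    return modes
```

Keeping the mode names explicit like this is also what makes the per-family usage statistics in Section 10 cheap to log.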
### 4.5 `models/planner.py`
#### Changes required
Refactor the planner into two explicit parts:
1. **hard/soft feasibility gate**
2. **residual reranker**
The gate should use reveal-state variables only. The reranker can use the lightweight transition model and proposal logits.
Also add:
- a clean `identity` planning mode,
- a clean `trunk_only` selection mode,
- an `adapter_confidence` score,
- diagnostics for every rejected retrieve-like candidate.
#### Why
The current planner appears to be one of the few useful parts of the architecture. It needs to be isolated and made measurable.
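The gate/reranker split can be sketched in a few lines; the threshold values, field names, and the `retrieve_like` flag are assumptions, not the repo's planner API:

```python
# Sketch of the two-part planner: a hard gate over reveal-state variables,
# then a rerank over the surviving candidates.
def feasibility_gate(state: dict, retrieve_like: bool) -> bool:
    """Retrieve-like candidates must clear access/persistence/reocclusion."""
    if not retrieve_like:
        return True
    return (state["access"] >= 0.5
            and state["persistence"] >= 0.5
            and state["reocclusion"] <= 0.5)

def select_candidate(candidates, states, scores) -> int:
    """Rerank gated survivors by score; fall back to candidate 0 if none pass."""
    survivors = [i for i, (c, s) in enumerate(zip(candidates, states))
                 if feasibility_gate(s, c.get("retrieve_like", False))]
    if not survivors:
        return 0  # identity / trunk_only fallback
    return max(survivors, key=lambda i: scores[i])
```

The diagnostics requirement above then amounts to logging every index the gate removes and why, which this structure makes trivial.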
### 4.6 `models/world_model.py`
#### Changes required
Do not keep the current full token-heavy elastic rollout model as the default research path.
Replace it with a much smaller transition model over:
- scalar reveal-state summaries,
- optionally one or two low-res fields (for example access map and support map),
- action macro / candidate metadata.
The transition model should predict:
- next visibility,
- next access corridor,
- next persistence / support,
- next reocclusion,
- next disturbance / fold metrics.
Only reintroduce a heavier spatial model if the lightweight model clearly helps.
#### Why
The current rollout model is too expensive and too under-validated for a single-L40S research loop.
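To make the intended scale concrete, here is a stand-in for the lightweight transition model as a single linear map over scalar reveal-state summaries plus a one-hot macro id. The dimensions, identity-style init, and the use of a plain linear map are assumptions; in practice this would be a small two-layer MLP:

```python
import math

class RevealTransition:
    """Toy-scale transition model: next reveal-state from (state, macro)."""

    def __init__(self, n_state: int, n_macros: int):
        self.n_state = n_state
        self.n_in = n_state + n_macros
        # identity-ish init: next state starts as a copy of the current state
        self.w = [[1.0 if i == j else 0.0 for j in range(self.n_in)]
                  for i in range(self.n_state)]

    def predict(self, state: list[float], macro: int) -> list[float]:
        onehot = [1.0 if k == macro else 0.0
                  for k in range(self.n_in - self.n_state)]
        x = state + onehot
        raw = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]
        # squash each predicted variable back into (0, 1)
        return [1.0 / (1.0 + math.exp(-4.0 * (r - 0.5))) for r in raw]
```

A model of this size makes short-horizon rollouts essentially free on one L40S, which is the whole argument against the token-heavy version.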
### 4.7 `models/observation_memory.py`
#### Changes required
Default behavior should be:
- disabled, or
- replaced by a tiny reveal-state cache.
If the current dual memory stays in the repo, mark it experimental. Either wire the suppression margin logic properly or remove it. Right now it looks half-finished and the current proxy evidence is not favorable.
#### Why
Memory is currently a likely liability, not a likely differentiator.
### 4.8 `train/losses.py`
#### Changes required
Reweight the training objective around what is actually learnable and measurable.
Required losses:
- action BC / trajectory loss from the trunk policy path,
- **candidate ranking loss** against oracle utility within the same candidate set,
- proposal mode classification / assignment,
- reveal-state regression/classification,
- retrieve-feasibility gate loss,
- lightweight transition-model loss,
- **no-regression distillation** from the trunk on general tasks,
- optional calibration loss for reveal-state confidence.
Losses to demote or remove unless justified by results:
- large generic memory losses,
- large token-level world-model reconstruction losses.
#### Why
The repo already points to the correct training target: close the gap to the oracle chooser on the candidate set. That is much better than adding more latent machinery.
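The task-specific masking requirement can be sketched directly; the metric key names, the squared-error choice, and the flat dict interface are illustrative assumptions:

```python
# Sketch of task-specific loss masking: bag metrics contribute zero loss on
# foliage/cloth samples, and so on. Key names are hypothetical.
TASK_KEYS = {
    "foliage": ["foliage_opening"],
    "bag": ["bag_mouth"],
    "cloth": ["fold_preservation", "top_layer_stability"],
}
SHARED_KEYS = ["visibility", "access", "persistence", "reocclusion", "disturbance"]

def masked_state_loss(pred: dict, target: dict, task_family: str) -> float:
    """Squared error over shared keys plus this family's keys only."""
    keys = SHARED_KEYS + TASK_KEYS.get(task_family, [])
    return sum((pred[k] - target[k]) ** 2
               for k in keys if k in pred and k in target)
```

The same masking structure is what `test_task_specific_loss_masking.py` (Section 9.2) should exercise.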
### 4.9 `train/trainer.py`
#### Changes required
Add explicit training regimes:
- `trunk_only_eval`
- `adapter_noop_eval`
- `adapter_train_frozen_trunk`
- `adapter_finetune_light`
- `general_distillation_only`
- `proxy_rank_only`
Freeze the trunk by default. Any trunk finetuning should be delayed until the adapter proves itself.
Also add a single switch that controls whether evaluation is:
- trunk only,
- adapter no-op,
- adapter active,
- adapter active with planner off,
- adapter active with gate off.
#### Why
The current trainer still reflects an architecture-search phase. The next phase needs controlled, fair comparisons.
### 4.10 Dataset / teacher generation code
Code for proposal alignment and proxy data generation already exists. Reuse it, but narrow it.

Required changes:
- generate oracle labels and candidate utilities for proxy tasks,
- export reveal-state supervision targets explicitly,
- export candidate-mode assignments,
- export task metadata separately from free-form language,
- ensure every sample can be evaluated in:
- trunk-only mode,
- no-op mode,
- adapter mode.
Do not let text strings be the only task family signal. Explicit task metadata must be available.
---
## 5. What to keep, what to remove, what to treat as provisional
### Keep
- explicit reveal-state variables,
- task-routed macro proposal vocabulary,
- retrieve-feasibility gate,
- geometry-aware observation path,
- existing proxy scripted sanity tests,
- candidate-ranking supervision.
### Remove from the default path
- heavy dual memory as a required component,
- full token-heavy rollout model,
- any claim based on checkpoint routing alone,
- any claim based on the retargeted demo positive control.
### Treat as provisional
- custom RVT wrapper,
- local RLBench general benchmark path until official baseline reproduction is clean,
- memory-related gains unless they appear in a proper task-success benchmark.
---
## 6. Benchmark strategy
The benchmark plan should be staged. Do not jump straight to a full RLBench sweep.
### Phase 0. Baseline reproduction
Goal: prove that the evaluation path is real.
Required outcome:
- at least one official public trunk reproduces a known strong score on a small anchor subset,
- one anchor task should match a public or repo-validated release closely enough to trust the pipeline.
If this fails, stop and fix evaluation before touching the adapter further.
### Phase 1. General-task anchor set
Use a small public anchor set that is broad enough to catch regressions, but small enough to run repeatedly on one L40S.
Recommended anchor tasks:
- coordinated push box,
- coordinated lift ball,
- dual push buttons,
- handover item,
- lift tray.
These are not the target application tasks. They are regression sentries.
Acceptance criterion:
- `adapter_noop` should be essentially identical to `trunk_only`,
- `adapter_active` should remain in the same ballpark as `trunk_only`,
- any loss on the anchor mean must be small and explainable.
If the trunk itself is weak on the chosen anchor set, replace the trunk. Do not proceed with a weak base.
### Phase 2. Existing proxy benchmark (internal shaping only)
Use the existing proxy suite as an architecture-shaping instrument, not as the main paper result.
Preserve the narrow stress slices from the existing handoff:
- nominal,
- high reocclusion,
- camera perturbation.
Preserve the task slices:
- foliage,
- bag,
- cloth.
Keep the simple baselines:
- random,
- candidate 0,
- oracle chooser,
- scripted good/bad actions.
What to measure beyond success:
- reveal-state prediction correlation with proxy ground truth,
- ranking correlation with oracle utility,
- gate precision/recall for unsafe retrieve attempts,
- effect of proposal families by task,
- reocclusion after reveal,
- fold-preservation metrics on cloth slices.
### Phase 3. Public target-like tasks
This is the most important new benchmark stage.
The future real benchmark does not exist yet, so approximate it with public tasks that stress:
- containment opening,
- hidden-object access,
- cluttered retrieval,
- partial reveal before retrieve,
- disturbance control.
Use a small public target-like subset first. Candidate tasks to prioritize:
- open drawer,
- put item in drawer / retrieve-like container interactions,
- take shoes out of box,
- shell game,
- pick up notebook,
- straighten rope.
The exact final subset can change if some tasks prove unstable, but the principle should stay the same: these tasks should be more target-like than the anchor set.
### Phase 4. Deformable / garment benchmarks
For the clothes/suitcase direction, add a public deformable benchmark as soon as the infrastructure is stable.
Priority order:
1. GarmentLab (if practical to run),
2. GarmentPile or similar garment-clutter retrieval benchmarks,
3. other public deformable-manipulation tasks only if they are easy to integrate.
This stage matters because the suitcase task is probably the strongest future novelty angle.
### Phase 5. Broader robustness benchmark
Only after phases 0–4 succeed, consider a broader dual-arm benchmark such as RoboTwin 2.0 or a wider RLBench/PerAct2 sweep.
Do not do this early. It is expensive and not yet the right bottleneck.
---
## 7. Baselines that must be included
At minimum, every meaningful experiment should compare against:
1. **the same trunk alone**
This is the most important baseline.
2. **the same trunk with adapter disabled / no-op**
This isolates whether the wrapper is already damaging performance.
3. **PerAct2**
Use official or faithful public numbers / code path.
4. **AnyBimanual**
Important because the repo already references it and because transfer from strong unimanual data is relevant.
5. **3DFA**, if evaluation is practical
This is the strongest public benchmark baseline for bimanual PerAct2-style tasks and should be the aspirational reference.
Optional if practical:
- CoFreeVLA (useful because it is also a structured auxiliary head on top of a VLA),
- ActiveVLA (conceptually relevant for active perception),
- task-specific academic comparisons in writing (Vision in Action, bag SOI model, garment retrieval papers), even if not reproduced in code.
---
## 8. Required ablations
The current repo already shows that “big architecture blob vs baseline” is not informative enough. The next paper-worthy evidence must isolate the actual source of gain.
Run the following ablations in order.
### General-task ablations
1. `trunk_only`
2. `trunk + adapter_noop`
3. `trunk + adapter_active (gate only)`
4. `trunk + adapter_active (gate + reveal-state head)`
5. `trunk + adapter_active (gate + reveal-state + proposal prior)`
6. `trunk + adapter_active (gate + reveal-state + proposal prior + lightweight transition model)`
7. optional: `+ short reveal cache`
Interpretation target:
- general tasks should not fall apart as structure is added,
- if they do, the adapter is not sufficiently no-op-safe.
### Target-like ablations
1. full adapter
2. no gate
3. no proposal prior
4. no task conditioning
5. no lightweight transition model
6. no geometry
7. no depth
8. no cloth-specific metrics (for the cloth slice only)
9. checkpoint routing only (to prove that routing alone is not the full story)
Interpretation target:
- gate should matter,
- proposal prior should matter,
- cloth-specific metrics should matter on cloth-like slices,
- routing alone should not account for the final gain.
### Memory ablations
Do these late, not early:
- no memory,
- short reveal cache,
- current dual memory.
If dual memory does not clearly beat no memory on actual task success, drop it.
---
## 9. Tests to add or rewrite
The current suite is decent for plumbing. It now needs benchmark-faithfulness tests and ablation-protecting tests.
### 9.1 Keep the current useful tests
Keep and maintain the existing tests that verify:
- proxy scripted benchmark directionality,
- geometry path activation under camera perturbation,
- dataset geometry fields,
- proposal shortlist plumbing,
- task metadata override behavior,
- candidate ranking loss behavior.
### 9.2 Add the following tests
#### `test_trunk_noop_equivalence.py`
With adapter disabled or in strict no-op mode, verify that:
- action mean / candidate set match the trunk path exactly (or within tight tolerance),
- no planner or routing side effects change outputs.
This is the single most important new test.
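The comparison at the heart of this test can be sketched as a helper (the name, tolerance, and list-of-action-vectors shape are assumptions; the real test would run the actual `trunk_only` and `adapter_noop` modes on one frozen batch):

```python
# Hypothetical helper for the no-op equivalence check.
def assert_noop_equivalent(trunk_actions, noop_actions, tol=1e-6):
    """Compare per-step action vectors from trunk_only vs adapter_noop."""
    assert len(trunk_actions) == len(noop_actions)
    for step, (a, b) in enumerate(zip(trunk_actions, noop_actions)):
        diff = max(abs(x - y) for x, y in zip(a, b))
        assert diff <= tol, f"no-op mode diverged at step {step}: max diff {diff}"
```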
#### `test_trunk_interface_official_eval_parity.py`
For one selected official trunk and one frozen batch, verify that:
- preprocessing,
- camera handling,
- token layout,
- action decoding,
match the official implementation path closely enough to trust the wrapper.
This should be an integration test, not just a shape test.
#### `test_adapter_gate_blocks_unsafe_retrieve.py`
Build explicit synthetic reveal states where retrieve should and should not be allowed. The current planner already contains similar logic; formalize it into a direct unit test.
#### `test_reveal_state_metric_calibration.py`
For proxy env rollouts with known labels, verify that predicted reveal-state metrics correlate with the simulator labels and are not collapsed.
#### `test_candidate_ranking_matches_oracle.py`
Given a batch with oracle candidate utilities from the proxy env, verify that training reduces the gap between the model ranker and the oracle chooser.
This should be a real learned ranking test, not just a toy-array loss test.
#### `test_task_specific_loss_masking.py`
Verify that foliage metrics are not trained on bag/cloth tasks, bag metrics are not trained on foliage/cloth tasks, etc.
#### `test_cloth_specific_metrics_affect_selection.py`
For cloth-like proxy cases, verify that fold-preservation / lift-too-much risk can change candidate selection even when nominal reachability is similar.
#### `test_general_eval_protocol_is_identical.py`
Ensure that `trunk_only`, `adapter_noop`, and `adapter_active` all use the same observation stack, same action horizon, same task subset, and same evaluation step budget.
This prevents accidental unfairness.
### 9.3 Promote some current tests from “unit” to “benchmark guardrails”
The following should become part of the required CI / pre-run checklist:
- geometry path smoke test,
- dataset geometry/history test,
- no-op equivalence test,
- benchmark protocol identity test.
---
## 10. Metrics that matter
Do not rely on success alone.
### General-task metrics
- task success,
- return (if available),
- variance across seeds,
- regression relative to trunk.
### Target-like metrics
- success,
- visibility gain,
- access / insertion corridor gain,
- persistence / support gain,
- reocclusion after reveal,
- disturbance / damage,
- fold preservation (cloth-like slice),
- unsafe retrieve rate,
- oracle gap on candidate ranking.
### Calibration / diagnostics
- correlation of predicted reveal metrics with simulator ground truth,
- gate precision / recall,
- candidate shortlist recall of oracle candidate,
- proposal mode usage by task,
- fallback rate to trunk.
The fallback rate matters. If the adapter almost never activates, then the system may preserve general performance but not meaningfully help target tasks. If it always activates and hurts general tasks, it is not safe enough.
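Two of the diagnostics above can be pinned down with one-line definitions; the exact formulas here are assumptions, offered so the metrics are logged consistently across runs:

```python
# Illustrative definitions for two diagnostics: oracle gap and fallback rate.
def oracle_gap(chosen_utility: list[float], oracle_utility: list[float]) -> float:
    """Mean per-episode shortfall of the model's chosen candidate vs the
    oracle chooser on the same candidate set (0 means oracle-matching)."""
    return sum(o - c for c, o in zip(chosen_utility, oracle_utility)) / len(chosen_utility)

def fallback_rate(decisions: list[str]) -> float:
    """Fraction of steps where the adapter deferred to the trunk."""
    return sum(d == "trunk" for d in decisions) / len(decisions)
```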
---
## 11. Acceptance gates
These gates should determine whether to continue, simplify, or stop.
### Gate A. Trunk validity
Pass only if an official or faithful trunk path is clearly non-trivial on the anchor set.
If this fails, stop. Do not spend effort on the adapter yet.
### Gate B. No-op safety
Pass only if `adapter_noop` is effectively identical to `trunk_only`.
If this fails, stop and fix the wrapper.
### Gate C. General-task parity
Pass only if `adapter_active` stays in the same ballpark as `trunk_only` on the anchor set. A small drop may be acceptable, but not a collapse.
Use a simple rule for the first pass:
- mean absolute drop on the anchor set should be very small,
- no single anchor task should collapse catastrophically.
If the adapter is helping target-like tasks but causing a broad general-task collapse, the architecture is not ready.
### Gate D. Target-like gain
Pass only if the full adapter clearly beats:
- trunk alone,
- adapter no-op,
- random,
- candidate 0,
- and ideally narrows the oracle gap.
This is where the architecture starts to become scientifically interesting.
### Gate E. Non-trivial novelty
Pass only if the gain is not explained almost entirely by checkpoint routing or trivial task labels. The final model should be a single structured adapter, not a routing script disguised as a model.
---
## 12. Recommended training strategy on 1×L40S
The compute constraint implies one principle: **do not retrain the trunk repeatedly**.
### Use this strategy
1. Choose one strong trunk.
2. Freeze it.
3. Build the adapter around it.
4. Run many cheap adapter experiments.
5. Only consider light trunk finetuning after the adapter is already useful.
### Practical guidelines
- mixed precision everywhere practical,
- gradient checkpointing if needed,
- keep candidate counts modest,
- keep rollout horizon short,
- keep the transition model lightweight,
- train on a narrow but representative task set,
- log every candidate-level diagnostic needed for offline analysis.
### What not to do
- do not repeatedly launch full-scale trunk retraining,
- do not run full benchmark sweeps before anchor parity is established,
- do not expand the world model before the lightweight version proves value,
- do not hide regressions behind different seeds, different demos, or different eval protocols.
---
## 13. Minimal execution order
Follow this order. Do not reorder it casually.
### Step 1. Freeze the current repo as a historical branch
Keep it for reference, but stop treating it as the final architecture.
### Step 2. Build a clean trunk interface
Get one official trunk path working and reproducible.
### Step 3. Implement adapter no-op mode
This must pass no-op equivalence tests before any learning claims are made.
### Step 4. Port only the strong ideas
Port:
- reveal-state head,
- task-routed macro proposal prior,
- retrieve-feasibility gate.
Do **not** port the full heavy memory/world-model stack by default.
### Step 5. Add a lightweight transition model
Only over reveal-state summaries.
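The contract matters more than the architecture here: the transition model rolls forward low-dimensional reveal-state summaries, never raw observations. The sketch below uses a hand-written linear update with clipping as a placeholder for a small learned MLP; the dimension names and the `transition` helper are assumptions for illustration.

```python
# Sketch of the lightweight transition model's contract. It maps
# (reveal-state summary, macro embedding) -> next reveal-state summary,
# with entries kept in [0, 1]. The linear map stands in for a tiny MLP.

REVEAL_DIM = 4  # e.g. [opening_visibility, occluder_tension,
                #       target_exposure, graspability] -- assumed names

def transition(summary, macro_embedding, weights, bias):
    """One-step rollout: next = clip(W @ [summary; macro] + b, 0, 1)."""
    x = summary + macro_embedding  # concatenate state and macro features
    nxt = []
    for row, b in zip(weights, bias):
        v = sum(w * xi for w, xi in zip(row, x)) + b
        nxt.append(min(1.0, max(0.0, v)))  # reveal-state entries live in [0, 1]
    return nxt

# Identity weights over the summary half, zero macro: state is unchanged.
identity = [[1.0 if i == j else 0.0 for j in range(2 * REVEAL_DIM)]
            for i in range(REVEAL_DIM)]
s0 = [0.2, 0.5, 0.1, 0.3]
macro = [0.0] * REVEAL_DIM
s1 = transition(s0, macro, identity, [0.0] * REVEAL_DIM)
assert s1 == s0

# A large bias saturates every entry at 1.0, exercising the clipping.
s_hi = transition(s0, macro, identity, [2.0] * REVEAL_DIM)
assert s_hi == [1.0] * REVEAL_DIM
```

Because rollouts touch only a handful of floats per step, reranking a modest candidate set over a short horizon stays cheap on a single L40S.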
### Step 6. Train adapter-only on proxy supervision and ranking
Focus on oracle-gap reduction and reveal-state prediction quality.
### Step 7. Run anchor parity benchmark
If parity fails, stop and simplify.
### Step 8. Run target-like public subset and existing proxy suite
If gains appear only on the internal proxy and nowhere else, the architecture is still too benchmark-shaped.
### Step 9. Add garment/deformable benchmark
This is the most likely path to a strong suitcase/clothes result.
### Step 10. Prepare the real-world data plan only after sim evidence is strong
The real teleop benchmark should come after a strong sim go/no-go decision, not before.
---
## 14. What “novel enough” should mean here
The novelty should be modest and crisp. It does not need to be a giant new architecture.
A reasonable novelty claim is:
- a foundation-policy-compatible structured adapter,
- explicit reveal-state variables for elastic occlusion,
- task-routed reveal macros,
- retrieve-feasibility gating,
- lightweight reveal-state rollout / reranking.
This is a good paper if:
- the base trunk is respected,
- the adapter is small,
- the gains are real on the target-like tasks,
- the general-task regression is small,
- the ablations isolate the contribution cleanly.
This is **not** a good paper if the final story is:
- “we replaced the trunk,”
- “we added many modules and one of them helped a bit,”
- “we route to a better checkpoint for each task,”
- “we get non-zero on one RLBench branch because demo retrieval rescued it.”
---
## 15. Proposed paper positioning (for later)
If the system works, position it against two groups of prior work.
### General bimanual policy baselines
- PerAct2,
- AnyBimanual,
- 3D FlowMatch Actor,
- optionally CoFreeVLA as an “auxiliary structured head” comparator.
### Target-task conceptual neighbors
- active bag reveal/retrieve from demonstrations,
- active perception for manipulation under occlusion,
- bag-specific SOI latent-dynamics models,
- occlusion-aware hidden-object retrieval in clutter,
- garment clutter retrieval / garment manipulation benchmarks.
The paper should say: generic bimanual foundation policies are good at general dual-arm manipulation, but they lack explicit reveal-state structure for elastic occlusion tasks. The adapter adds that structure while preserving general capability.
---
## 16. Deliverables expected from the developer
The handoff is not complete until the following exist.
### Code deliverables
- clean trunk interface,
- adapter package,
- no-op path,
- lightweight transition model,
- benchmark scripts for anchor, proxy, and target-like subsets,
- required new tests,
- config files for all reported experiments.
### Experimental deliverables
- trunk-only anchor benchmark report,
- adapter-noop parity report,
- full ablation report,
- target-like benchmark report,
- cloth/deformable benchmark report,
- candidate ranking / oracle gap diagnostics,
- reveal-state calibration plots.
### Reporting format
Every report should include:
- exact checkpoint,
- exact demos,
- exact seeds,
- exact task subset,
- exact eval protocol,
- whether the adapter was off / noop / active,
- whether planner/gate/transition model were enabled,
- per-task scores and mean.
No undocumented “special” branches should be used for headline results.
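One way to enforce that every report carries these fields is to make the record a typed structure rather than free-form markdown front matter. The field names below are suggestions, not an existing schema in the repo.

```python
from dataclasses import dataclass, field, asdict

# Sketch of a per-report metadata record covering the required fields.
# Field names are proposals; adapt them to whatever the repo standardizes on.

@dataclass
class EvalReport:
    checkpoint: str            # exact checkpoint path or hash
    demos: str                 # exact demo set identifier
    seeds: list                # exact seeds
    task_subset: list          # exact task subset
    eval_protocol: str         # exact eval protocol name/version
    adapter_mode: str          # "off" | "noop" | "active"
    planner_enabled: bool
    gate_enabled: bool
    transition_model_enabled: bool
    per_task_scores: dict = field(default_factory=dict)

    @property
    def mean_score(self):
        s = self.per_task_scores
        return sum(s.values()) / len(s) if s else 0.0

r = EvalReport(
    checkpoint="ckpt_0123.pt", demos="demo_set_v1", seeds=[0, 1, 2],
    task_subset=["dual_push_buttons"], eval_protocol="anchor_v1",
    adapter_mode="noop", planner_enabled=False, gate_enabled=False,
    transition_model_enabled=False,
    per_task_scores={"dual_push_buttons": 0.7},
)
assert r.adapter_mode in {"off", "noop", "active"}
```

`asdict(r)` serializes straight to JSON/YAML, so the same record can be dumped next to each report directory and diffed across runs.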
---
## 17. Immediate next actions
1. Pick the trunk to standardize around.
2. Build and validate the no-op wrapper.
3. Strip the adapter down to:
- reveal-state head,
- proposal prior,
- retrieve gate.
4. Replace the heavy world model with a lightweight reveal-state transition model.
5. Run anchor parity.
6. Run proxy ranking and target-like subset.
7. Decide whether memory is dropped permanently.
8. Add garment benchmark integration.
That is the shortest path from the current repo to a defensible paper candidate.
---
## 18. Appendix: repo evidence that motivated this handoff
Relevant repo locations to inspect while implementing:
- Main model stack:
- `VLAarchtests/code/reveal_vla_bimanual/models/policy.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/backbones.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/reveal_head.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/action_decoder.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/planner.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/observation_memory.py`
- `VLAarchtests/code/reveal_vla_bimanual/models/world_model.py`
- Training / losses:
- `VLAarchtests/code/reveal_vla_bimanual/train/losses.py`
- `VLAarchtests/code/reveal_vla_bimanual/train/trainer.py`
- `VLAarchtests/code/reveal_vla_bimanual/train/build_aligned_proposal_dataset.py`
- Existing tests worth keeping:
- `VLAarchtests/tests/test_proxy_scripted_bench.py`
- `VLAarchtests/tests/test_geometry_matters_under_camera_perturbation.py`
- `VLAarchtests/tests/test_memory_matters_under_high_reocclusion.py`
- `VLAarchtests/tests/test_rlbench_dataset_rgbd_geometry.py`
- `VLAarchtests/tests/test_candidate_ranking_loss.py`
- `VLAarchtests/tests/test_rvt_backbone_forward.py`
- Existing reports that matter:
- `VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary.md`
- `VLAarchtests/artifacts/reports/task_routed_proxy_v1/summary.md`
- `reports/true_baseline_compare_subset3_v1/...`
- `reports/general_task_anchor_20260330_dual_push_buttons/...`
- `reports/dual_push_nonzero_branch_20260330/...`
- `reports/dual_push_full_arch_hybrid_20260331/...`
Use those reports as a diagnosis of what is weak, not as proof that the current architecture is already ready.
---
## 19. External references to keep in mind
General bimanual baselines and nearby work:
- PerAct2 benchmark and baselines: https://arxiv.org/abs/2407.00278
- AnyBimanual: https://bimanual.github.io/
- 3D FlowMatch Actor (3DFA): https://arxiv.org/abs/2508.11002
- CoFreeVLA: https://arxiv.org/abs/2601.21712
- ActiveVLA: https://arxiv.org/abs/2601.08325
Target-task conceptual neighbors:
- Vision in Action (active bag reveal/retrieve from human demonstrations): https://arxiv.org/html/2506.15666v1
- Bimanual Deformable Bag Manipulation with SOI neural dynamics: https://arxiv.org/abs/2401.11432
- Occlusion-Aware Search for Object Retrieval in Clutter: https://ieeexplore.ieee.org/document/9197067
- GarmentPile++ / cluttered garment retrieval: https://arxiv.org/abs/2603.04158
- RoboTwin 2.0 benchmark: https://arxiv.org/abs/2506.18088
Add the exact GarmentLab citation separately if that benchmark is included in the final experimental plan.
---
## Final instruction to the implementer
Do not try to rescue the current architecture by adding even more structure. The repo already revealed the answer: the good idea is narrow. Keep the structured reveal-state adapter, keep the retrieve gate, keep task-aware proposals, and force the whole design to prove two things cleanly:
1. it does not break a strong trunk on general bimanual tasks,
2. it improves reveal/retrieve under elastic occlusion.
If both are true, the project is in good shape. If either is false, simplify further rather than expanding again.