# Developer handoff: elastic-occlusion bimanual VLA on 1×L40S

This document is the working handoff for rebuilding the current repo into a credible research system for bimanual reveal/retrieve under elastic occlusion. It supersedes the narrower short-sprint handoff in `handoff/instructions4.md`. The short-sprint document is still useful as a proxy-benchmark checklist, but it is not enough for the next stage.

The project goal is not to invent a new general-purpose trunk. The goal is to attach a small, structured adapter to a strong public bimanual trunk, preserve general-task competence, and create measurable gains on tasks that look like the future real benchmark:

1. foliage reveal/retrieve (push leaves aside, keep them aside, then retrieve a hidden target),
2. bag opening/retrieve (open a compliant container enough for the other arm to see and retrieve),
3. folded-clothes suitcase retrieval (slight lift/separate, preserve fold structure, retrieve a hidden object).

The right short-term success condition is:

- general public tasks: `trunk + adapter` should be in the same ballpark as `trunk alone`,
- reveal/retrieve-like tasks: `trunk + adapter` should beat `trunk alone` and other generic baselines.

The adapter is where the novelty should live. The trunk should stay as standard and defensible as possible.

---

## 1. What the current repo actually shows

### 1.1 Core architecture in the repo

The current codebase contains three relevant policy families in `VLAarchtests/code/reveal_vla_bimanual/models/policy.py`:

- `BackboneOnlyPolicy`
- `InteractionBimanualPolicy`
- `ElasticRevealBimanualPolicy`

The latest elastic path is the relevant one for this project. It is a monolithic policy composed of:

- a frozen VL backbone wrapper (`models/backbones.py`),
- dual observation memory (`models/observation_memory.py`),
- an interaction / elastic-occlusion state head (`models/reveal_head.py`),
- a coordinated chunk decoder with task-routed proposal modes (`models/action_decoder.py`),
- an elastic-occlusion rollout model (`models/world_model.py`),
- a cascade planner with structured feasibility logic (`models/planner.py`).

This is the part worth preserving conceptually. The important fields in the current elastic state head already match the real tasks unusually well:

- visibility / target confidence,
- access corridor / insertion corridor,
- persistence / release-collapse,
- reocclusion,
- disturbance / damage,
- fold preservation / top-layer stability / lift-too-much risk.

Those signals are directly relevant to the future foliage, bag, and clothes tasks.

### 1.2 What the current repo does **not** show

The repo does **not** currently show that the latest full architecture is a strong general bimanual policy. It also does **not** show that the heavy memory + world-model stack is helping.

The most important current findings from the repo are:

- In the proxy sprint summary, the base model is below random and below oracle on its own candidate set.
- Disabling memory improves the proxy mean over the base model.
- The planner matters.
- The best proxy result comes from task-routed checkpoint routing, not from a single unified learned model.
- The non-zero RLBench result in the `dual_push_nonzero` line is not the kind of fair architecture win needed for a paper claim. It is a retrieval/retargeting positive control, not a clean full-policy benchmark result.
- The local general-task anchor results are not yet strong enough to treat the current custom trunk path as a valid base.

### 1.3 What the existing tests are good for

The current tests are mostly of three kinds:

1. **Contract / plumbing tests**  
   These verify shapes, token paths, geometry propagation, dataset fields, shortlist plumbing, RVT wrapper output shapes, etc. They are useful and should stay.

2. **Directional proxy tests**  
   These verify that scripted “good” reveal actions beat obviously bad ones in the procedural proxy benchmark. These are useful because they validate that the proxy metrics are at least pointed in the correct direction.

3. **Evidence-free competence surrogates**  
   Several tests only prove that a feature toggles or produces different tensors (for example memory and geometry tests). They do not prove the feature helps task performance.

The current test suite is therefore necessary, but not sufficient. It validates software correctness and some proxy metric sanity. It does not validate benchmark strength.

### 1.4 Repo findings that should drive the redesign

Treat the following as the main empirical lessons from the current repo:

- **Keep**: explicit reveal-state prediction.
- **Keep**: task-aware macro proposals.
- **Keep**: feasibility gating for retrieve-like actions.
- **Question**: dual memory (current evidence is weak to negative).
- **Question**: heavy token-level world model (too expensive and under-justified).
- **Question**: local custom RVT path as the main scientific trunk (currently too fragile).
- **Do not claim**: that the current non-zero RLBench result proves the architecture works.

---

## 2. Research claim to target

Do **not** try to claim a new general VLA or a new general bimanual architecture.

The claim should be:

> A structured adapter for foundation bimanual policies that improves reveal/retrieve under elastic occlusion by predicting reveal-state variables (visibility, access, persistence, reocclusion, disturbance, fold preservation), generating task-routed reveal macros, and enforcing retrieve feasibility before execution.

This claim is much cleaner, and much closer to what the repo already hints at.

That claim is only defensible if all of the following are true:

1. the base trunk is strong and reproduced fairly,
2. the adapter causes little or no regression on public general tasks,
3. the adapter gives a real gain on public or proxy tasks that stress reveal/retrieve,
4. the gain cannot be explained away by trivial checkpoint routing alone.

---

## 3. Target system after refactor

The target architecture should be **smaller** than the current monolithic one.

### 3.1 Trunk

Use a strong public bimanual trunk with a faithful evaluation path. In order of preference:

1. **3D FlowMatch Actor (3DFA)**, if code/checkpoints are practical to evaluate fairly.
2. **Official PerAct2 / RVT-style stack**, if 3DFA is not practical.
3. **Official AnyBimanual** as a transfer baseline and possibly as the starting trunk if its code path is the most stable locally.

Do not continue making CLIP the scientific center of the project. The trunk should be imported as a stable base, not reinvented.

### 3.2 Adapter

The adapter should sit **above** the trunk and should be trainable with the trunk frozen. It should contain exactly four core pieces:

1. **Reveal-state head**  
   Predict scalar and low-resolution field variables for:
   - visibility,
   - access corridor / insertion corridor,
   - persistence / support stability,
   - reocclusion,
   - disturbance,
   - task-specific metrics (bag mouth, foliage opening, cloth fold preservation, top-layer stability).

2. **Task-routed proposal prior**  
   Generate a small number of macro proposal modes appropriate for the task family. Keep the current proposal vocabulary idea, but do not let it become a separate checkpoint-routing story. The task routing should be internal to one model.

3. **Retrieve-feasibility gate**  
   Before choosing retrieve or insert-like modes, require predicted access, persistence/support, and reocclusion to satisfy thresholds or a learned gating classifier. This is one of the strongest, most defensible pieces of structure in the current repo.

4. **Lightweight reveal-transition model**  
   A small transition model over reveal-state variables only. Do **not** keep the full token-heavy spatial rollout model as the default. Predict the next reveal-state summary (and optionally a tiny field map), not the entire scene token stack.

### 3.3 Optional memory

Make memory optional and minimal. The default should be either:

- no memory, or
- a very short reveal-state cache / exponential filter over a few recent steps.

Do not keep the current dual selective memory as a default dependency until it proves value on benchmark success.
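The "very short reveal-state cache / exponential filter" option can be sketched in a few lines. This is illustrative only: the class name, keys, and smoothing factor are assumptions, not existing repo API.

```python
# Sketch of the "short reveal cache" option: an exponential filter over
# recent reveal-state summaries. All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class RevealStateCache:
    """Exponential moving average over scalar reveal-state variables."""
    alpha: float = 0.5          # smoothing factor; 1.0 means no memory at all
    state: dict = field(default_factory=dict)

    def update(self, summary: dict) -> dict:
        # Blend the new summary into the cached one, key by key.
        for k, v in summary.items():
            prev = self.state.get(k, v)
            self.state[k] = self.alpha * v + (1.0 - self.alpha) * prev
        return dict(self.state)

    def reset(self) -> None:
        self.state.clear()

cache = RevealStateCache(alpha=0.5)
cache.update({"visibility": 0.0, "reocclusion": 1.0})
smoothed = cache.update({"visibility": 1.0, "reocclusion": 0.0})
# smoothed["visibility"] == 0.5 after steps 0.0 then 1.0 with alpha=0.5
```

The point of keeping it this small is that it can be ablated against "no memory" in minutes, which the current dual memory cannot.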

### 3.4 No-op / fallback path

This is critical.

The adapter must have a true **no-op** mode:

- on tasks outside the reveal/retrieve family, or
- when the adapter is uncertain,

the system should fall back to the trunk’s default action distribution or trunk shortlist.

This is the cleanest way to preserve general-task performance.
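The fallback rule above reduces to a small dispatch function. The task-family set, threshold, and function names below are hypothetical placeholders, not current repo code.

```python
# Sketch of the no-op / fallback contract: outside the reveal/retrieve task
# family, or below a confidence floor, return the trunk action untouched.
# Names and thresholds are illustrative assumptions.

REVEAL_FAMILIES = {"foliage", "bag", "cloth"}

def select_action(trunk_action, adapter_action, task_family: str,
                  adapter_confidence: float, conf_floor: float = 0.6):
    """Return (action, used_adapter) under the no-op contract."""
    if task_family not in REVEAL_FAMILIES:
        return trunk_action, False      # true no-op on non-target tasks
    if adapter_confidence < conf_floor:
        return trunk_action, False      # uncertain adapter defers to trunk
    return adapter_action, True

action, used = select_action("trunk_chunk", "adapter_chunk",
                             task_family="stack_blocks",
                             adapter_confidence=0.9)
# non-target task: trunk_chunk is returned and used is False
```

Keeping this logic in one place makes the fallback rate directly loggable, which Section 10 relies on.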

---

## 4. Concrete code changes

The fastest path is not to patch the current monolith forever. Refactor it into a stable trunk interface plus a narrow adapter package.

### 4.1 `models/backbones.py`

#### Changes required

- Replace the current “backbone wrapper does everything” mentality with a narrow `TrunkInterface`.
- Standardize outputs:
  - latent tokens,
  - optional trunk action distribution or trunk candidate set,
  - any geometry features the adapter is allowed to use.
- Remove the assumption that CLIP is the main path.
- Keep the current CLIP path only as a development/debug baseline.
- Treat the current RVT wrapper as provisional until it matches an official evaluation path.
- Add an explicit `NoOpAdapterCompatibleTrunkOutput` schema so the adapter can be bypassed without shape hacks.

#### Why

The current wrapper mixes too much custom logic into the backbone path. That makes it hard to tell whether failures are due to the trunk, geometry handling, or the adapter.
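The standardized output schema can be pinned down with a small protocol. The `TrunkInterface` / `TrunkOutput` names and fields below follow the bullet list above but are a sketch, not the final interface.

```python
# Sketch of the narrow TrunkInterface with a fixed, adapter-agnostic output
# schema. Names and fields are hypothetical.
from dataclasses import dataclass
from typing import Any, Optional, Protocol

@dataclass
class TrunkOutput:
    tokens: Any                        # latent tokens from the frozen trunk
    action: Any                        # trunk's default action / action chunk
    candidates: Optional[list] = None  # optional trunk candidate set
    geometry: Optional[Any] = None     # geometry features the adapter may use

class TrunkInterface(Protocol):
    def encode(self, obs: dict) -> TrunkOutput: ...

class DummyTrunk:
    """Stand-in trunk used only to illustrate the contract."""
    def encode(self, obs: dict) -> TrunkOutput:
        return TrunkOutput(tokens=[0.0] * 8, action="noop_chunk")

out = DummyTrunk().encode({"rgb": None})
```

Because the adapter only ever sees `TrunkOutput`, swapping CLIP for an official trunk should not require touching adapter code.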

### 4.2 `models/policy.py`

#### Changes required

Split the current policy into:

- `FoundationTrunkPolicy`
- `ElasticOcclusionAdapter`
- `AdapterWrappedPolicy`

The wrapped policy should support three modes:

- `adapter_off`
- `adapter_noop`
- `adapter_active`

The execution contract should be:

1. get trunk tokens and trunk action / trunk candidates,
2. if adapter inactive or low confidence, return trunk action,
3. otherwise rank a small candidate set using the adapter and return the selected chunk.
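The three-step contract can be sketched as a single dispatch, with stub objects standing in for the real trunk and adapter (all names are assumptions):

```python
# Sketch of the AdapterWrappedPolicy execution contract: trunk first,
# early return when the adapter is off or unsure, otherwise rerank a
# small candidate set. Names are illustrative.
def wrapped_policy_step(trunk_out, adapter, mode: str):
    if mode in ("adapter_off", "adapter_noop"):
        return trunk_out["action"]                    # step 2: passthrough
    if adapter.confidence < adapter.conf_floor:
        return trunk_out["action"]                    # step 2: low confidence
    scores = adapter.score(trunk_out["candidates"])   # step 3: rank and pick
    best = max(range(len(scores)), key=scores.__getitem__)
    return trunk_out["candidates"][best]

class StubAdapter:
    conf_floor = 0.5
    confidence = 0.9
    def score(self, candidates):
        return [0.1, 0.8, 0.3]

trunk_out = {"action": "c0", "candidates": ["c0", "c1", "c2"]}
chosen = wrapped_policy_step(trunk_out, StubAdapter(), "adapter_active")
```

Note that `c0`, the trunk action, is always in the candidate set, so the worst case of a confident-but-wrong adapter is bounded by the candidate vocabulary rather than by free-form actions.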

#### Why

This makes no-regression testing possible. Right now the current monolithic policy hides whether the trunk is still intact.

### 4.3 `models/reveal_head.py`

#### Changes required

Keep the best part of the repo, but simplify and formalize it.

- Split outputs into:
  - task-agnostic reveal variables,
  - task-specific metrics,
  - optional low-res spatial fields.
- Add masks so task-specific losses only apply when valid.
- Preserve the cloth-specific metrics. They are one of the best differentiators for the future suitcase benchmark.
- Add explicit calibration support (for example confidence outputs or logits) so the state head can be evaluated independently of policy success.

#### Why

The reveal-state head is likely the publishable core. It needs cleaner interfaces and evaluation, not more entanglement.
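A minimal sketch of the split output with task masks, so task-specific losses only apply where valid. The metric names follow the text above; the container types and constructor are assumptions.

```python
# Sketch of the reveal-state head output split into task-agnostic variables,
# task-specific metrics, and a validity mask. Names are illustrative.
from dataclasses import dataclass, field

TASK_METRICS = {
    "foliage": ["foliage_opening"],
    "bag": ["bag_mouth"],
    "cloth": ["fold_preservation", "top_layer_stability"],
}

@dataclass
class RevealState:
    # task-agnostic variables, always supervised
    visibility: float = 0.0
    access: float = 0.0
    persistence: float = 0.0
    reocclusion: float = 0.0
    disturbance: float = 0.0
    # task-specific metrics plus a validity mask for loss gating
    task_metrics: dict = field(default_factory=dict)
    task_mask: dict = field(default_factory=dict)

def make_state(task_family: str, metrics: dict) -> RevealState:
    valid = TASK_METRICS.get(task_family, [])
    return RevealState(
        task_metrics={k: metrics.get(k, 0.0) for k in valid},
        task_mask={k: True for k in valid},
    )

# a cloth sample never carries bag metrics, even if the label dict has them
s = make_state("cloth", {"fold_preservation": 0.8, "bag_mouth": 0.2})
```

The same mask dict is what `test_task_specific_loss_masking.py` (Section 9.2) would assert against.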

### 4.4 `models/action_decoder.py`

#### Changes required

Keep the current task proposal vocabulary concept, but tighten it:

- candidate 0 must always be the trunk/base action,
- proposal candidates must stay near the trunk action initially,
- proposal mode families should be internal to one model, not external checkpoint routing,
- add a generic fallback mode family for non-target tasks,
- keep explicit mode names for analysis and paper figures.

Current task families to preserve and clean up:

- foliage: `widen_gap`, `maintain_gap`, `insert_actor`, `retrieve`, etc.
- bag: `widen_mouth`, `maintain_mouth`, `probe_inside`, `insert_actor`, `retrieve`
- cloth: `lift_edge`, `separate_layer`, `stabilize_fold`, `maintain_lift`, `insert_actor`, `retrieve`

#### Why

The proposal vocabulary is useful. The current best proxy result already suggests task specialization matters. But the specialization must become a principled internal prior, not a checkpoint-routing workaround.

### 4.5 `models/planner.py`

#### Changes required

Refactor the planner into two explicit parts:

1. **hard/soft feasibility gate**
2. **residual reranker**

The gate should use reveal-state variables only. The reranker can use the lightweight transition model and proposal logits.

Also add:

- a clean `identity` planning mode,
- a clean `trunk_only` selection mode,
- an `adapter_confidence` score,
- diagnostics for every rejected retrieve-like candidate.

#### Why

The current planner appears to be one of the few useful parts of the architecture. It needs to be isolated and made measurable.
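The hard gate plus per-candidate diagnostics can be sketched as follows. Thresholds here are illustrative placeholders, not tuned repo values.

```python
# Sketch of the hard retrieve-feasibility gate over reveal-state variables
# only, returning a reason list so every rejection is diagnosable.
def retrieve_feasible(state: dict,
                      min_access: float = 0.5,
                      min_persistence: float = 0.5,
                      max_reocclusion: float = 0.3):
    """Return (allowed, reasons) for a retrieve-like candidate."""
    reasons = []
    if state.get("access", 0.0) < min_access:
        reasons.append("access_below_threshold")
    if state.get("persistence", 0.0) < min_persistence:
        reasons.append("persistence_below_threshold")
    if state.get("reocclusion", 1.0) > max_reocclusion:
        reasons.append("reocclusion_above_threshold")
    return (len(reasons) == 0), reasons

ok, why = retrieve_feasible(
    {"access": 0.9, "persistence": 0.8, "reocclusion": 0.1})
bad, why_bad = retrieve_feasible(
    {"access": 0.9, "persistence": 0.1, "reocclusion": 0.6})
```

Logging `why_bad` for every rejected retrieve-like candidate gives the gate precision/recall numbers that Section 10 asks for.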

### 4.6 `models/world_model.py`

#### Changes required

Do not keep the current full token-heavy elastic rollout model as the default research path.

Replace it with a much smaller transition model over:

- scalar reveal-state summaries,
- optionally one or two low-res fields (for example access map and support map),
- action macro / candidate metadata.

The transition model should predict:

- next visibility,
- next access corridor,
- next persistence / support,
- next reocclusion,
- next disturbance / fold metrics.

Only reintroduce a heavier spatial model if the lightweight model clearly helps.
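The I/O contract of the lightweight transition model can be sketched as below. The hand-written update rule is a placeholder for a small learned MLP; keys and macro names are illustrative.

```python
# Sketch of the lightweight transition model's contract: scalar reveal-state
# summary plus candidate/macro metadata in, next summary out. The linear
# update is a stand-in for a small learned model.
REVEAL_KEYS = ["visibility", "access", "persistence", "reocclusion", "disturbance"]

def predict_next_state(state: dict, macro: str) -> dict:
    # Placeholder dynamics: a widening/reveal macro is assumed to raise
    # visibility and access a little; other macros leave the state alone.
    delta = 0.1 if macro in ("widen_gap", "widen_mouth", "lift_edge") else 0.0
    nxt = {k: state.get(k, 0.0) for k in REVEAL_KEYS}
    nxt["visibility"] = min(1.0, nxt["visibility"] + delta)
    nxt["access"] = min(1.0, nxt["access"] + delta)
    return nxt

nxt = predict_next_state({"visibility": 0.3, "access": 0.2}, "widen_gap")
```

Keeping the state to a handful of scalars is what makes short-horizon rollouts cheap enough for the planner's reranker on one L40S.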

#### Why

The current rollout model is too expensive and too under-validated for a single-L40S research loop.

### 4.7 `models/observation_memory.py`

#### Changes required

Default behavior should be:

- disabled, or
- replaced by a tiny reveal-state cache.

If the current dual memory stays in the repo, mark it experimental. Either wire the suppression margin logic properly or remove it. Right now it looks half-finished and the current proxy evidence is not favorable.

#### Why

Memory is currently a likely liability, not a likely differentiator.

### 4.8 `train/losses.py`

#### Changes required

Reweight the training objective around what is actually learnable and measurable.

Required losses:

- action BC / trajectory loss from the trunk policy path,
- **candidate ranking loss** against oracle utility within the same candidate set,
- proposal mode classification / assignment,
- reveal-state regression/classification,
- retrieve-feasibility gate loss,
- lightweight transition-model loss,
- **no-regression distillation** from the trunk on general tasks,
- optional calibration loss for reveal-state confidence.

Losses to demote or remove unless justified by results:

- large generic memory losses,
- large token-level world-model reconstruction losses.

#### Why

The repo already points to the correct training target: close the gap to the oracle chooser on the candidate set. That is much better than adding more latent machinery.
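The reweighting above amounts to a weighted sum where the demoted terms default to zero weight. The term names mirror the lists above; the weight values are illustrative, not recommendations.

```python
# Sketch of the reweighted objective: a weighted sum over the required loss
# terms, with demoted terms defaulting to zero weight. Values illustrative.
DEFAULT_WEIGHTS = {
    "action_bc": 1.0,
    "candidate_ranking": 1.0,
    "proposal_mode": 0.5,
    "reveal_state": 0.5,
    "feasibility_gate": 0.5,
    "transition": 0.25,
    "general_distillation": 1.0,
    "calibration": 0.1,
    # demoted unless justified by results:
    "memory": 0.0,
    "token_world_model": 0.0,
}

def total_loss(terms: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    # Unknown terms get weight 0.0, so legacy losses silently drop out.
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())

loss = total_loss({"action_bc": 2.0, "candidate_ranking": 1.0, "memory": 5.0})
# the memory term contributes nothing at weight 0.0
```

Making the demoted terms zero-weight defaults (rather than deleting them) keeps the memory/world-model ablations in Section 8 one config flag away.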

### 4.9 `train/trainer.py`

#### Changes required

Add explicit training regimes:

- `trunk_only_eval`
- `adapter_noop_eval`
- `adapter_train_frozen_trunk`
- `adapter_finetune_light`
- `general_distillation_only`
- `proxy_rank_only`

Freeze the trunk by default. Any trunk finetuning should be delayed until the adapter proves itself.

Also add a single switch that controls whether evaluation is:

- trunk only,
- adapter no-op,
- adapter active,
- adapter active with planner off,
- adapter active with gate off.

#### Why

The current trainer still reflects an architecture-search phase. The next phase needs controlled, fair comparisons.

### 4.10 Dataset / teacher generation code

Code for proposal alignment and proxy data generation already exists. Reuse it, but narrow its scope.

Required changes:

- generate oracle labels and candidate utilities for proxy tasks,
- export reveal-state supervision targets explicitly,
- export candidate-mode assignments,
- export task metadata separately from free-form language,
- ensure every sample can be evaluated in:
  - trunk-only mode,
  - no-op mode,
  - adapter mode.

Do not let text strings be the only task family signal. Explicit task metadata must be available.

---

## 5. What to keep, what to remove, what to treat as provisional

### Keep

- explicit reveal-state variables,
- task-routed macro proposal vocabulary,
- retrieve-feasibility gate,
- geometry-aware observation path,
- existing proxy scripted sanity tests,
- candidate-ranking supervision.

### Remove from the default path

- heavy dual memory as a required component,
- full token-heavy rollout model,
- any claim based on checkpoint routing alone,
- any claim based on the retargeted demo positive control.

### Treat as provisional

- custom RVT wrapper,
- local RLBench general benchmark path until official baseline reproduction is clean,
- memory-related gains unless they appear in a proper task-success benchmark.

---

## 6. Benchmark strategy

The benchmark plan should be staged. Do not jump straight to a full RLBench sweep.

### Phase 0. Baseline reproduction

Goal: prove that the evaluation path is real.

Required outcome:

- at least one official public trunk reproduces a known strong score on a small anchor subset,
- one anchor task should match a public or repo-validated release closely enough to trust the pipeline.

If this fails, stop and fix evaluation before touching the adapter further.

### Phase 1. General-task anchor set

Use a small public anchor set that is broad enough to catch regressions, but small enough to run repeatedly on one L40S.

Recommended anchor tasks:

- coordinated push box,
- coordinated lift ball,
- dual push buttons,
- handover item,
- lift tray.

These are not the target application tasks. They are regression sentries.

Acceptance criterion:

- `adapter_noop` should be essentially identical to `trunk_only`,
- `adapter_active` should remain in the same ballpark as `trunk_only`,
- any loss on the anchor mean must be small and explainable.

If the trunk itself is weak on the chosen anchor set, replace the trunk. Do not proceed with a weak base.

### Phase 2. Existing proxy benchmark (internal shaping only)

Use the existing proxy suite as an architecture-shaping instrument, not as the main paper result.

Preserve the narrow stress slices from the existing handoff:

- nominal,
- high reocclusion,
- camera perturbation.

Preserve the task slices:

- foliage,
- bag,
- cloth.

Keep the simple baselines:

- random,
- candidate 0,
- oracle chooser,
- scripted good/bad actions.

What to measure beyond success:

- reveal-state prediction correlation with proxy ground truth,
- ranking correlation with oracle utility,
- gate precision/recall for unsafe retrieve attempts,
- effect of proposal families by task,
- reocclusion after reveal,
- fold-preservation metrics on cloth slices.

### Phase 3. Public target-like tasks

This is the most important new benchmark stage.

The future real benchmark does not exist yet, so approximate it with public tasks that stress:

- containment opening,
- hidden-object access,
- cluttered retrieval,
- partial reveal before retrieve,
- disturbance control.

Use a small public target-like subset first. Candidate tasks to prioritize:

- open drawer,
- put item in drawer / retrieve-like container interactions,
- take shoes out of box,
- shell game,
- pick up notebook,
- straighten rope.

The exact final subset can change if some tasks prove unstable, but the principle should stay the same: these tasks should be more target-like than the anchor set.

### Phase 4. Deformable / garment benchmarks

For the clothes/suitcase direction, add a public deformable benchmark as soon as the infrastructure is stable.

Priority order:

1. GarmentLab (if practical to run),
2. GarmentPile or similar garment-clutter retrieval benchmarks,
3. other public deformable-manipulation tasks only if they are easy to integrate.

This stage matters because the suitcase task is probably the strongest future novelty angle.

### Phase 5. Broader robustness benchmark

Only after phases 0–4 succeed, consider a broader dual-arm benchmark such as RoboTwin 2.0 or a wider RLBench/PerAct2 sweep.

Do not do this early. It is expensive and not yet the right bottleneck.

---

## 7. Baselines that must be included

At minimum, every meaningful experiment should compare against:

1. **the same trunk alone**  
   This is the most important baseline.

2. **the same trunk with adapter disabled / no-op**  
   This isolates whether the wrapper is already damaging performance.

3. **PerAct2**  
   Use official or faithful public numbers / code path.

4. **AnyBimanual**  
   Important because the repo already references it and because transfer from strong unimanual data is relevant.

5. **3DFA**, if evaluation is practical  
   This is the strongest public benchmark baseline for bimanual PerAct2-style tasks and should be the aspirational reference.

Optional if practical:

- CoFreeVLA (useful because it is also a structured auxiliary head on top of a VLA),
- ActiveVLA (conceptually relevant for active perception),
- task-specific academic comparisons in writing (Vision in Action, bag SOI model, garment retrieval papers), even if not reproduced in code.

---

## 8. Required ablations

The current repo already shows that “big architecture blob vs baseline” is not informative enough. The next paper-worthy evidence must isolate the actual source of gain.

Run the following ablations in order.

### General-task ablations

1. `trunk_only`
2. `trunk + adapter_noop`
3. `trunk + adapter_active (gate only)`
4. `trunk + adapter_active (gate + reveal-state head)`
5. `trunk + adapter_active (gate + reveal-state + proposal prior)`
6. `trunk + adapter_active (gate + reveal-state + proposal prior + lightweight transition model)`
7. optional: `+ short reveal cache`

Interpretation target:

- general tasks should not fall apart as structure is added,
- if they do, the adapter is not sufficiently no-op-safe.

### Target-like ablations

1. full adapter
2. no gate
3. no proposal prior
4. no task conditioning
5. no lightweight transition model
6. no geometry
7. no depth
8. no cloth-specific metrics (for the cloth slice only)
9. checkpoint routing only (to prove that routing alone is not the full story)

Interpretation target:

- gate should matter,
- proposal prior should matter,
- cloth-specific metrics should matter on cloth-like slices,
- routing alone should not account for the final gain.

### Memory ablations

Do these late, not early:

- no memory,
- short reveal cache,
- current dual memory.

If dual memory does not clearly beat no memory on actual task success, drop it.

---

## 9. Tests to add or rewrite

The current suite is decent for plumbing. It now needs benchmark-faithfulness tests and ablation-protecting tests.

### 9.1 Keep the current useful tests

Keep and maintain the existing tests that verify:

- proxy scripted benchmark directionality,
- geometry path activation under camera perturbation,
- dataset geometry fields,
- proposal shortlist plumbing,
- task metadata override behavior,
- candidate ranking loss behavior.

### 9.2 Add the following tests

#### `test_trunk_noop_equivalence.py`

With adapter disabled or in strict no-op mode, verify that:

- action mean / candidate set match the trunk path exactly (or within tight tolerance),
- no planner or routing side effects change outputs.

This is the single most important new test.
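The shape of the test can be sketched with stub policies; in the real test, `trunk_only` and `adapter_noop` would be the actual model paths and the comparison would use a tight numeric tolerance on tensors. All names here are assumptions.

```python
# Sketch of what test_trunk_noop_equivalence.py should assert, using stubs
# in place of the real trunk and no-op-wrapped policy paths.
def trunk_only_step(obs):
    return {"action": [0.1, 0.2], "candidates": [[0.1, 0.2], [0.3, 0.4]]}

def adapter_noop_step(obs):
    # A correct no-op wrapper returns the trunk output unchanged; the real
    # implementation must route through the wrapper, not call the trunk twice.
    return trunk_only_step(obs)

def assert_noop_equivalence(obs, tol: float = 1e-6) -> bool:
    a, b = trunk_only_step(obs), adapter_noop_step(obs)
    assert all(abs(x - y) < tol for x, y in zip(a["action"], b["action"]))
    assert a["candidates"] == b["candidates"]
    return True

passed = assert_noop_equivalence({"rgb": None})
```

Running this on several frozen observation batches, including at least one from a non-target task, also covers the "no planner or routing side effects" requirement.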

#### `test_trunk_interface_official_eval_parity.py`

For one selected official trunk and one frozen batch, verify that:

- preprocessing,
- camera handling,
- token layout,
- action decoding,

match the official implementation path closely enough to trust the wrapper.

This should be an integration test, not just a shape test.

#### `test_adapter_gate_blocks_unsafe_retrieve.py`

Build explicit synthetic reveal states where retrieve should and should not be allowed. The current planner already contains similar logic; formalize it into a direct unit test.

#### `test_reveal_state_metric_calibration.py`

For proxy env rollouts with known labels, verify that predicted reveal-state metrics correlate with the simulator labels and are not collapsed.

#### `test_candidate_ranking_matches_oracle.py`

Given a batch with oracle candidate utilities from the proxy env, verify that training reduces the gap between the model ranker and the oracle chooser.

This should be a real learned ranking test, not just a toy-array loss test.

#### `test_task_specific_loss_masking.py`

Verify that foliage metrics are not trained on bag/cloth tasks, bag metrics are not trained on foliage/cloth tasks, etc.

#### `test_cloth_specific_metrics_affect_selection.py`

For cloth-like proxy cases, verify that fold-preservation / lift-too-much risk can change candidate selection even when nominal reachability is similar.

#### `test_general_eval_protocol_is_identical.py`

Ensure that `trunk_only`, `adapter_noop`, and `adapter_active` all use the same observation stack, same action horizon, same task subset, and same evaluation step budget.

This prevents accidental unfairness.

### 9.3 Promote some current tests from “unit” to “benchmark guardrails”

The following should become part of the required CI / pre-run checklist:

- geometry path smoke test,
- dataset geometry/history test,
- no-op equivalence test,
- benchmark protocol identity test.

---

## 10. Metrics that matter

Do not rely on success alone.

### General-task metrics

- task success,
- return (if available),
- variance across seeds,
- regression relative to trunk.

### Target-like metrics

- success,
- visibility gain,
- access / insertion corridor gain,
- persistence / support gain,
- reocclusion after reveal,
- disturbance / damage,
- fold preservation (cloth-like slice),
- unsafe retrieve rate,
- oracle gap on candidate ranking.

### Calibration / diagnostics

- correlation of predicted reveal metrics with simulator ground truth,
- gate precision / recall,
- candidate shortlist recall of oracle candidate,
- proposal mode usage by task,
- fallback rate to trunk.

The fallback rate matters. If the adapter almost never activates, then the system may preserve general performance but not meaningfully help target tasks. If it always activates and hurts general tasks, it is not safe enough.

---

## 11. Acceptance gates

These gates should determine whether to continue, simplify, or stop.

### Gate A. Trunk validity

Pass only if an official or faithful trunk path achieves clearly non-trivial performance on the anchor set.

If this fails, stop. Do not spend effort on the adapter yet.

### Gate B. No-op safety

Pass only if `adapter_noop` is effectively identical to `trunk_only`.

If this fails, stop and fix the wrapper.

### Gate C. General-task parity

Pass only if `adapter_active` stays in the same ballpark as `trunk_only` on the anchor set. A small drop may be acceptable, but not a collapse.

Use a simple rule for the first pass:

- mean absolute drop on the anchor set should be very small,
- no single anchor task should collapse catastrophically.

If the adapter is helping target-like tasks but causing a broad general-task collapse, the architecture is not ready.
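The first-pass rule can be written down directly. The thresholds below are placeholders to be tuned on the anchor set, not recommended values:

```python
def gate_c(trunk_scores, adapter_scores,
           max_mean_drop=0.02, max_task_drop=0.15):
    # Pass only if the mean drop is very small AND no single
    # anchor task collapses. Thresholds are illustrative.
    drops = [t - a for t, a in zip(trunk_scores, adapter_scores)]
    mean_drop = sum(drops) / len(drops)
    worst_drop = max(drops)
    return mean_drop <= max_mean_drop and worst_drop <= max_task_drop

# A small uniform drop passes; one collapsed task fails the gate.
assert gate_c([0.8, 0.7, 0.9], [0.79, 0.69, 0.89])
assert not gate_c([0.8, 0.7, 0.9], [0.80, 0.70, 0.40])
```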

### Gate D. Target-like gain

Pass only if the full adapter clearly beats:

- trunk alone,
- adapter no-op,
- random,
- candidate 0,
- and ideally narrows the oracle gap.

This is where the architecture starts to become scientifically interesting.

### Gate E. Non-trivial novelty

Pass only if the gain is not explained almost entirely by checkpoint routing or trivial task labels. The final model should be a single structured adapter, not a routing script disguised as a model.

---

## 12. Recommended training strategy on 1×L40S

The compute constraint implies one principle: **do not retrain the trunk repeatedly**.

### Use this strategy

1. Choose one strong trunk.
2. Freeze it.
3. Build the adapter around it.
4. Run many cheap adapter experiments.
5. Only consider light trunk finetuning after the adapter is already useful.

### Practical guidelines

- mixed precision everywhere practical,
- gradient checkpointing if needed,
- keep candidate counts modest,
- keep rollout horizon short,
- keep the transition model lightweight,
- train on a narrow but representative task set,
- log every candidate-level diagnostic needed for offline analysis.
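For concreteness, the guidelines above might collapse into a run config like the following. Every key and value is illustrative, not a repo setting:

```python
# Hypothetical adapter-experiment config reflecting the guidelines:
# mixed precision, modest candidates, short rollouts, small model.
ADAPTER_RUN = {
    "precision": "bf16-mixed",
    "grad_checkpointing": True,            # enable only if memory-bound
    "n_candidates": 8,                     # modest shortlist
    "rollout_horizon": 3,                  # short reveal-state rollout
    "transition_model": {"layers": 2, "hidden": 128},  # lightweight
    "task_set": ("bag_reveal", "cloth_fold_retrieve"), # narrow, representative
    "log_candidate_diagnostics": True,     # keep everything for offline analysis
}
```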

### What not to do

- do not repeatedly launch full-scale trunk retraining,
- do not run full benchmark sweeps before anchor parity is established,
- do not expand the world model before the lightweight version proves value,
- do not hide regressions behind different seeds, different demos, or different eval protocols.

---

## 13. Minimal execution order

Follow this order. Do not reorder it casually.

### Step 1. Freeze the current repo as a historical branch

Keep it for reference, but stop treating it as the final architecture.

### Step 2. Build a clean trunk interface

Get one official trunk path working and reproducible.

### Step 3. Implement adapter no-op mode

This must pass no-op equivalence tests before any learning claims are made.
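The equivalence requirement is exact: the wrapper in no-op mode must return the trunk's action bit-for-bit, not an approximation of it. The `Trunk` and `AdapterWrapper` classes below are hypothetical stand-ins, not the repo's interfaces:

```python
import numpy as np

class Trunk:
    # Stand-in for the frozen trunk policy.
    def act(self, obs):
        return np.tanh(obs * 0.5)

class AdapterWrapper:
    def __init__(self, trunk, active=False):
        self.trunk = trunk
        self.active = active

    def act(self, obs):
        action = self.trunk.act(obs)
        if not self.active:
            return action       # no-op: pass the trunk action through
        return action + 0.1     # placeholder intervention

trunk = Trunk()
noop = AdapterWrapper(trunk, active=False)
obs = np.linspace(-1.0, 1.0, 8)
# Exact equality, not np.allclose: any drift means the wrapper
# is silently transforming observations or actions.
assert np.array_equal(trunk.act(obs), noop.act(obs))
```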

### Step 4. Port only the strong ideas

Port:

- reveal-state head,
- task-routed macro proposal prior,
- retrieve-feasibility gate.

Do **not** port the full heavy memory/world-model stack by default.

### Step 5. Add a lightweight transition model

Only over reveal-state summaries.

### Step 6. Train adapter-only on proxy supervision and ranking

Focus on oracle-gap reduction and reveal-state prediction quality.

### Step 7. Run anchor parity benchmark

If parity fails, stop and simplify.

### Step 8. Run target-like public subset and existing proxy suite

If gains appear only on the internal proxy and nowhere else, the architecture is still too benchmark-shaped.

### Step 9. Add garment/deformable benchmark

This is the most likely path to a strong suitcase/clothes result.

### Step 10. Prepare the real-world data plan only after sim evidence is strong

The real teleop benchmark should come after a strong sim go/no-go decision, not before.

---

## 14. What “novel enough” should mean here

The novelty should be modest and crisp. It does not need to be a giant new architecture.

A reasonable novelty claim is:

- a foundation-policy-compatible structured adapter,
- explicit reveal-state variables for elastic occlusion,
- task-routed reveal macros,
- retrieve-feasibility gating,
- lightweight reveal-state rollout / reranking.

This is a good paper if:

- the base trunk is respected,
- the adapter is small,
- the gains are real on the target-like tasks,
- the general-task regression is small,
- the ablations isolate the contribution cleanly.

This is **not** a good paper if the final story is:

- “we replaced the trunk,”
- “we added many modules and one of them helped a bit,”
- “we route to a better checkpoint for each task,”
- “we get non-zero on one RLBench branch because demo retrieval rescued it.”

---

## 15. Proposed paper positioning (for later)

If the system works, position it against two groups of prior work.

### General bimanual policy baselines

- PerAct2,
- AnyBimanual,
- 3D FlowMatch Actor,
- optionally CoFreeVLA as an “auxiliary structured head” comparator.

### Target-task conceptual neighbors

- active bag reveal/retrieve from demonstrations,
- active perception for manipulation under occlusion,
- bag-specific SOI latent-dynamics models,
- occlusion-aware hidden-object retrieval in clutter,
- garment clutter retrieval / garment manipulation benchmarks.

The paper should say: generic bimanual foundation policies are good at general dual-arm manipulation, but they lack explicit reveal-state structure for elastic occlusion tasks. The adapter adds that structure while preserving general capability.

---

## 16. Deliverables expected from the developer

The handoff is not complete until the following exist.

### Code deliverables

- clean trunk interface,
- adapter package,
- no-op path,
- lightweight transition model,
- benchmark scripts for anchor, proxy, and target-like subsets,
- required new tests,
- config files for all reported experiments.

### Experimental deliverables

- trunk-only anchor benchmark report,
- adapter-noop parity report,
- full ablation report,
- target-like benchmark report,
- cloth/deformable benchmark report,
- candidate ranking / oracle gap diagnostics,
- reveal-state calibration plots.

### Reporting format

Every report should include:

- exact checkpoint,
- exact demos,
- exact seeds,
- exact task subset,
- exact eval protocol,
- whether the adapter was off / noop / active,
- whether planner/gate/transition model were enabled,
- per-task scores and mean.
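A simple validator can make these fields mandatory before a report is accepted; the field names below are illustrative, not the repo's schema:

```python
# Every reported run must carry all of these (hypothetical) fields.
REQUIRED_FIELDS = (
    "checkpoint", "demos", "seeds", "task_subset", "eval_protocol",
    "adapter_mode",            # "off" | "noop" | "active"
    "planner_enabled", "gate_enabled", "transition_model_enabled",
    "per_task_scores", "mean_score",
)

def validate_report(report):
    # Reject any report that omits a required field.
    missing = [f for f in REQUIRED_FIELDS if f not in report]
    if missing:
        raise ValueError(f"report missing fields: {missing}")
    return True
```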

No undocumented “special” branches should be used for headline results.

---

## 17. Immediate next actions

1. Pick the trunk to standardize around.
2. Build and validate the no-op wrapper.
3. Strip the adapter down to:
   - reveal-state head,
   - proposal prior,
   - retrieve gate.
4. Replace the heavy world model with a lightweight reveal-state transition model.
5. Run anchor parity.
6. Run proxy ranking and target-like subset.
7. Decide whether memory is dropped permanently.
8. Add garment benchmark integration.

That is the shortest path from the current repo to a defensible paper candidate.

---

## 18. Appendix: repo evidence that motivated this handoff

Relevant repo locations to inspect while implementing:

- Main model stack:
  - `VLAarchtests/code/reveal_vla_bimanual/models/policy.py`
  - `VLAarchtests/code/reveal_vla_bimanual/models/backbones.py`
  - `VLAarchtests/code/reveal_vla_bimanual/models/reveal_head.py`
  - `VLAarchtests/code/reveal_vla_bimanual/models/action_decoder.py`
  - `VLAarchtests/code/reveal_vla_bimanual/models/planner.py`
  - `VLAarchtests/code/reveal_vla_bimanual/models/observation_memory.py`
  - `VLAarchtests/code/reveal_vla_bimanual/models/world_model.py`

- Training / losses:
  - `VLAarchtests/code/reveal_vla_bimanual/train/losses.py`
  - `VLAarchtests/code/reveal_vla_bimanual/train/trainer.py`
  - `VLAarchtests/code/reveal_vla_bimanual/train/build_aligned_proposal_dataset.py`

- Existing tests worth keeping:
  - `VLAarchtests/tests/test_proxy_scripted_bench.py`
  - `VLAarchtests/tests/test_geometry_matters_under_camera_perturbation.py`
  - `VLAarchtests/tests/test_memory_matters_under_high_reocclusion.py`
  - `VLAarchtests/tests/test_rlbench_dataset_rgbd_geometry.py`
  - `VLAarchtests/tests/test_candidate_ranking_loss.py`
  - `VLAarchtests/tests/test_rvt_backbone_forward.py`

- Existing reports that matter:
  - `VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary.md`
  - `VLAarchtests/artifacts/reports/task_routed_proxy_v1/summary.md`
  - `reports/true_baseline_compare_subset3_v1/...`
  - `reports/general_task_anchor_20260330_dual_push_buttons/...`
  - `reports/dual_push_nonzero_branch_20260330/...`
  - `reports/dual_push_full_arch_hybrid_20260331/...`

Use those reports as a diagnosis of what is weak, not as proof that the current architecture is already ready.

---

## 19. External references to keep in mind

General bimanual baselines and nearby work:

- PerAct2 benchmark and baselines: https://arxiv.org/abs/2407.00278
- AnyBimanual: https://bimanual.github.io/
- 3D FlowMatch Actor (3DFA): https://arxiv.org/abs/2508.11002
- CoFreeVLA: https://arxiv.org/abs/2601.21712
- ActiveVLA: https://arxiv.org/abs/2601.08325

Target-task conceptual neighbors:

- Vision in Action (active bag reveal/retrieve from human demonstrations): https://arxiv.org/html/2506.15666v1
- Bimanual Deformable Bag Manipulation with SOI neural dynamics: https://arxiv.org/abs/2401.11432
- Occlusion-Aware Search for Object Retrieval in Clutter: https://ieeexplore.ieee.org/document/9197067
- GarmentPile++ / cluttered garment retrieval: https://arxiv.org/abs/2603.04158
- RoboTwin 2.0 benchmark: https://arxiv.org/abs/2506.18088

Add the exact GarmentLab citation separately if that benchmark is included in the final experimental plan.

---

## Final instruction to the implementer

Do not try to rescue the current architecture by adding even more structure. The repo already revealed the answer: the good idea is narrow. Keep the structured reveal-state adapter, keep the retrieve gate, keep task-aware proposals, and force the whole design to prove two things cleanly:

1. it does not break a strong trunk on general bimanual tasks,
2. it improves reveal/retrieve under elastic occlusion.

If both are true, the project is in good shape. If either is false, simplify further rather than expanding again.