EmptyDrum/stack-cubes-rlt-token-v1
RL-token autoencoder (RLT Phase 1) for the stack-cubes task. A learned
<rl> token plus a transformer encoder/decoder are trained to reconstruct the
prefix hidden states of a frozen pi05 policy
(johannesmichalke/stack-cubes-pi05-v1 @ step 030000), so that the encoder
distills the VLA's internal state into a single vector z_rl.
- Trainable: encoder + decoder + <rl> token (~409M params)
- Frozen: the pi05 backbone (not included here; reload from the policy repo)
- Demos: johannesmichalke/stack-cubes-split
- Objective: masked reconstruction MSE on the prefix embeddings
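
As a rough illustration of the objective, here is a minimal sketch of a masked reconstruction MSE. It assumes the mask marks which prefix positions are valid (for example, non-padding tokens); this card does not spell out the mask's meaning, so treat that as an assumption. The function name `masked_mse` and the toy values are hypothetical, and plain Python lists stand in for the batched tensors the training code presumably uses.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over the feature values of positions where mask is True.

    pred, target: lists of per-position feature vectors (same shape).
    mask: one boolean per position; False positions are ignored entirely.
    Assumption: the mask selects valid prefix positions (e.g. non-padding).
    """
    total = 0.0
    count = 0
    for pred_row, target_row, keep in zip(pred, target, mask):
        if not keep:
            continue
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("mask selects no positions")
    return total / count


# Toy example: three prefix positions with two features each.
target = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
pred = [[1.0, 1.0], [3.0, 6.0], [9.0, 9.0]]
mask = [True, True, False]  # the last position is masked out

loss = masked_mse(pred, target, mask)
print(loss)
```

The large error at the masked-out position does not affect the loss, which is the point of masking: only selected prefix positions contribute to the reconstruction target.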
Layout
Each NNNNNN/token_ae.pt is a checkpoint at that training step
(trainable_state_dict = encoder/decoder/<rl> weights only) with a sibling
config.json describing the token config + training config.
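
The layout above can be walked with a short script. This is a sketch under stated assumptions: the repo has been downloaded to a local directory, step folders have all-digit names, and `trainable_state_dict` is a top-level key of the saved object. The helper names `list_checkpoints` and `load_checkpoint` are hypothetical and not part of this repo.

```python
import json
from pathlib import Path


def list_checkpoints(root):
    """Return (step, checkpoint_path, config_path) tuples sorted by step."""
    found = []
    for step_dir in Path(root).iterdir():
        ckpt = step_dir / "token_ae.pt"
        cfg = step_dir / "config.json"
        # Keep only NNNNNN/ folders that hold both expected files.
        if (step_dir.is_dir() and step_dir.name.isdigit()
                and ckpt.is_file() and cfg.is_file()):
            found.append((int(step_dir.name), ckpt, cfg))
    return sorted(found, key=lambda item: item[0])


def load_checkpoint(ckpt_path, cfg_path):
    """Load the trainable weights and the sibling config for one step."""
    import torch  # imported lazily so listing works without PyTorch

    config = json.loads(Path(cfg_path).read_text())
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    # Assumption: "trainable_state_dict" is a top-level key of the saved dict.
    return checkpoint["trainable_state_dict"], config
```

Because the checkpoint holds only the encoder, decoder, and <rl> token weights, the frozen pi05 backbone still has to be loaded separately from the policy repo before running inference.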
Diagnostics (calibrated)
What holds up:
- Task phase / time is encoded clearly — t-SNE of z_rl shows a smooth start→end gradient, and reconstruction MSE is low. The AE is trained to reconstruct the prefix embeddings, so it captures whatever dominates their variance (visual/task state).
- Occlusion saliency shows z_rl reads from the cube / manipulation region.
What does NOT hold up (do not overclaim):
- z_rl is a weak outcome predictor. A frame-level linear probe for success/failure reaches only 0.66 AUC, and its 0.70 accuracy is below the 0.75 majority-class baseline. Outcome is only weakly and late decodable (decodability rises over an episode), and success and failure z_rl trajectories are not visibly smoother or cleaner than each other, contrary to the original RL-token blog's claim on its task. This is expected: the objective is reconstruction, not outcome prediction.
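
To make the probe numbers concrete, the sketch below computes the three quantities being compared: ROC AUC, accuracy at a threshold, and the majority-class baseline. The scores and labels are made-up toy values, not data from this model, and the function names are hypothetical. The pairwise AUC is quadratic in the number of frames, which is fine for illustration only.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Labels are 1 (positive) or 0 (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold=0.5):
    """Fraction of frames whose thresholded score matches the label."""
    hits = sum(int(s >= threshold) == y for s, y in zip(scores, labels))
    return hits / len(labels)


def majority_baseline(labels):
    """Accuracy of always predicting the more common class."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


# Toy probe outputs for four frames (hypothetical values).
labels = [1, 1, 1, 0]
scores = [0.9, 0.8, 0.3, 0.6]

auc = roc_auc(scores, labels)
acc = accuracy(scores, labels)
baseline = majority_baseline(labels)
print(auc, acc, baseline)
```

A probe whose accuracy falls below the majority baseline, as in this toy case and in the reported 0.70 vs 0.75, is doing worse than a constant prediction on accuracy, even when its AUC is above chance.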
Trained with the cached-prefix + bf16 path in Server-YAM.