EmptyDrum/stack-cubes-rlt-token-v1

RL-token autoencoder (RLT Phase 1) for the stack-cubes task. A learned <rl> token plus a transformer encoder/decoder is trained to reconstruct the prefix hidden states of a frozen pi05 policy (johannesmichalke/stack-cubes-pi05-v1 @ step 030000), so that the encoder distills the VLA's internal state into a single vector z_rl.

Trainable: encoder + decoder + <rl> token (~409M params)
Frozen: the pi05 backbone (not included here; reload it from the policy repo)
Demos: johannesmichalke/stack-cubes-split
Objective: masked reconstruction MSE on the prefix embeddings
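
The objective above can be sketched as follows. This is a minimal NumPy illustration, not the training code. It assumes the mask is a boolean array marking which prefix positions contribute to the loss (for example, non-padding tokens), which this card does not spell out. The function name `masked_mse` and its argument names are hypothetical.

```python
import numpy as np

def masked_mse(recon, target, mask):
    """Mean squared error averaged over positions where mask is True.

    recon, target: float arrays of shape (batch, seq_len, dim)
    mask:          bool array of shape (batch, seq_len); True = position counts
    """
    recon = np.asarray(recon, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)

    # Squared error per feature, then averaged over the feature dimension.
    per_token = ((recon - target) ** 2).mean(axis=-1)  # (batch, seq_len)

    n_valid = mask.sum()
    if n_valid == 0:
        return 0.0  # nothing to reconstruct in this batch
    # Masked-out positions are multiplied by zero and do not affect the mean.
    return float((per_token * mask).sum() / n_valid)
```

Normalizing by the number of unmasked positions, rather than by the full sequence length, keeps the loss comparable across batches with different amounts of masking.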

Layout

Each NNNNNN/token_ae.pt is a checkpoint saved at training step NNNNNN (trainable_state_dict = encoder/decoder/<rl> weights only). A sibling config.json in the same directory describes the token config and the training config.
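
A local copy of this layout could be enumerated roughly as below. This is a sketch using only the standard library. It assumes step folders are named with exactly six digits (matching the "030000" style used above), and `list_checkpoints` is a hypothetical helper, not part of any released code.

```python
import json
import re
from pathlib import Path

STEP_DIR = re.compile(r"^\d{6}$")  # assumed NNNNNN naming of step folders

def list_checkpoints(repo_root):
    """Return [(step, weights_path, config_dict), ...] sorted by step."""
    found = []
    for step_dir in Path(repo_root).iterdir():
        if not (step_dir.is_dir() and STEP_DIR.match(step_dir.name)):
            continue  # skip files and folders that are not step folders
        weights = step_dir / "token_ae.pt"
        config = step_dir / "config.json"
        if weights.is_file() and config.is_file():
            found.append(
                (int(step_dir.name), weights, json.loads(config.read_text()))
            )
    return sorted(found, key=lambda item: item[0])
```

With PyTorch installed, each `weights_path` could then be read with `torch.load(weights_path, map_location="cpu")`. Whether `trainable_state_dict` is a key inside the loaded object is an assumption to check against the file itself.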

Diagnostics (calibrated)

What holds up:

Task phase / time is encoded clearly: a t-SNE of z_rl shows a smooth start→end gradient, and reconstruction MSE is low. The AE is trained to reconstruct the prefix embeddings, so it captures whatever dominates their variance (visual/task state).
Occlusion saliency shows z_rl reads from the cube / manipulation region.
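
For readers unfamiliar with occlusion saliency, the idea can be sketched as follows: blank out one image patch at a time and measure how far the latent moves. This is a generic illustration, not the procedure used for this card. The patch size, fill value, and L2 distance are assumptions, and `encode` is a hypothetical stand-in for whatever pipeline maps an image to z_rl.

```python
import numpy as np

def occlusion_saliency(image, encode, patch=8, fill=0.0):
    """Heatmap of how far the latent moves when each patch is occluded.

    image:  float array of shape (H, W, C)
    encode: callable mapping an image to a 1-D latent vector (hypothetical)
    """
    image = np.asarray(image, dtype=np.float64)
    base = np.asarray(encode(image), dtype=np.float64)

    h, w = image.shape[:2]
    rows, cols = -(-h // patch), -(-w // patch)  # ceiling division
    heat = np.zeros((rows, cols))

    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            occluded[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = fill
            z = np.asarray(encode(occluded), dtype=np.float64)
            heat[i, j] = np.linalg.norm(z - base)  # L2 shift of the latent
    return heat
```

Patches with large values are the regions the latent depends on most under this measure.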

What does NOT hold up (do not overclaim):

z_rl is a weak outcome predictor. A frame-level linear probe for success/failure reaches only ~0.66 AUC (~0.70 accuracy, below the 0.75 majority baseline). Outcome is only weakly decodable, and only late: decodability rises over the course of an episode. Success and failure z_rl trajectories are not visibly smoother or cleaner than each other, contrary to the original RL-token blog's claim on its task. This is expected: the objective is reconstruction, not outcome prediction.
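
The two reference quantities in the probe result, AUC and the majority baseline, can be computed as in the sketch below. It is an illustration only: the helper names are hypothetical, and it does not reproduce the probe or the data behind the numbers above. The pairwise AUC is quadratic in memory, so it suits small evaluation sets.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as P(random positive scores above random negative); ties count 0.5."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one example of each class")
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def majority_baseline(labels):
    """Accuracy of always predicting the more frequent class."""
    p = np.asarray(labels, dtype=bool).mean()
    return float(max(p, 1.0 - p))
```

A probe whose accuracy sits below `majority_baseline(labels)` does worse than always predicting the majority class, which is why accuracy alone is not reported here and AUC is given alongside it.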

Trained with the cached-prefix + bf16 path in Server-YAM.

