Reasoning-Trajectory Misalignment: Is an RL-aligned checkpoint planned?

#11
by dedarrow - opened

I've been extensively testing Alpamayo 1 with AlpaSim and found that the Chain-of-Causation reasoning frequently contradicts the actual trajectory output: reasoning says "nudge left to pass parked car," but the trajectory curves right, causing collisions.
Reproduced on:

  - DGX Spark (ARM64)
  - 4x H100 (x86)
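
For anyone who wants to flag this kind of mismatch automatically, here is a minimal sketch of a consistency check. It is not part of Alpamayo or AlpaSim. The function names, the 0.3 m threshold, and the ego-frame convention (x forward, y positive to the left) are all assumptions for illustration, and the model's real output format may differ.

```python
import re
from typing import Optional, Sequence, Tuple

Waypoint = Tuple[float, float]  # (x, y) in metres; ASSUMED ego frame: x forward, y positive to the left


def stated_direction(reasoning: str) -> Optional[str]:
    """Return 'left' or 'right' if the reasoning text names exactly one lateral direction."""
    found = set(re.findall(r"\b(left|right)\b", reasoning.lower()))
    return found.pop() if len(found) == 1 else None


def trajectory_direction(waypoints: Sequence[Waypoint], threshold_m: float = 0.3) -> Optional[str]:
    """Classify the peak lateral deviation from the first waypoint.

    The peak (not the final offset) is used so that a nudge that returns
    to the lane centre is still detected. threshold_m is a hypothetical value.
    """
    y0 = waypoints[0][1]
    peak = max((y - y0 for _, y in waypoints), key=abs)
    if peak > threshold_m:
        return "left"
    if peak < -threshold_m:
        return "right"
    return None  # effectively straight


def is_misaligned(reasoning: str, waypoints: Sequence[Waypoint]) -> bool:
    """True only when both directions are unambiguous and they disagree."""
    said = stated_direction(reasoning)
    did = trajectory_direction(waypoints)
    return said is not None and did is not None and said != did


# Hypothetical waypoints that curve right while the reasoning says "left".
curving_right = [(0.0, 0.0), (5.0, -0.4), (10.0, -0.9), (15.0, -0.2)]
print(is_misaligned("nudge left to pass parked car", curving_right))
```

A keyword match like this only catches the obvious left/right contradictions. Reasoning that mentions both directions, or neither, is skipped rather than guessed at.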

The GitHub FAQ confirms this release is SFT-only without RL post-training. Per the paper (arXiv:2511.00088), RL post-training improves "reasoning-action consistency by 37%" β€” which appears to be exactly what's missing.
Questions:

  1. Is this expected behavior for the SFT-only release?
  2. Is there a timeline for releasing the RL-aligned checkpoint?

Detailed findings with video evidence:
https://github.com/NVlabs/alpasim/issues/20
https://github.com/NVlabs/alpamayo/issues/38