Reasoning-Trajectory Misalignment: Is an RL-aligned checkpoint planned?

#11

by dedarrow - opened Jan 26

I've been testing Alpamayo 1 extensively with AlpaSim and found that the Chain-of-Causation reasoning frequently contradicts the actual trajectory output: the reasoning says "nudge left to pass parked car," but the trajectory curves right, causing collisions.
Reproduced on:

DGX Spark (ARM64)
4x H100 (x86)
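
For anyone who wants to flag this kind of mismatch automatically, here is a minimal sketch of a consistency check. It is not part of Alpamayo or AlpaSim: the function names are hypothetical, and it assumes the trajectory is available as ego-frame waypoints (x forward, y positive to the left, in metres), which may not match the actual output format. The keyword match is deliberately crude and only looks for the first "left" or "right" in the reasoning text.

```python
import re

def stated_direction(reasoning):
    """Return 'left', 'right', or None from the first lateral keyword in the text."""
    match = re.search(r"\b(left|right)\b", reasoning.lower())
    return match.group(1) if match else None

def trajectory_direction(waypoints, threshold=0.2):
    """Classify lateral motion from the largest-magnitude lateral offset.

    Assumption: waypoints are (x, y) pairs in an ego frame with y positive
    to the left. Offsets within +/- threshold metres count as straight (None).
    """
    peak_y = max((point[1] for point in waypoints), key=abs)
    if peak_y > threshold:
        return "left"
    if peak_y < -threshold:
        return "right"
    return None

def is_consistent(reasoning, waypoints):
    """Report False only when both directions are known and they differ."""
    said = stated_direction(reasoning)
    did = trajectory_direction(waypoints)
    return said is None or did is None or said == did

# Hypothetical example mirroring the failure described above:
# the text says "left", but the waypoints drift to negative y (right).
reasoning = "nudge left to pass parked car"
waypoints = [(0.0, 0.0), (5.0, -0.5), (10.0, -1.0)]
print(is_consistent(reasoning, waypoints))
```

Running a check like this over a batch of rollouts would give a rough mismatch rate, though the threshold and the sign convention would need to be adapted to the real trajectory format.
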

The GitHub FAQ confirms this release is SFT-only, without RL post-training. Per the paper (arXiv:2511.00088), RL post-training improves "reasoning-action consistency by 37%," which appears to be exactly what's missing.
Questions:

Is this expected behavior for the SFT-only release?
Is there a timeline for releasing the RL-aligned checkpoint?

Detailed findings with video evidence:
https://github.com/NVlabs/alpasim/issues/20
https://github.com/NVlabs/alpamayo/issues/38

polluxxx

NVIDIA org Feb 5

Thanks again for trying the model and for the question! For documentation purposes, in case others land here with the same question, please see this response from our team: https://github.com/NVlabs/alpamayo/issues/38#issuecomment-3807833184

BorisIvanovic

NVIDIA org Apr 16

Closing as https://huggingface.co/nvidia/Alpamayo-1.5-10B addresses this. Thanks!

BorisIvanovic changed discussion status to closed Apr 16
