Alpamayo R1 + Waymo Fine-tuning: 3-recipe ablation

Three SFT recipes fine-tuned from the Alpamayo R1 base model, each going through Stage 1 (VLM fine-tune, token CE) → Stage 2 (diffusion expert SFT, flow matching). All runs use 3 epochs, 8× H200, DeepSpeed ZeRO-2.

All recipes use the same camera scheme aligned to NVIDIA's 4-cam slots [0, 1, 2, 6] = (cross_left, front_wide, cross_right, front_tele). Waymo data is re-mapped to these slots; front_tele is emulated by center-cropping Waymo FRONT (~30°/50.4° FOV ≈ 0.595 crop ratio).
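The tele emulation above can be sketched as follows. This is a minimal illustration, assuming the crop ratio is the plain ratio of the two field-of-view angles as stated (30 / 50.4) and that frames are H×W×C NumPy arrays. The helper name `center_crop` and the 1920×1280 dummy frame are placeholders, not details taken from the training pipeline.

```python
import numpy as np

TELE_FOV_DEG = 30.0          # target front_tele FOV (from the text above)
WAYMO_FRONT_FOV_DEG = 50.4   # Waymo FRONT FOV (from the text above)


def center_crop(image: np.ndarray, ratio: float) -> np.ndarray:
    """Keep the central `ratio` fraction of an H x W x C image along both axes."""
    h, w = image.shape[:2]
    crop_h, crop_w = round(h * ratio), round(w * ratio)
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    return image[top:top + crop_h, left:left + crop_w]


crop_ratio = TELE_FOV_DEG / WAYMO_FRONT_FOV_DEG   # about 0.595
frame = np.zeros((1280, 1920, 3), dtype=np.uint8)  # dummy frame, assumed size
tele = center_crop(frame, crop_ratio)
print(crop_ratio, tele.shape)
```

The cropped frame would still need resizing to whatever resolution the model expects for slot 6; that step is not described here and is left out.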

Recipes

| Recipe | NV component | LT component | Waymo component |
| --- | --- | --- | --- |
| R-A | 10k random | 1450 GT (real) | Waymo GT (real, 798 clips) |
| R-B | 10k random | 1450 Ours-synth | Waymo Ours-synth (Ray-WAN, 800 clips) |
| R-C | 10k random | 1450 GT (real) | Waymo Ours-synth (Ray-WAN, 800 clips) |

Final losses (3 epochs each)

| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) |
| --- | --- | --- |
| R-A | 1.277 | 0.385 |
| R-B | 1.281 | 0.403 |
| R-C | 1.279 | 0.400 |

Files

Per recipe (e.g. R-A_s2/):

  • model-*-of-*.safetensors: model weights (~21 GB total)
  • model.safetensors.index.json: shard index
  • config.json: Alpamayo R1 model config
  • trainer_state.json: full training history (loss curve, lr schedule)
  • training_args.bin, scheduler.pt: trainer state

Optimizer states are NOT included; they are only needed to resume training.
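To inspect the loss curve stored in trainer_state.json, something like the sketch below may help. It assumes the file follows the usual Hugging Face Trainer layout, where logged metrics live in a `log_history` list and training-log entries carry `step` and `loss` keys. The function names and the example values are illustrative only.

```python
import json
from pathlib import Path


def loss_curve(state: dict) -> list[tuple[int, float]]:
    """Return (step, loss) pairs from a parsed trainer_state.json."""
    # Entries without a "loss" key (eval or end-of-run summaries) are skipped.
    return [
        (entry["step"], entry["loss"])
        for entry in state.get("log_history", [])
        if "loss" in entry and "step" in entry
    ]


def load_loss_curve(path: str) -> list[tuple[int, float]]:
    """Read trainer_state.json from disk and extract its loss curve."""
    return loss_curve(json.loads(Path(path).read_text()))


# Tiny in-memory example with made-up values (not real training logs).
example = {
    "log_history": [
        {"step": 10, "loss": 2.0, "learning_rate": 1e-5},
        {"step": 20, "loss": 1.5, "learning_rate": 5e-6},
        {"step": 20, "train_runtime": 12.3},
    ]
}
print(loss_curve(example))
```

For a downloaded checkpoint, `load_loss_curve("R-A_s2/trainer_state.json")` would be the corresponding call.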

Loading

from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1
model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-A_s2")
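Since each recipe folder is about 21 GB, it can be useful to fetch a single subfolder ahead of time. The sketch below uses `snapshot_download` from `huggingface_hub` with `allow_patterns`; the helper names `recipe_patterns` and `fetch_recipe` are hypothetical, and whether `AlpamayoR1.from_pretrained` accepts the resulting local path together with `subfolder` is an assumption that has not been verified here.

```python
def recipe_patterns(subfolder: str) -> list[str]:
    """Glob patterns matching only the files of one recipe subfolder."""
    return [f"{subfolder.rstrip('/')}/*"]


def fetch_recipe(repo_id: str, subfolder: str) -> str:
    """Download one recipe subfolder and return the local snapshot path."""
    # Imported lazily so recipe_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=recipe_patterns(subfolder),
    )


# Example (downloads roughly 21 GB, so it is left commented out):
# local_dir = fetch_recipe("luuuulinnnn/alpamyo_Waymo", "R-A_s2")
```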

Training data sources

  • 10k random NV: subset of nominal NVIDIA PAI dataset (4-cam standard).
  • 1450 GT: out-of-distribution train bucket from the 10k+OOD subset.
  • 1450 Ours-synth: 603 of the 1450 NV clips replaced by Ray-WAN synthesised views.
  • Waymo GT: 798 Waymo training clips, FRONT_LEFT/FRONT/FRONT_RIGHT resampled to NV slots [0, 1, 2], with FRONT center-cropped to slot 6 (tele).
  • Waymo Ours-synth: 800 clips re-encoded through Ray-WAN-on-Waymo (waymo_p65v1_arr_185143), trajectory poses from the original GT clips.
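The Waymo-to-NV slot re-mapping described in this list can be written down as a small lookup table. This is only a sketch of the mapping as stated in this card; the name `SLOT_MAP` and the tuple layout are illustrative, not taken from the data pipeline.

```python
# NV slot index -> (NV slot name, Waymo source camera, center-crop ratio or None)
SLOT_MAP = {
    0: ("cross_left", "FRONT_LEFT", None),
    1: ("front_wide", "FRONT", None),
    2: ("cross_right", "FRONT_RIGHT", None),
    6: ("front_tele", "FRONT", 30.0 / 50.4),  # tele emulated by cropping FRONT
}


def waymo_sources(slots: list[int]) -> list[str]:
    """Return the Waymo camera that feeds each requested NV slot."""
    return [SLOT_MAP[slot][1] for slot in slots]


print(waymo_sources([0, 1, 2, 6]))
```

Note that FRONT feeds two slots: once at full field of view (slot 1) and once center-cropped (slot 6).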

v2 ckpts: Waymo regen (Variant A + Qwen3 per-clip caption)

Re-ran the Waymo synth pipeline with two improvements (waymo_p65v2_va_190308):

  • Variant A: source-view K = original Waymo intrinsics (target views still use PAI K). This fixes the camera-pitch artifact that made the v1 synth look tilted down.
  • Qwen3-VL-8B-Instruct per-clip caption: each clip gets its own caption from Qwen3-VL rather than borrowing a PAI prompt that mentions unrelated objects.

R-B and R-C were then re-trained on this new synth data (3 epochs each, identical schedule).

| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) | Subfolder |
| --- | --- | --- | --- |
| R-B v2 | 1.282 | 0.402 | R-B_v2_s2/ |
| R-C v2 | 1.280 | 0.398 | R-C_v2_s2/ |
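For a quick comparison against the GT-only baseline, the final Stage 2 losses reported in this card can be expressed as relative gaps to R-A. The snippet below is plain arithmetic on those reported numbers; `rel_gap` is an illustrative helper, and final training losses alone say nothing about downstream driving quality.

```python
# Final Stage 2 (flow matching) losses as reported in this card.
STAGE2 = {
    "R-A": 0.385,
    "R-B": 0.403,
    "R-C": 0.400,
    "R-B v2": 0.402,
    "R-C v2": 0.398,
}


def rel_gap(loss: float, ref: float) -> float:
    """Relative difference of `loss` to `ref`, in percent."""
    return 100.0 * (loss - ref) / ref


for name, loss in STAGE2.items():
    print(f"{name}: {rel_gap(loss, STAGE2['R-A']):+.1f}% vs R-A")
```

On these numbers, every synth recipe (v1 and v2) ends within about 5% of R-A on Stage 2.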

Loading v2:

model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-B_v2_s2")

R-A is unchanged (GT-only, no synth dependency).
