Alpamayo R1 + Waymo Fine-tuning — 3-recipe ablation

Three SFT recipes fine-tuned from the Alpamayo R1 base model. Each recipe goes through Stage 1 (VLM fine-tune, token cross-entropy) → Stage 2 (diffusion expert SFT, flow matching). Every run trains for 3 epochs on 8× H200 with DeepSpeed ZeRO-2.
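For readers unfamiliar with the Stage 2 objective, here is a minimal sketch of a flow-matching loss for a single sample. It assumes the common linear-interpolation formulation (noise at t = 0, data at t = 1); the exact path, time sampling and loss weighting used by Alpamayo R1 are not stated in this card, and `flow_matching_loss` and `predict_velocity` are hypothetical names.

```python
import random

def flow_matching_loss(x1, predict_velocity, rng=random):
    """Mean squared flow-matching error for one target vector x1.

    Assumption (illustrative only): linear path x_t = (1 - t) * x0 + t * x1
    from Gaussian noise x0 to data x1, whose target velocity is x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]           # noise sample
    t = rng.random()                                  # time in [0, 1)
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]          # velocity along the path
    pred = predict_velocity(xt, t)                    # model output (hypothetical)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)
```

Under this convention a model that predicts the exact path velocity reaches a loss of zero, so the Stage 2 numbers reported below can be read as a mean squared velocity error, subject to the assumptions above.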

All recipes use the same camera scheme, aligned to NVIDIA's 4-cam slots [0, 1, 2, 6] = (cross_left, front_wide, cross_right, front_tele). Waymo data is re-mapped to these slots; front_tele is emulated by center-cropping Waymo FRONT (target FOV ~30° over source FOV 50.4° ≈ 0.595 crop ratio).
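The slot mapping and the crop ratio quoted above can be sketched as follows. This is an illustration, not the pipeline's code: the 1920×1280 image size, the rounding rule and all function names are assumptions. The `pinhole=True` branch is included only for comparison, since under an ideal pinhole model the ratio would follow the tangents of the half-angles rather than the angles themselves.

```python
import math

# NVIDIA 4-cam slots used by every recipe (from the card).
SLOT_NAMES = {0: "cross_left", 1: "front_wide", 2: "cross_right", 6: "front_tele"}
# Waymo cameras mapped to slots; slot 6 is a center crop of FRONT.
WAYMO_TO_SLOT = {"FRONT_LEFT": 0, "FRONT": 1, "FRONT_RIGHT": 2}

def fov_crop_ratio(target_fov_deg, source_fov_deg, pinhole=False):
    """Fraction of the source image side to keep when emulating a narrower FOV."""
    if pinhole:
        # Ideal pinhole geometry: image extent scales with tan(FOV / 2).
        return (math.tan(math.radians(target_fov_deg) / 2)
                / math.tan(math.radians(source_fov_deg) / 2))
    # Plain ratio of angles, which reproduces the ~0.595 figure in the card.
    return target_fov_deg / source_fov_deg

def center_crop_box(width, height, ratio):
    """Return (left, top, right, bottom) of a centered crop of the given ratio."""
    crop_w, crop_h = round(width * ratio), round(height * ratio)
    left, top = (width - crop_w) // 2, (height - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)

ratio = fov_crop_ratio(30.0, 50.4)
box = center_crop_box(1920, 1280, ratio)  # assumed Waymo FRONT resolution
print(f"ratio={ratio:.3f}, crop box={box}")
```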

Recipes

Recipe	NV component	LT component	Waymo component
R-A	10k random	1450 GT (real)	Waymo GT (real, 798 clips)
R-B	10k random	1450 Ours-synth	Waymo Ours-synth (Ray-WAN, 800 clips)
R-C	10k random	1450 GT (real)	Waymo Ours-synth (Ray-WAN, 800 clips)

Final losses (3 epochs each)

Recipe	Stage 1 (token CE)	Stage 2 (flow matching)
R-A	1.277	0.385
R-B	1.281	0.403
R-C	1.279	0.400
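To put the gaps in the table above in relative terms, the short sketch below computes each recipe's final loss as a percentage difference from R-A. These are final training losses on different data mixtures, so the percentages describe the size of the gap rather than a ranking of the recipes. `relative_gap_pct` is a hypothetical helper.

```python
# Final losses copied from the table above.
FINAL_LOSS = {
    "R-A": {"stage1": 1.277, "stage2": 0.385},
    "R-B": {"stage1": 1.281, "stage2": 0.403},
    "R-C": {"stage1": 1.279, "stage2": 0.400},
}

def relative_gap_pct(recipe, stage, baseline="R-A"):
    """Percentage difference of a recipe's final loss from the baseline recipe."""
    base = FINAL_LOSS[baseline][stage]
    return 100.0 * (FINAL_LOSS[recipe][stage] - base) / base

for recipe in ("R-B", "R-C"):
    for stage in ("stage1", "stage2"):
        print(f"{recipe} {stage}: {relative_gap_pct(recipe, stage):+.2f}% vs R-A")
```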

Files

Per recipe (e.g. R-A_s2/):

model-*-of-*.safetensors: model weights (~21 GB total)
model.safetensors.index.json: shard index
config.json: Alpamayo R1 model config
trainer_state.json: full training history (loss curve, lr schedule)
training_args.bin, scheduler.pt: trainer state

Optimizer states are NOT included (only useful for training resume).
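The loss curve stored in trainer_state.json can be read back with a few lines of Python. This sketch assumes the file follows the usual Hugging Face Trainer layout, where logging events are appended to a `log_history` list; `load_loss_curve` is a hypothetical helper.

```python
import json
from pathlib import Path

def load_loss_curve(trainer_state_path):
    """Return (step, loss) pairs from a Hugging Face Trainer state file."""
    state = json.loads(Path(trainer_state_path).read_text())
    # Only training-log entries carry a "loss" key; evaluation and
    # end-of-training summary entries are skipped.
    return [
        (entry["step"], entry["loss"])
        for entry in state.get("log_history", [])
        if "loss" in entry
    ]

# Example (path is illustrative):
# curve = load_loss_curve("R-A_s2/trainer_state.json")
```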

Loading

from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1
model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-A_s2")
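
Since each recipe is about 21 GB, it may be convenient to fetch a single subfolder ahead of time instead of the whole repository. A minimal sketch using `huggingface_hub.snapshot_download` with `allow_patterns` follows; `subfolder_patterns` and `download_recipe` are hypothetical helpers, and whether `AlpamayoR1.from_pretrained` accepts the resulting local path is an assumption based on the usual Transformers behaviour.

```python
def subfolder_patterns(subfolder):
    """Glob patterns that match only the files of one recipe subfolder."""
    return [f"{subfolder}/*"]

def download_recipe(subfolder, repo_id="luuuulinnnn/alpamyo_Waymo"):
    """Download one recipe subfolder and return the local snapshot directory."""
    # Imported lazily so the pattern helper works without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=subfolder_patterns(subfolder),
    )

# Example (requires network access):
# local_dir = download_recipe("R-A_s2")
```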

Training data sources

10k random NV: subset of nominal NVIDIA PAI dataset (4-cam standard).
1450 GT: out-of-distribution train bucket from the 10k+OOD subset.
1450 Ours-synth: 603 of the 1450 NV clips replaced by Ray-WAN synthesised views.
Waymo GT: 798 Waymo training clips, FRONT_LEFT/FRONT/FRONT_RIGHT resampled to NV slots [0, 1, 2], with FRONT center-cropped to slot 6 (tele).
Waymo Ours-synth: 800 clips re-encoded through Ray-WAN-on-Waymo (waymo_p65v1_arr_185143), with trajectory poses taken from the original GT clips.
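
The clip counts above imply the following mixture sizes per recipe. The sketch assumes that "10k" means exactly 10,000 clips and that the three components are simply concatenated without resampling; neither is stated explicitly in this card.

```python
# (total clips, synthesised clips) per component, taken from the lists above.
RECIPES = {
    "R-A": {"nv_random": (10_000, 0), "lt": (1450, 0),   "waymo": (798, 0)},
    "R-B": {"nv_random": (10_000, 0), "lt": (1450, 603), "waymo": (800, 800)},
    "R-C": {"nv_random": (10_000, 0), "lt": (1450, 0),   "waymo": (800, 800)},
}

def mixture_summary(recipe):
    """Return (total clips, synthesised clips, synthesised share) for a recipe."""
    parts = RECIPES[recipe].values()
    total = sum(n for n, _ in parts)
    synth = sum(s for _, s in parts)
    return total, synth, synth / total

for name in RECIPES:
    total, synth, share = mixture_summary(name)
    print(f"{name}: {total} clips, {synth} synthesised ({share:.1%})")
```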

v2 checkpoints: Waymo regeneration (Variant A + Qwen3 per-clip captions)

Re-ran the Waymo synth pipeline with two improvements (waymo_p65v2_va_190308):

Variant A: the source view uses the original Waymo intrinsics K (target views still use the PAI K). This fixes the camera-pitch artifact that made the v1 synth look tilted down.
Qwen3-VL-8B-Instruct per-clip caption: each clip gets its own caption from Qwen3-VL rather than borrowing a PAI prompt that mentions unrelated objects.
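
As a rough illustration of why mismatched source-view intrinsics can read as a pitch change, the sketch below unprojects the true optical-axis row with a different K under a pinhole model. All numbers are hypothetical and are not the Waymo or PAI intrinsics; the card does not say which entries of K caused the v1 artifact, so this shows only one possible mechanism (a principal-point offset).

```python
import math

def vertical_ray_angle_deg(v, fy, cy):
    """Angle in degrees between the optical axis and the ray through image row v."""
    return math.degrees(math.atan2(v - cy, fy))

def apparent_pitch_offset_deg(true_cy, assumed_fy, assumed_cy):
    """Angle wrongly assigned to the true optical-axis row under the assumed K."""
    return vertical_ray_angle_deg(true_cy, assumed_fy, assumed_cy)

# Hypothetical values: principal point off by 100 px with fy = 1000 px.
offset = apparent_pitch_offset_deg(true_cy=640.0, assumed_fy=1000.0, assumed_cy=540.0)
print(f"apparent pitch offset: {offset:.2f} degrees")
```

With matching intrinsics the offset is zero, which is consistent with the Variant A change of giving the source view its own K.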

Then re-trained R-B and R-C on this new synth (3 epochs each, identical schedule).

Recipe	Stage 1 (token CE)	Stage 2 (flow matching)	Subfolder
R-B v2	1.282	0.402	`R-B_v2_s2/`
R-C v2	1.280	0.398	`R-C_v2_s2/`

Loading v2:

model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-B_v2_s2")

R-A is unchanged (GT-only — no synth dependency).
