Alpamayo R1 + Waymo Fine-tuning: 3-recipe ablation
Three SFT recipes fine-tuned from the Alpamayo R1 base. Each goes through Stage 1 (VLM fine-tune, token CE) → Stage 2 (diffusion expert SFT, flow matching). Every run uses 3 epochs on 8× H200 with DeepSpeed ZeRO-2.
All recipes use the same camera scheme, aligned to NVIDIA's 4-cam slots [0, 1, 2, 6] = (cross_left, front_wide, cross_right, front_tele). Waymo data is re-mapped to these slots; front_tele is emulated by center-cropping Waymo FRONT (~30°/50.4° FOV → 0.595 crop ratio).
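A minimal sketch of the front_tele crop arithmetic described above. The function name `center_crop_box` and the 1920×1280 input size are assumptions for illustration only; the actual preprocessing code is not part of this card.

```python
# Sketch only: reproduces the 30/50.4 crop ratio quoted above.
TELE_FOV_DEG = 30.0          # approximate front_tele FOV (from the text)
WAYMO_FRONT_FOV_DEG = 50.4   # Waymo FRONT FOV (from the text)


def center_crop_box(width, height, ratio):
    """Return (left, top, right, bottom) of a centered crop keeping `ratio` of each side."""
    crop_w = round(width * ratio)
    crop_h = round(height * ratio)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


ratio = TELE_FOV_DEG / WAYMO_FRONT_FOV_DEG   # ~0.595
# 1920x1280 is an assumed input resolution for the FRONT camera.
box = center_crop_box(1920, 1280, ratio)
print(f"ratio={ratio:.3f}, box={box}")
```

The 0.595 figure is the linear ratio of the two angles. A strict pinhole model would use the ratio of the half-angle tangents instead, which gives a slightly smaller crop (about 0.57); the sketch follows the figure stated in this card.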
Recipes
| Recipe | NV component | LT component | Waymo component |
|---|---|---|---|
| R-A | 10k random | 1450 GT (real) | Waymo GT (real, 798 clips) |
| R-B | 10k random | 1450 Ours-synth | Waymo Ours-synth (Ray-WAN, 800 clips) |
| R-C | 10k random | 1450 GT (real) | Waymo Ours-synth (Ray-WAN, 800 clips) |
Final losses (3 epochs each)
| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) |
|---|---|---|
| R-A | 1.277 | 0.385 |
| R-B | 1.281 | 0.403 |
| R-C | 1.279 | 0.400 |
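To read the table at a glance, the gaps relative to R-A can be computed as below. This is only arithmetic on the numbers above; the losses are final training losses on different data mixes, so the percentages are indicative rather than a ranking of model quality.

```python
# Final losses copied from the table above: (Stage 1 token CE, Stage 2 flow matching).
final_loss = {
    "R-A": (1.277, 0.385),
    "R-B": (1.281, 0.403),
    "R-C": (1.279, 0.400),
}

base_s1, base_s2 = final_loss["R-A"]

# Relative gap to R-A in percent, per stage.
rel_gap = {
    name: (100.0 * (s1 - base_s1) / base_s1, 100.0 * (s2 - base_s2) / base_s2)
    for name, (s1, s2) in final_loss.items()
}

for name, (g1, g2) in rel_gap.items():
    print(f"{name}: stage1 {g1:+.2f}%  stage2 {g2:+.2f}%")
```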
Files
Per recipe (e.g. R-A_s2/):
- `model-*-of-*.safetensors`: model weights (~21 GB total)
- `model.safetensors.index.json`: shard index
- `config.json`: Alpamayo R1 model config
- `trainer_state.json`: full training history (loss curve, lr schedule)
- `training_args.bin`, `scheduler.pt`: trainer state
Optimizer states are NOT included (only useful for training resume).
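A small sketch for inspecting a downloaded recipe folder without loading the weights. It assumes the files follow the usual Hugging Face layout (a `weight_map` and `metadata.total_size` in the shard index, a `log_history` list in `trainer_state.json`); `summarize_checkpoint` is a hypothetical helper, not something shipped with this repo.

```python
import json
from pathlib import Path


def summarize_checkpoint(ckpt_dir):
    """Summarize shard count, total size and logged loss curve of one recipe folder."""
    ckpt_dir = Path(ckpt_dir)

    # The shard index maps each parameter name to the shard file that stores it.
    index = json.loads((ckpt_dir / "model.safetensors.index.json").read_text())
    shards = sorted(set(index["weight_map"].values()))
    total_gb = index.get("metadata", {}).get("total_size", 0) / 1e9

    # The trainer state keeps one log entry per logging step; only some carry "loss".
    state = json.loads((ckpt_dir / "trainer_state.json").read_text())
    loss_curve = [
        (entry["step"], entry["loss"])
        for entry in state.get("log_history", [])
        if "loss" in entry
    ]
    return {"num_shards": len(shards), "total_gb": total_gb, "loss_curve": loss_curve}


# Usage, assuming the folder was downloaded locally:
# summary = summarize_checkpoint("R-A_s2")
# print(summary["num_shards"], summary["total_gb"], summary["loss_curve"][-1])
```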
Loading
```python
from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1

model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-A_s2")
```
Training data sources
- 10k random NV: subset of nominal NVIDIA PAI dataset (4-cam standard).
- 1450 GT: out-of-distribution train bucket from the 10k+OOD subset.
- 1450 Ours-synth: 603 of the 1450 NV clips replaced by Ray-WAN synthesised views.
- Waymo GT: 798 Waymo training clips, FRONT_LEFT/FRONT/FRONT_RIGHT resampled to NV slots [0, 1, 2], with FRONT center-cropped to slot 6 (tele).
- Waymo Ours-synth: 800 clips re-encoded through Ray-WAN-on-Waymo (`waymo_p65v1_arr_185143`), with trajectory poses taken from the original GT clips.
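The Waymo-to-NV slot mapping described in the list above can be sketched as a lookup table. The dictionary and function names are hypothetical, and the frames are stand-in values; only the camera names and slot indices come from this card.

```python
# NV 4-cam slots used by every recipe (from the text).
NV_SLOTS = {0: "cross_left", 1: "front_wide", 2: "cross_right", 6: "front_tele"}

# Waymo camera -> NV slot(s). FRONT feeds two slots: as-is for 1, center-cropped for 6.
WAYMO_TO_SLOTS = {"FRONT_LEFT": [0], "FRONT": [1, 6], "FRONT_RIGHT": [2]}
TELE_SLOT = 6


def remap_to_nv_slots(frames):
    """Map {waymo_camera: frame} to {nv_slot: record}, flagging the slot that needs a crop."""
    out = {}
    for camera, slots in WAYMO_TO_SLOTS.items():
        for slot in slots:
            out[slot] = {
                "name": NV_SLOTS[slot],
                "source": camera,
                "center_crop": slot == TELE_SLOT,  # tele is emulated by cropping FRONT
                "frame": frames[camera],
            }
    return out


# Placeholder strings stand in for image arrays.
slots = remap_to_nv_slots({"FRONT_LEFT": "fl", "FRONT": "f", "FRONT_RIGHT": "fr"})
for idx in sorted(slots):
    print(idx, slots[idx]["name"], "<-", slots[idx]["source"], slots[idx]["center_crop"])
```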
v2 ckpts: Waymo regen (Variant A + Qwen3 per-clip caption)
Re-ran the Waymo synth pipeline with two improvements (`waymo_p65v2_va_190308`):
- Variant A: the source-view intrinsics K are the original Waymo intrinsics (target views still use the PAI K). This fixes the camera-pitch artifact that made the v1 synth look tilted down.
- Qwen3-VL-8B-Instruct per-clip caption: each clip gets its own caption from Qwen3-VL rather than borrowing a PAI prompt that mentions unrelated objects.
R-B and R-C were then re-trained on this new synth (3 epochs each, identical schedule).
| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) | Subfolder |
|---|---|---|---|
| R-B v2 | 1.282 | 0.402 | R-B_v2_s2/ |
| R-C v2 | 1.280 | 0.398 | R-C_v2_s2/ |
Loading v2:
```python
model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-B_v2_s2")
```
R-A is unchanged (GT-only, no synth dependency).
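A hypothetical helper for picking the `subfolder` argument. `R-A_s2`, `R-B_v2_s2` and `R-C_v2_s2` are named in this card; the v1 names `R-B_s2` and `R-C_s2` are an assumption extrapolated from the `R-A_s2/` example and should be checked against the repo file listing.

```python
RECIPES = {"R-A", "R-B", "R-C"}
V2_RECIPES = {"R-B", "R-C"}  # R-A was not re-trained, so it has no v2 checkpoint


def checkpoint_subfolder(recipe, version=1):
    """Return the Stage 2 subfolder name for a recipe and checkpoint version."""
    if recipe not in RECIPES:
        raise ValueError(f"unknown recipe: {recipe!r}")
    if version == 1:
        # Assumed naming pattern, extrapolated from the R-A_s2/ example.
        return f"{recipe}_s2"
    if version == 2:
        if recipe not in V2_RECIPES:
            raise ValueError(f"{recipe} has no v2 checkpoint")
        return f"{recipe}_v2_s2"
    raise ValueError(f"unknown version: {version!r}")


print(checkpoint_subfolder("R-A"))
print(checkpoint_subfolder("R-C", version=2))
```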