---
license: apache-2.0
tags:
- autonomous-driving
- alpamayo
- waymo
- long-tail
---

# Alpamayo R1 + Waymo Fine-tuning: 3-recipe ablation

Three SFT recipes fine-tuned from the Alpamayo R1 base, each going through Stage 1 (VLM fine-tune, token CE) → Stage 2 (diffusion expert SFT, flow matching). Every run uses 3 epochs, 8× H200, and DeepSpeed ZeRO-2.

All recipes use the same camera scheme, aligned to NVIDIA's 4-cam slots [0, 1, 2, 6] = (cross_left, front_wide, cross_right, front_tele). Waymo data is re-mapped to these slots; front_tele is emulated by center-cropping Waymo FRONT (~30° / 50.4° FOV ≈ 0.595 crop ratio).

## Recipes

| Recipe | NV component | LT component | Waymo component |
|---|---|---|---|
| **R-A** | 10k random | 1450 GT (real) | Waymo GT (real, 798 clips) |
| **R-B** | 10k random | 1450 Ours-synth | Waymo Ours-synth (Ray-WAN, 800 clips) |
| **R-C** | 10k random | 1450 GT (real) | Waymo Ours-synth (Ray-WAN, 800 clips) |

## Final losses (3 epochs each)

| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) |
|---|---|---|
| R-A | 1.277 | 0.385 |
| R-B | 1.281 | 0.403 |
| R-C | 1.279 | 0.400 |

## Files

Per recipe (e.g. `R-A_s2/`):

- `model-*-of-*.safetensors`: model weights (~21 GB total)
- `model.safetensors.index.json`: shard index
- `config.json`: Alpamayo R1 model config
- `trainer_state.json`: full training history (loss curve, lr schedule)
- `training_args.bin`, `scheduler.pt`: trainer state

Optimizer states are NOT included (they are only useful for resuming training).

## Loading

```python
from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1

model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-A_s2")
```

## Training data sources

- **10k random NV**: subset of the nominal NVIDIA PAI dataset (4-cam standard).
- **1450 GT**: out-of-distribution train bucket from the 10k+OOD subset.
- **1450 Ours-synth**: 603 of the 1450 NV clips replaced by Ray-WAN synthesised views.
- **Waymo GT**: 798 Waymo training clips, FRONT_LEFT/FRONT/FRONT_RIGHT resampled to NV slots [0, 1, 2], with FRONT center-cropped to slot 6 (tele).
- **Waymo Ours-synth**: 800 clips re-encoded through Ray-WAN-on-Waymo (`waymo_p65v1_arr_185143`), with trajectory poses taken from the original GT clips.

## v2 ckpts: Waymo regen (Variant A + Qwen3 per-clip caption)

Re-ran the Waymo synth pipeline with two improvements (`waymo_p65v2_va_190308`):

- **Variant A**: source-view K = original Waymo intrinsics (target views still use PAI K). This fixes the camera-pitch artifact that made the v1 synth look tilted down.
- **Qwen3-VL-8B-Instruct per-clip caption**: each clip gets its own caption from Qwen3-VL rather than borrowing a PAI prompt that mentions unrelated objects.

R-B and R-C were then re-trained on this new synth (3 epochs each, identical schedule).

| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) | Subfolder |
|---|---|---|---|
| R-B v2 | 1.282 | 0.402 | `R-B_v2_s2/` |
| R-C v2 | 1.280 | 0.398 | `R-C_v2_s2/` |

Loading v2:

```python
from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1

model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-B_v2_s2")
```

R-A is unchanged (GT-only, no synth dependency).
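
## Camera re-mapping sketch

The camera scheme described at the top (Waymo cameras re-mapped to NV slots, front_tele emulated by a center crop of FRONT) can be illustrated roughly as below. This is a minimal sketch, not the pipeline used to build the data: the function names `center_crop` and `remap_waymo_frame` are hypothetical, and the resampling to PAI intrinsics and any resize of the cropped image back to the target resolution are omitted. The crop ratio is taken as the plain FOV ratio quoted above (30 / 50.4 ≈ 0.595); under a strict pinhole model the ratio of tan(FOV/2) would be slightly smaller (about 0.57).

```python
import numpy as np

# Slot indices as listed in this README.
NV_SLOTS = {"cross_left": 0, "front_wide": 1, "cross_right": 2, "front_tele": 6}

# Waymo camera -> NV slot name, following the "Waymo GT" description.
WAYMO_TO_NV = {
    "FRONT_LEFT": "cross_left",
    "FRONT": "front_wide",
    "FRONT_RIGHT": "cross_right",
}

TELE_FOV_DEG = 30.0          # approximate front_tele FOV
WAYMO_FRONT_FOV_DEG = 50.4   # Waymo FRONT FOV quoted in this README
CROP_RATIO = TELE_FOV_DEG / WAYMO_FRONT_FOV_DEG  # ~0.595 (assumption: plain FOV ratio)


def center_crop(image: np.ndarray, ratio: float) -> np.ndarray:
    """Return the central region covering `ratio` of each spatial dimension."""
    h, w = image.shape[:2]
    ch, cw = int(round(h * ratio)), int(round(w * ratio))
    top, left = (h - ch) // 2, (w - cw) // 2
    return image[top:top + ch, left:left + cw]


def remap_waymo_frame(frames: dict) -> dict:
    """Map {Waymo camera name: HxWxC array} to {NV slot index: array}."""
    out = {NV_SLOTS[WAYMO_TO_NV[name]]: frames[name] for name in WAYMO_TO_NV}
    # Slot 6 (front_tele) is emulated from the FRONT camera.
    out[NV_SLOTS["front_tele"]] = center_crop(frames["FRONT"], CROP_RATIO)
    return out
```

For example, passing three frames keyed by `FRONT_LEFT`, `FRONT`, and `FRONT_RIGHT` yields a dict keyed by slots 0, 1, 2, and 6, where slot 6 holds the cropped FRONT image.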