---
license: apache-2.0
tags:
- autonomous-driving
- alpamayo
- waymo
- long-tail
---
# Alpamayo R1 + Waymo Fine-tuning: 3-recipe ablation

Three SFT recipes fine-tuned from the Alpamayo R1 base model. Each recipe goes through
Stage 1 (VLM fine-tune, token cross-entropy) → Stage 2 (diffusion expert SFT, flow matching).
All runs use 3 epochs on 8× H200 with DeepSpeed ZeRO-2.
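
For readers unfamiliar with the two objectives named above, here is a minimal NumPy sketch of a token cross-entropy loss and a flow-matching loss. It is a generic illustration, not the Alpamayo R1 training code: the straight-line noise-to-data path, the uniform time sampling, and the names `token_cross_entropy`, `flow_matching_loss`, and `velocity_fn` are all assumptions made for this example.

```python
import numpy as np

def token_cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target token ids under softmax(logits)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def flow_matching_loss(velocity_fn, x1, rng):
    """MSE between predicted and target velocity on a straight noise-to-data path.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1, whose velocity is x1 - x0.
    Other parameterisations exist; the one used for these checkpoints is not stated.
    """
    x0 = rng.standard_normal(x1.shape)                               # noise sample
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))      # one time per sample
    xt = (1.0 - t) * x0 + t * x1                                     # point on the path
    target_v = x1 - x0                                               # constant path velocity
    pred_v = velocity_fn(xt, t)
    return float(np.mean((pred_v - target_v) ** 2))
```

In this sketch `velocity_fn` stands in for the diffusion expert and `x1` for a batch of target trajectories. The absolute loss values depend on normalisation and weighting choices that this card does not describe.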

All recipes use the same camera scheme, aligned to NVIDIA's 4-cam slots
[0, 1, 2, 6] = (cross_left, front_wide, cross_right, front_tele).
Waymo data is remapped to these slots; front_tele is emulated by
center-cropping Waymo FRONT (~30° / 50.4° FOV ≈ 0.595 crop ratio).
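
The crop ratio can be reproduced with a few lines of Python. The sketch below assumes both FOV values are horizontal and that the same ratio is applied to width and height; the function names and the 1920×1280 placeholder frame are illustrative and not taken from the training pipeline. The quoted 0.595 matches the plain ratio of the two angles, and a pinhole-model calculation (ratio of half-angle tangents) is included for comparison because it gives a slightly smaller value.

```python
import math
import numpy as np

def angle_ratio(target_fov_deg, source_fov_deg):
    """Crop ratio as the plain ratio of the two field-of-view angles."""
    return target_fov_deg / source_fov_deg

def pinhole_ratio(target_fov_deg, source_fov_deg):
    """Crop ratio under a pinhole model: ratio of the half-angle tangents."""
    return (math.tan(math.radians(target_fov_deg) / 2)
            / math.tan(math.radians(source_fov_deg) / 2))

def center_crop(image, ratio):
    """Keep the central `ratio` fraction of an H x W (x C) image along both axes."""
    h, w = image.shape[:2]
    ch, cw = round(h * ratio), round(w * ratio)
    top, left = (h - ch) // 2, (w - cw) // 2
    return image[top:top + ch, left:left + cw]

# Placeholder frame standing in for a Waymo FRONT image (size is an assumption).
frame = np.zeros((1280, 1920, 3), dtype=np.uint8)
tele = center_crop(frame, angle_ratio(30.0, 50.4))
```

Any resize of the cropped view to the model's input resolution would follow this step; that part is not described here and is left out of the sketch.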

## Recipes

| Recipe | NV component | LT component | Waymo component |
|---|---|---|---|
| **R-A** | 10k random | 1450 GT (real) | Waymo GT (real, 798 clips) |
| **R-B** | 10k random | 1450 Ours-synth | Waymo Ours-synth (Ray-WAN, 800 clips) |
| **R-C** | 10k random | 1450 GT (real) | Waymo Ours-synth (Ray-WAN, 800 clips) |

## Final losses (3 epochs each)

| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) |
|---|---|---|
| R-A | 1.277 | 0.385 |
| R-B | 1.281 | 0.403 |
| R-C | 1.279 | 0.400 |

## Files

Per recipe (e.g. `R-A_s2/`):
- `model-*-of-*.safetensors`: model weights (~21 GB total)
- `model.safetensors.index.json`: shard index
- `config.json`: Alpamayo R1 model config
- `trainer_state.json`: full training history (loss curve, lr schedule)
- `training_args.bin`, `scheduler.pt`: trainer state

Optimizer states are not included; they are only needed to resume training.
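
To inspect the training history without loading the model, the loss curve can be read straight from `trainer_state.json`. This sketch assumes the file follows the usual Hugging Face `Trainer` layout, with a `log_history` list whose training entries carry `step` and `loss` keys; `load_loss_curve` is a helper name made up for this example.

```python
import json
from pathlib import Path

def load_loss_curve(trainer_state_path):
    """Return (step, loss) pairs from a Trainer-style trainer_state.json.

    Entries without a "loss" key (for example evaluation logs) are skipped.
    """
    state = json.loads(Path(trainer_state_path).read_text())
    return [(entry["step"], entry["loss"])
            for entry in state.get("log_history", [])
            if "loss" in entry]

# Example usage (path is illustrative):
# curve = load_loss_curve("R-A_s2/trainer_state.json")
# print(curve[-1])  # last logged (step, loss)
```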

## Loading

```python
from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1
model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-A_s2")
```
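
Since each recipe is about 21 GB, it can be worth fetching a single subfolder rather than the whole repository. The sketch below uses `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter; the helper names `subfolder_patterns` and `download_recipe` are hypothetical, and nothing here is specific to the Alpamayo loader.

```python
def subfolder_patterns(subfolder):
    """Glob patterns matching every file under one recipe subfolder."""
    return [f"{subfolder.strip('/')}/*"]

def download_recipe(repo_id, subfolder, local_dir=None):
    """Download only one recipe subfolder and return the local snapshot path."""
    # Imported lazily so subfolder_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=subfolder_patterns(subfolder),
        local_dir=local_dir,
    )

# Example usage (downloads ~21 GB):
# path = download_recipe("luuuulinnnn/alpamyo_Waymo", "R-A_s2")
```

The returned path is the snapshot root, so the recipe files sit in the `R-A_s2/` directory beneath it.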

## Training data sources

- **10k random NV**: subset of the nominal NVIDIA PAI dataset (standard 4-cam layout).
- **1450 GT**: out-of-distribution train bucket from the 10k+OOD subset.
- **1450 Ours-synth**: 603 of the 1450 NV clips are replaced by Ray-WAN synthesised views.
- **Waymo GT**: 798 Waymo training clips, with FRONT_LEFT/FRONT/FRONT_RIGHT
  resampled to NV slots [0, 1, 2] and FRONT center-cropped into slot 6 (tele).
- **Waymo Ours-synth**: 800 clips re-encoded through Ray-WAN-on-Waymo
  (`waymo_p65v1_arr_185143`), with trajectory poses taken from the original GT clips.

## v2 ckpts: Waymo regen (Variant A + Qwen3 per-clip caption)

The Waymo synth pipeline was re-run with two improvements (`waymo_p65v2_va_190308`):
- **Variant A**: source-view K = original Waymo intrinsics (target views still use PAI K).
  This fixes the camera-pitch artifact that made the v1 synth look tilted down.
- **Qwen3-VL-8B-Instruct per-clip caption**: each clip gets its own caption from
  Qwen3-VL rather than borrowing a PAI prompt that mentions unrelated objects.

R-B and R-C were then re-trained on the new synth (3 epochs each, identical schedule).

| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) | Subfolder |
|---|---|---|---|
| R-B v2 | 1.282 | 0.402 | `R-B_v2_s2/` |
| R-C v2 | 1.280 | 0.398 | `R-C_v2_s2/` |

Loading v2:
```python
from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1
model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-B_v2_s2")
```

R-A is unchanged (GT only, no synth dependency).