---
license: apache-2.0
tags:
- autonomous-driving
- alpamayo
- waymo
- long-tail
---
# Alpamayo R1 + Waymo Fine-tuning — 3-recipe ablation
Three SFT recipes fine-tuned from the Alpamayo R1 base model. Each recipe goes through
Stage 1 (VLM fine-tune, token CE) → Stage 2 (diffusion expert SFT, flow matching).
All runs use 3 epochs, 8× H200, and DeepSpeed ZeRO-2.
All recipes use the same camera scheme aligned to NVIDIA's 4-cam slots
[0, 1, 2, 6] = (cross_left, front_wide, cross_right, front_tele).
Waymo data is re-mapped to these slots; front_tele is emulated by
center-cropping Waymo FRONT (~30° / 50.4° FOV ≈ 0.595 crop ratio).
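
The crop arithmetic above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names and the 1920x1280 frame size are assumptions made for the example. The 0.595 figure matches the plain ratio of the two angles; under an ideal pinhole model the ratio of `tan(FOV / 2)` comes out slightly tighter (about 0.57), so both are shown for comparison.

```python
import math

def crop_ratio_by_angle(target_fov_deg: float, source_fov_deg: float) -> float:
    """Linear approximation: ratio of the two field-of-view angles."""
    return target_fov_deg / source_fov_deg

def crop_ratio_pinhole(target_fov_deg: float, source_fov_deg: float) -> float:
    """Ideal pinhole model: image half-width scales with tan(FOV / 2)."""
    t = math.tan(math.radians(target_fov_deg) / 2)
    s = math.tan(math.radians(source_fov_deg) / 2)
    return t / s

def center_crop_box(width: int, height: int, ratio: float) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a centered crop keeping `ratio` of each side."""
    crop_w, crop_h = round(width * ratio), round(height * ratio)
    left, top = (width - crop_w) // 2, (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h

angle_ratio = crop_ratio_by_angle(30.0, 50.4)    # reproduces the 0.595 figure
pinhole_ratio = crop_ratio_pinhole(30.0, 50.4)   # slightly tighter crop
box = center_crop_box(1920, 1280, angle_ratio)   # frame size is an assumption
```

The resulting box can be passed to any image library's crop call before resizing the patch to the front_tele input resolution.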
## Recipes
| Recipe | NV (NVIDIA PAI) component | LT (long-tail) component | Waymo component |
|---|---|---|---|
| **R-A** | 10k random | 1450 GT (real) | Waymo GT (real, 798 clips) |
| **R-B** | 10k random | 1450 Ours-synth | Waymo Ours-synth (Ray-WAN, 800 clips) |
| **R-C** | 10k random | 1450 GT (real) | Waymo Ours-synth (Ray-WAN, 800 clips) |
## Final losses (3 epochs each)
| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) |
|---|---|---|
| R-A | 1.277 | 0.385 |
| R-B | 1.281 | 0.403 |
| R-C | 1.279 | 0.400 |
## Files
Per recipe (e.g. `R-A_s2/`):
- `model-*-of-*.safetensors`: model weights (~21 GB total)
- `model.safetensors.index.json`: shard index
- `config.json`: Alpamayo R1 model config
- `trainer_state.json`: full training history (loss curve, lr schedule)
- `training_args.bin`, `scheduler.pt`: trainer state
Optimizer states are NOT included (only useful for training resume).
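
To inspect a loss curve from `trainer_state.json`, something like the sketch below may help. It assumes the file follows the usual Hugging Face `Trainer` layout, where `log_history` is a list of dicts and training entries carry `step` and `loss` keys; that layout is not confirmed here, and the helper names are hypothetical.

```python
import json
from pathlib import Path

def loss_history(state: dict) -> list[tuple[int, float]]:
    """Extract (step, loss) pairs from a Trainer state dict.

    Assumes `log_history` is a list of dicts and that training-loss
    entries carry both `step` and `loss` keys.
    """
    return [
        (entry["step"], entry["loss"])
        for entry in state.get("log_history", [])
        if "loss" in entry and "step" in entry
    ]

def load_loss_history(path: str) -> list[tuple[int, float]]:
    """Read a local trainer_state.json and return its loss curve."""
    return loss_history(json.loads(Path(path).read_text()))

# Tiny synthetic example (made-up numbers, not taken from this repo).
example_state = {
    "log_history": [
        {"step": 10, "loss": 2.0, "learning_rate": 1e-5},
        {"step": 20, "loss": 1.5, "learning_rate": 9e-6},
        {"step": 20, "train_runtime": 12.3},  # summary entry without a loss
    ]
}
curve = loss_history(example_state)
```

For a real checkpoint, call `load_loss_history` on the `trainer_state.json` inside the downloaded recipe subfolder.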
## Loading
```python
from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1
model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-A_s2")
```
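
To pre-fetch a single recipe to local disk (for example before moving to an offline node), one option is `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter. The sketch below is one possible approach; `recipe_patterns` and `download_recipe` are hypothetical helper names, not part of this repo.

```python
def recipe_patterns(subfolder: str) -> list[str]:
    """Glob patterns that match only the files under one recipe subfolder."""
    return [f"{subfolder.strip('/')}/*"]

def download_recipe(subfolder: str, repo_id: str = "luuuulinnnn/alpamyo_Waymo") -> str:
    """Download one recipe subfolder and return the local snapshot path."""
    # Imported lazily so recipe_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, allow_patterns=recipe_patterns(subfolder))

patterns = recipe_patterns("R-A_s2/")
# local_dir = download_recipe("R-A_s2")  # network call, roughly 21 GB of weights
```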
## Training data sources
- **10k random NV**: subset of nominal NVIDIA PAI dataset (4-cam standard).
- **1450 GT**: out-of-distribution train bucket from the 10k+OOD subset.
- **1450 Ours-synth**: 603 of the 1450 NV clips replaced by Ray-WAN synthesised views.
- **Waymo GT**: 798 Waymo training clips, FRONT_LEFT/FRONT/FRONT_RIGHT
resampled to NV slots [0, 1, 2], with FRONT center-cropped to slot 6 (tele).
- **Waymo Ours-synth**: 800 clips re-encoded through Ray-WAN-on-Waymo
(`waymo_p65v1_arr_185143`), trajectory poses from the original GT clips.
## v2 checkpoints: Waymo regen (Variant A + Qwen3 per-clip captions)
The Waymo synth pipeline was re-run with two improvements (`waymo_p65v2_va_190308`):
- **Variant A**: the source view uses the original Waymo intrinsics K (target views still
use the PAI K). This fixes the camera-pitch artifact that made the v1 synth look tilted down.
- **Qwen3-VL-8B-Instruct per-clip caption**: each clip gets its own caption from
Qwen3-VL rather than borrowing a PAI prompt that mentions unrelated objects.
R-B and R-C were then re-trained on this new synth data (3 epochs each, identical schedule).
| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) | Subfolder |
|---|---|---|---|
| R-B v2 | 1.282 | 0.402 | `R-B_v2_s2/` |
| R-C v2 | 1.280 | 0.398 | `R-C_v2_s2/` |
Loading v2:
```python
from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1
model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-B_v2_s2")
```
R-A is unchanged (GT-only — no synth dependency).