	---
	license: apache-2.0
	tags:
	- autonomous-driving
	- alpamayo
	- waymo
	- long-tail
	---
	# Alpamayo R1 + Waymo Fine-tuning — 3-recipe ablation

	Three SFT recipes fine-tuned from the Alpamayo R1 base model. Each recipe runs
	Stage 1 (VLM fine-tune, token cross-entropy) followed by Stage 2 (diffusion
	expert SFT, flow matching). All runs use 3 epochs on 8× H200 with DeepSpeed ZeRO-2.

	All recipes use the same camera scheme, aligned to NVIDIA's 4-cam slots
	[0, 1, 2, 6] = (cross_left, front_wide, cross_right, front_tele).
	Waymo data is remapped to these slots; front_tele is emulated by
	center-cropping Waymo FRONT (crop ratio ≈ 30° / 50.4° FOV ≈ 0.595).
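
A minimal sketch of the slot mapping and the tele emulation described above. The slot names and FOV figures come from this README; the names `SLOT_TO_WAYMO` and `center_crop_box`, and the 1920×1280 input size used in the example, are assumptions for illustration and not the actual preprocessing code.

```python
# NVIDIA 4-cam slot indices and the camera each slot holds (from this README).
NV_SLOTS = {0: "cross_left", 1: "front_wide", 2: "cross_right", 6: "front_tele"}

# Waymo camera feeding each slot; slot 6 reuses FRONT, center-cropped.
SLOT_TO_WAYMO = {0: "FRONT_LEFT", 1: "FRONT", 2: "FRONT_RIGHT", 6: "FRONT"}

TELE_FOV_DEG = 30.0          # approximate front_tele FOV
WAYMO_FRONT_FOV_DEG = 50.4   # Waymo FRONT FOV
CROP_RATIO = TELE_FOV_DEG / WAYMO_FRONT_FOV_DEG  # about 0.595


def center_crop_box(width, height, ratio=CROP_RATIO):
    """Return (left, top, right, bottom) of a centered crop that keeps
    `ratio` of the width and of the height."""
    crop_w = round(width * ratio)
    crop_h = round(height * ratio)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


if __name__ == "__main__":
    # Assumed input resolution for the example.
    print(center_crop_box(1920, 1280))
```

The resulting box can be passed to any image library's crop call and then resized to the resolution expected for slot 6.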

	## Recipes

	| Recipe | NV component | LT component | Waymo component |
	|---|---|---|---|
	| R-A | 10k random | 1450 GT (real) | Waymo GT (real, 798 clips) |
	| R-B | 10k random | 1450 Ours-synth | Waymo Ours-synth (Ray-WAN, 800 clips) |
	| R-C | 10k random | 1450 GT (real) | Waymo Ours-synth (Ray-WAN, 800 clips) |
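
For a quick view of how much synthetic data each recipe contains, the table can be written as a small data structure. This is only a sketch: it assumes "10k" means exactly 10,000 clips, and it uses the 603-of-1450 figure from the "Training data sources" section for the Ours-synth LT bucket. The field and function names are hypothetical.

```python
# Clip counts per recipe. "nv" assumes 10k == 10_000 (an assumption).
RECIPES = {
    "R-A": {"nv": 10_000, "lt": 1450, "lt_synth": 0,   "waymo": 798, "waymo_synth": 0},
    "R-B": {"nv": 10_000, "lt": 1450, "lt_synth": 603, "waymo": 800, "waymo_synth": 800},
    "R-C": {"nv": 10_000, "lt": 1450, "lt_synth": 0,   "waymo": 800, "waymo_synth": 800},
}


def total_clips(recipe):
    """Total number of training clips in a recipe."""
    r = RECIPES[recipe]
    return r["nv"] + r["lt"] + r["waymo"]


def synth_clips(recipe):
    """Number of clips that contain Ray-WAN synthesised views."""
    r = RECIPES[recipe]
    return r["lt_synth"] + r["waymo_synth"]


if __name__ == "__main__":
    for name in RECIPES:
        frac = synth_clips(name) / total_clips(name)
        print(f"{name}: {synth_clips(name)}/{total_clips(name)} synth ({frac:.1%})")
```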

	## Final losses (3 epochs each)

	| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) |
	|---|---|---|
	| R-A | 1.277 | 0.385 |
	| R-B | 1.281 | 0.403 |
	| R-C | 1.279 | 0.400 |

	## Files

	Per recipe (e.g. `R-A_s2/`):
	- `model-*-of-*.safetensors`: sharded model weights (~21 GB total)
	- `model.safetensors.index.json`: shard index
	- `config.json`: Alpamayo R1 model config
	- `trainer_state.json`: full training history (loss curve, lr schedule)
	- `training_args.bin`, `scheduler.pt`: trainer state

	Optimizer states are NOT included (only useful for training resume).
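
To inspect the loss curve, `trainer_state.json` can be parsed directly. The sketch below assumes the file follows the Hugging Face `Trainer` layout, where a top-level `log_history` list holds one dict per logging step with `step` and `loss` keys; this is suggested by the file names but not confirmed here. The helper name `load_loss_curve` is hypothetical.

```python
import json


def load_loss_curve(path):
    """Return [(step, loss), ...] from a Trainer-style trainer_state.json."""
    with open(path, "r", encoding="utf-8") as f:
        state = json.load(f)
    # log_history can also hold eval or summary entries without a "loss" key;
    # keep only the training-loss entries.
    return [
        (entry["step"], entry["loss"])
        for entry in state.get("log_history", [])
        if "loss" in entry and "step" in entry
    ]
```

Usage would look like `load_loss_curve("R-A_s2/trainer_state.json")` after downloading the recipe folder.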

	## Loading

	```python
	from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1

	# Load the R-A checkpoint from its subfolder in the Hub repo.
	model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-A_s2")
	```

	## Training data sources

	- 10k random NV: subset of nominal NVIDIA PAI dataset (4-cam standard).
	- 1450 GT: out-of-distribution train bucket from the 10k+OOD subset.
	- 1450 Ours-synth: 603 of the 1450 NV clips replaced by Ray-WAN synthesised views.
	- Waymo GT: 798 Waymo training clips, FRONT_LEFT/FRONT/FRONT_RIGHT
	resampled to NV slots [0, 1, 2], with FRONT center-cropped to slot 6 (tele).
	- Waymo Ours-synth: 800 clips re-encoded through Ray-WAN-on-Waymo
	(`waymo_p65v1_arr_185143`), trajectory poses from the original GT clips.

	## v2 ckpts — Waymo regen (Variant A + Qwen3 per-clip caption)

	Re-ran the Waymo synth pipeline with two improvements (`waymo_p65v2_va_190308`):
	- Variant A: the source-view intrinsics K are now the original Waymo intrinsics
	(target views still use the PAI K). This fixes the camera-pitch artifact that
	made the v1 synth look tilted down.
	- Qwen3-VL-8B-Instruct per-clip caption: each clip gets its own caption from
	Qwen3-VL rather than borrowing a PAI prompt that mentions unrelated objects.

	Then re-trained R-B and R-C on this new synth (3 epochs each, identical schedule).

	| Recipe | Stage 1 (token CE) | Stage 2 (flow matching) | Subfolder |
	|---|---|---|---|
	| R-B v2 | 1.282 | 0.402 | `R-B_v2_s2/` |
	| R-C v2 | 1.280 | 0.398 | `R-C_v2_s2/` |
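
As a quick arithmetic check, the v2 final losses can be compared with the v1 numbers from the "Final losses" table. The values below are copied from this README; the helper name `loss_deltas` is hypothetical.

```python
# (stage 1 token CE, stage 2 flow matching) final losses from this README.
V1 = {"R-B": (1.281, 0.403), "R-C": (1.279, 0.400)}
V2 = {"R-B": (1.282, 0.402), "R-C": (1.280, 0.398)}


def loss_deltas(v1, v2):
    """Return {recipe: (stage1_delta, stage2_delta)} as v2 minus v1."""
    return {
        name: (round(v2[name][0] - v1[name][0], 3),
               round(v2[name][1] - v1[name][1], 3))
        for name in v1
    }


if __name__ == "__main__":
    for name, (d1, d2) in loss_deltas(V1, V2).items():
        print(f"{name}: stage 1 {d1:+.3f}, stage 2 {d2:+.3f}")
```

All deltas are within 0.002 in absolute value, so the final training losses alone do not separate v1 from v2.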

	Loading v2:
	```python
	from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1

	model = AlpamayoR1.from_pretrained("luuuulinnnn/alpamyo_Waymo", subfolder="R-B_v2_s2")
	```
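
Since each recipe is about 21 GB, it can help to fetch only one subfolder. The sketch below uses `huggingface_hub.snapshot_download` with `allow_patterns`. The v1 subfolder names for R-B and R-C (`R-B_s2`, `R-C_s2`) are assumed from the `R-A_s2/` example and are not listed explicitly in this README; the helper names are hypothetical.

```python
import os

REPO_ID = "luuuulinnnn/alpamyo_Waymo"


def recipe_subfolder(recipe, version=1):
    """Map a recipe name and version to its subfolder name (assumed naming)."""
    if recipe not in {"R-A", "R-B", "R-C"}:
        raise ValueError(f"unknown recipe: {recipe}")
    if version == 2:
        if recipe == "R-A":
            raise ValueError("R-A has no v2 checkpoint")
        return f"{recipe}_v2_s2"
    if version != 1:
        raise ValueError(f"unknown version: {version}")
    return f"{recipe}_s2"


def download_recipe(recipe, version=1, repo_id=REPO_ID):
    """Download a single recipe folder and return its local path."""
    # Imported here so the naming helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    sub = recipe_subfolder(recipe, version)
    local_root = snapshot_download(repo_id=repo_id, allow_patterns=[f"{sub}/*"])
    return os.path.join(local_root, sub)
```

For example, `download_recipe("R-C", version=2)` would fetch only `R-C_v2_s2/`.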

	R-A is unchanged (GT-only — no synth dependency).