	# Block Tower Baseline v1 — Training Log

	## Mode
	- run_type: replication
	- objective: Train DiT diffusion policy on block tower task with 6-dataset DAgger mix

	## Config
	- config: `config/train_block_tower.yaml`
	- branch: `stage1-multimodal-abstraction`
	- datasets: 6-dataset mix (1 base + 5 DAgger rounds), ~341k total frames
	- key settings: batch 64/GPU (256 global), lr 2e-5 cosine, 50k steps, horizon 100, action steps 50

	## Dataset Schema

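	State is 16D (7 joint positions + 9D end-effector pose) and action is 17D (7 joint commands + 10D end-effector pose including gripper), so the two are asymmetric. Delta actions are computed on the shared 16D prefix.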

	```yaml
	dataset_schema:
	  state:
	    - key: observation.state          # joint positions (7D)
	      dim: 7
	    - key: observation.eef_6d_pose    # xyz + rpy → xyz + rot6d (6D → 9D)
	      dim: 6
	      convert_rotation: true
	  action:
	    - key: action                     # joint commands (7D)
	      dim: 7
	    - key: action.eef_pose            # xyz + rpy + gripper → xyz + rot6d + gripper (7D → 10D)
	      dim: 7
	      convert_rotation: true
	  rot6d_slice: [10, 16]
	```
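The dimension bookkeeping in the schema can be checked with a small sketch. This is not the repository's conversion code: the RPY convention (`R = Rz(yaw) @ Ry(pitch) @ Rx(roll)`), the rot6d layout (first two columns of the rotation matrix), and the helper names `rpy_to_matrix` and `pose_rpy_to_rot6d` are all assumptions made for illustration.

```python
import math


def rpy_to_matrix(roll, pitch, yaw):
    """3x3 rotation matrix R = Rz(yaw) @ Ry(pitch) @ Rx(roll) (assumed convention)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def pose_rpy_to_rot6d(pose):
    """Map [x, y, z, roll, pitch, yaw, *extras] to [x, y, z, rot6d (6 values), *extras]."""
    rot = rpy_to_matrix(*pose[3:6])
    first_col = [rot[r][0] for r in range(3)]   # assumed rot6d layout:
    second_col = [rot[r][1] for r in range(3)]  # first two matrix columns
    return list(pose[:3]) + first_col + second_col + list(pose[6:])


joints = [0.0] * 7
# State: 7 joint values + (xyz + rot6d) = 7 + 9 = 16
state = joints + pose_rpy_to_rot6d([0.1, 0.2, 0.3, 0.0, 0.0, 0.0])
# Action: 7 joint values + (xyz + rot6d + gripper) = 7 + 10 = 17
action = joints + pose_rpy_to_rot6d([0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 1.0])

print(len(state), len(action))  # 16 17
print(action[10:16])            # the rot6d entries, matching rot6d_slice: [10, 16]
```

Under these assumptions the rot6d entries land at indices 10 to 15 in both vectors (7 joints + 3 position values come first), which is consistent with `rot6d_slice: [10, 16]` and with the gripper being the single extra action dimension.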

	## Job History

	| # | Submitted | Status | Notes |
	|---|-----------|--------|-------|
	| 1 | ~Apr 8 | Failed | `KeyError: 'observation.state.pos'` (hardcoded key names didn't match block tower dataset schema) |
	| 2 | Apr 9 12:54 | Cancelled | Ramen stats stuck at ~10 it/s over 341k samples (~9.5h ETA), video decoding bottleneck |
	| 3 | Apr 9 13:00 | Cancelled | Bulk parquet stats worked, but unauthenticated HF requests hit 429 rate limits |
	| 4 | Apr 9 13:00 | Timeout | Clean run, trained 35k/50k steps before hitting 24h walltime |
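The ETA quoted for job 2 follows directly from the sample count and the observed rate. A minimal check, where `eta_hours` is a hypothetical helper name:

```python
def eta_hours(num_samples, items_per_second):
    """Hours needed to process num_samples at a constant rate."""
    return num_samples / items_per_second / 3600


# ~341k frames at ~10 it/s, the figures from the job 2 row
print(f"{eta_hours(341_000, 10):.1f} h")  # 9.5 h
```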

	## Job (run 4)
	- submitted: 2026-04-09T13:00:12Z
	- start_human: Thursday, Apr 9th, 2026
	- end: 2026-04-10T13:00:24Z
	- end_human: Friday, Apr 10th, 2026
	- runtime: 24:00:12 (walltime limit)
	- hardware: 1 node, 4x GH200 (Isambard AIP2)

	## Status
	- Apr 9 13:05 — running, loss ~1.12 at step 1
	- Apr 9 16:16 — checkpoint_5000 saved
	- Apr 9 19:30 — checkpoint_10000 saved
	- Apr 9 22:43 — checkpoint_15000 saved
	- Apr 10 01:56 — checkpoint_20000 saved
	- Apr 10 05:10 — checkpoint_25000 saved
	- Apr 10 08:23 — checkpoint_30000 saved
	- Apr 10 11:37 — checkpoint_35000 saved
	- Apr 10 13:00 — TIMEOUT, killed by scheduler at walltime limit
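The checkpoint timestamps above are enough to estimate training throughput and the time still needed. The sketch below uses only the first and last checkpoint times from this log and assumes the rate stays constant; `steps_per_second` is a hypothetical helper, not part of the training code.

```python
from datetime import datetime

# (step, wall-clock time) pairs copied from the status entries above
CHECKPOINTS = [
    (5_000, "2026-04-09 16:16"),
    (10_000, "2026-04-09 19:30"),
    (15_000, "2026-04-09 22:43"),
    (20_000, "2026-04-10 01:56"),
    (25_000, "2026-04-10 05:10"),
    (30_000, "2026-04-10 08:23"),
    (35_000, "2026-04-10 11:37"),
]
TOTAL_STEPS = 50_000


def steps_per_second(checkpoints):
    """Average rate between the first and last checkpoint."""
    fmt = "%Y-%m-%d %H:%M"
    (first_step, first_time), (last_step, last_time) = checkpoints[0], checkpoints[-1]
    elapsed = datetime.strptime(last_time, fmt) - datetime.strptime(first_time, fmt)
    return (last_step - first_step) / elapsed.total_seconds()


rate = steps_per_second(CHECKPOINTS)
remaining_h = (TOTAL_STEPS - CHECKPOINTS[-1][0]) / rate / 3600
full_run_h = TOTAL_STEPS / rate / 3600

print(f"rate: {rate:.2f} steps/s")
print(f"remaining 15k steps: {remaining_h:.1f} h")
print(f"full 50k-step run: {full_run_h:.1f} h")
```

At roughly 0.43 steps/s, the remaining 15k steps would take a little under 10 hours, and a full 50k-step run would need about 32 hours, which is why a single 24h walltime allocation could not finish it.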

	## Results
	- final step: ~35,000 / 50,000
	- checkpoints saved: 5k, 10k, 15k, 20k, 25k, 30k, 35k
	- loss at step 1: ~1.12
	- loss_one_liner: Loss was decreasing steadily; run was cut short by walltime before completion.

	## W&B
	- synced: https://wandb.ai/pravsels/dit_block_tower/runs/pv8q64et

	## Key Changes This Session

	1. Task-based DatasetSchema: Replaced hardcoded key names with per-task YAML schema declaring keys, dimensions, and RPY→rot6d conversion. Supports asymmetric state/action dims.
	2. Bulk parquet Ramen stats: Reads numerical columns directly from `hf_dataset` (parquet), bypassing video decoding. ~440x speedup (9.5h → ~30s for 341k frames).
	3. HF token passthrough: Reads `~/.hf_token` on HPC and passes as `HF_TOKEN` env var to the container, avoiding 429 rate limits.
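The idea behind change 2 can be sketched as follows: compute normalization statistics from the numeric columns alone, so no video frame is ever decoded. This is an illustration rather than the project's Ramen stats code. `running_stats` and `parquet_column_stats` are hypothetical names, the sketch assumes each row of the column is a fixed-length list of floats, and it uses population standard deviation, which may differ from what the real pipeline computes.

```python
import math


def running_stats(rows):
    """Per-dimension count/mean/std/min/max over an iterable of equal-length vectors.

    Uses Welford's online update, so rows can be streamed without holding them all.
    """
    count = 0
    mean = m2 = lo = hi = None
    for row in rows:
        if count == 0:
            mean = [0.0] * len(row)
            m2 = [0.0] * len(row)
            lo, hi = list(row), list(row)
        count += 1
        for i, x in enumerate(row):
            delta = x - mean[i]
            mean[i] += delta / count
            m2[i] += delta * (x - mean[i])
            lo[i] = min(lo[i], x)
            hi[i] = max(hi[i], x)
    if count == 0:
        raise ValueError("no rows to compute statistics from")
    std = [math.sqrt(v / count) for v in m2]  # population std (assumption)
    return {"count": count, "mean": mean, "std": std, "min": lo, "max": hi}


def parquet_column_stats(parquet_path, key):
    """Stats for one numeric column of a parquet file; requires pyarrow."""
    import pyarrow.parquet as pq  # imported lazily so running_stats works without it

    table = pq.read_table(parquet_path)
    return running_stats(table.column(key).to_pylist())


print(running_stats([[1.0, 2.0], [3.0, 6.0]]))
```

A call such as `parquet_column_stats(path, "observation.state")` would then return the statistics for the 7D joint positions, with `path` pointing at one of the dataset's parquet files.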

	## Next
	- Resume from checkpoint_35000 for remaining 15k steps
	- Evaluate checkpoints on `villekuosmanen/eval_build_block_tower_dino_test_set` (5 episodes)