Block Tower Baseline v1 β Training Log
Mode
- run_type: replication
- objective: Train DiT diffusion policy on block tower task with 6-dataset DAgger mix
Config
- config:
config/train_block_tower.yaml - branch:
stage1-multimodal-abstraction - datasets: 6-dataset mix (1 base + 5 DAgger rounds), ~341k total frames
- key settings: batch 64/GPU (256 global), lr 2e-5 cosine, 50k steps, horizon 100, action steps 50
Dataset Schema
State (16D) and action (17D) are asymmetric. Delta actions computed on the shared 16D prefix.
dataset_schema:
state:
- key: observation.state # joint positions (7D)
dim: 7
- key: observation.eef_6d_pose # xyz + rpy β xyz + rot6d (6D β 9D)
dim: 6
convert_rotation: true
action:
- key: action # joint commands (7D)
dim: 7
- key: action.eef_pose # xyz + rpy + gripper β xyz + rot6d + gripper (7D β 10D)
dim: 7
convert_rotation: true
rot6d_slice: [10, 16]
Job History
| # | Submitted | Status | Notes |
|---|---|---|---|
| 1 | ~Apr 8 | Failed | KeyError: 'observation.state.pos' β hardcoded key names didn't match block tower dataset schema |
| 2 | Apr 9 12:54 | Cancelled | Ramen stats stuck at |
| 3 | Apr 9 13:00 | Cancelled | Bulk parquet stats worked, but unauthenticated HF requests hit 429 rate limits |
| 4 | Apr 9 13:00 | Timeout | Clean run, trained 35k/50k steps before hitting 24h walltime |
Job (run 4)
- submitted: 2026-04-09T13:00:12Z
- start_human: Wednesday, Apr 9th, 2026
- end: 2026-04-10T13:00:24Z
- end_human: Thursday, Apr 10th, 2026
- runtime: 24:00:12 (walltime limit)
- hardware: 1 node, 4x GH200 (Isambard AIP2)
Status
- Apr 9 13:05 β running, loss ~1.12 at step 1
- Apr 9 16:16 β checkpoint_5000 saved
- Apr 9 19:30 β checkpoint_10000 saved
- Apr 9 22:43 β checkpoint_15000 saved
- Apr 10 01:56 β checkpoint_20000 saved
- Apr 10 05:10 β checkpoint_25000 saved
- Apr 10 08:23 β checkpoint_30000 saved
- Apr 10 11:37 β checkpoint_35000 saved
- Apr 10 13:00 β TIMEOUT, killed by scheduler at walltime limit
Results
- final step: ~35,000 / 50,000
- checkpoints saved: 5k, 10k, 15k, 20k, 25k, 30k, 35k
- loss at step 1: ~1.12
- loss_one_liner: Loss was decreasing steadily; run was cut short by walltime before completion.
W&B
Key Changes This Session
- Task-based DatasetSchema: Replaced hardcoded key names with per-task YAML schema declaring keys, dimensions, and RPYβrot6d conversion. Supports asymmetric state/action dims.
- Bulk parquet Ramen stats: Reads numerical columns directly from
hf_dataset(parquet), bypassing video decoding. ~440x speedup (9.5h β ~30s for 341k frames). - HF token passthrough: Reads
~/.hf_tokenon HPC and passes asHF_TOKENenv var to the container, avoiding 429 rate limits.
Next
- Resume from checkpoint_35000 for remaining 15k steps
- Evaluate checkpoints on
villekuosmanen/eval_build_block_tower_dino_test_set(5 episodes)