| # Block Tower Baseline v1: Training Log |
|
|
| ## Mode |
| - run_type: replication |
| - objective: Train DiT diffusion policy on block tower task with 6-dataset DAgger mix |
| |
| ## Config |
| - config: `config/train_block_tower.yaml` |
| - branch: `stage1-multimodal-abstraction` |
| - datasets: 6-dataset mix (1 base + 5 DAgger rounds), ~341k total frames |
| - key settings: batch 64/GPU (256 global), lr 2e-5 cosine, 50k steps, horizon 100, action steps 50 |
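As a rough aid for reading the key settings above, the sketch below computes the learning rate at a given step and the approximate number of passes over the data. It assumes a plain cosine decay from 2e-5 to zero with no warmup, and treats the ~341k frames as the number of training samples; neither assumption is confirmed by the config, and `cosine_lr` and `epochs_seen` are hypothetical helper names.

```python
import math

BASE_LR = 2e-5          # from the key settings
TOTAL_STEPS = 50_000    # from the key settings
GLOBAL_BATCH = 64 * 4   # 64 per GPU on 4 GPUs = 256
TOTAL_FRAMES = 341_000  # approximate size of the 6-dataset mix


def cosine_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, min_lr=0.0):
    """Cosine decay from base_lr to min_lr (assumed: no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def epochs_seen(step, global_batch=GLOBAL_BATCH, total_frames=TOTAL_FRAMES):
    """Approximate passes over the data, assuming one sample per frame."""
    return step * global_batch / total_frames


if __name__ == "__main__":
    for step in (0, 25_000, 35_000, 50_000):
        print(f"step {step}: lr={cosine_lr(step):.3e}, epochs={epochs_seen(step):.1f}")
```

Under these assumptions the schedule depends on the total step count, so a resumed run would need to keep the 50k-step horizon to continue along the same curve.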
| |
| ## Dataset Schema |
| |
| State (16D = 7 joint + 9 EEF) and action (17D = 7 joint + 10 EEF including gripper) are asymmetric. Delta actions are computed on the shared 16D prefix. |
| |
| ```yaml |
| dataset_schema: |
|   state: |
|     - key: observation.state # joint positions (7D) |
|       dim: 7 |
|     - key: observation.eef_6d_pose # xyz + rpy → xyz + rot6d (6D → 9D) |
|       dim: 6 |
|       convert_rotation: true |
|   action: |
|     - key: action # joint commands (7D) |
|       dim: 7 |
|     - key: action.eef_pose # xyz + rpy + gripper → xyz + rot6d + gripper (7D → 10D) |
|       dim: 7 |
|       convert_rotation: true |
|   rot6d_slice: [10, 16] |
| ``` |
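The dimension bookkeeping in the schema can be illustrated with a minimal NumPy sketch. It is not the project's implementation: it assumes RPY means extrinsic x-y-z Euler angles (R = Rz·Ry·Rx), that rot6d is the first two columns of the rotation matrix, and that the delta is a plain subtraction of the state from the first 16 action dimensions. All function names are hypothetical.

```python
import numpy as np


def rpy_to_matrix(roll, pitch, yaw):
    """Rotation matrix for extrinsic x-y-z Euler angles (assumed convention)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    return rz @ ry @ rx


def rpy_to_rot6d(rpy):
    """6D rotation: first two columns of the rotation matrix (assumed layout)."""
    rot = rpy_to_matrix(*rpy)
    return np.concatenate([rot[:, 0], rot[:, 1]])


def convert_pose(pose):
    """xyz + rpy (+ trailing extras such as gripper) -> xyz + rot6d (+ extras)."""
    pose = np.asarray(pose, dtype=np.float64)
    return np.concatenate([pose[:3], rpy_to_rot6d(pose[3:6]), pose[6:]])


def build_state(joints, eef_pose):
    """7 joint positions + 6D pose -> 16D state."""
    return np.concatenate([np.asarray(joints, dtype=np.float64), convert_pose(eef_pose)])


def build_action(joint_cmd, eef_pose_gripper):
    """7 joint commands + 7D pose-with-gripper -> 17D action."""
    return np.concatenate(
        [np.asarray(joint_cmd, dtype=np.float64), convert_pose(eef_pose_gripper)]
    )


def delta_action(action, state):
    """Subtract the state from the shared 16D prefix; gripper stays absolute (assumed)."""
    out = np.array(action, dtype=np.float64)
    out[: state.shape[0]] -= state
    return out


if __name__ == "__main__":
    state = build_state(np.zeros(7), np.zeros(6))
    action = build_action(np.zeros(7), [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
    print(state.shape, action.shape)  # 16D state, 17D action
    print(state[10:16])               # rot6d occupies indices 10..15
```

With 7 joint values followed by xyz, the rot6d components land at indices 10 to 15 in both vectors, which matches `rot6d_slice: [10, 16]`.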
| |
| ## Job History |
|
|
| | # | Submitted | Status | Notes | |
| |---|-----------|--------|-------| |
| | 1 | ~Apr 8 | Failed | `KeyError: 'observation.state.pos'`: hardcoded key names didn't match the block tower dataset schema | |
| | 2 | Apr 9 12:54 | Cancelled | Ramen stats stuck at ~10 it/s over 341k samples (~9.5h ETA), video decoding bottleneck | |
| | 3 | Apr 9 13:00 | Cancelled | Bulk parquet stats worked, but unauthenticated HF requests hit 429 rate limits | |
| | 4 | Apr 9 13:00 | Timeout | Clean run, trained 35k/50k steps before hitting 24h walltime | |
|
|
| ## Job (run 4) |
| - submitted: 2026-04-09T13:00:12Z |
| - start_human: Thursday, Apr 9th, 2026 |
| - end: 2026-04-10T13:00:24Z |
| - end_human: Friday, Apr 10th, 2026 |
| - runtime: 24:00:12 (walltime limit) |
| - hardware: 1 node, 4x GH200 (Isambard AIP2) |
|
|
| ## Status |
| - Apr 9 13:05 - running, loss ~1.12 at step 1 |
| - Apr 9 16:16 - checkpoint_5000 saved |
| - Apr 9 19:30 - checkpoint_10000 saved |
| - Apr 9 22:43 - checkpoint_15000 saved |
| - Apr 10 01:56 - checkpoint_20000 saved |
| - Apr 10 05:10 - checkpoint_25000 saved |
| - Apr 10 08:23 - checkpoint_30000 saved |
| - Apr 10 11:37 - checkpoint_35000 saved |
| - Apr 10 13:00 - TIMEOUT, killed by scheduler at walltime limit |
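The checkpoint timestamps above are enough to estimate throughput and the time needed for the remaining steps. The sketch below uses only the logged times (minute resolution, year taken from the job timestamps) and assumes throughput stays constant on resume; the helper names are hypothetical.

```python
from datetime import datetime, timedelta

# (step, wall-clock time) pairs copied from the status log.
CHECKPOINTS = [
    (5_000, "2026-04-09 16:16"),
    (10_000, "2026-04-09 19:30"),
    (15_000, "2026-04-09 22:43"),
    (20_000, "2026-04-10 01:56"),
    (25_000, "2026-04-10 05:10"),
    (30_000, "2026-04-10 08:23"),
    (35_000, "2026-04-10 11:37"),
]
TIME_FORMAT = "%Y-%m-%d %H:%M"


def seconds_per_step(checkpoints):
    """Average seconds per step between the first and last checkpoint."""
    (first_step, first_time), (last_step, last_time) = checkpoints[0], checkpoints[-1]
    elapsed = datetime.strptime(last_time, TIME_FORMAT) - datetime.strptime(
        first_time, TIME_FORMAT
    )
    return elapsed.total_seconds() / (last_step - first_step)


def eta(remaining_steps, sec_per_step):
    """Estimated wall-clock time for the remaining steps at constant throughput."""
    return timedelta(seconds=remaining_steps * sec_per_step)


if __name__ == "__main__":
    rate = seconds_per_step(CHECKPOINTS)
    print(f"{rate:.2f} s/step")
    print("remaining 15k steps:", eta(15_000, rate))
    print("full 50k steps:", eta(50_000, rate))
```

At roughly 2.3 s/step, the remaining 15k steps come to a little under 10 hours, while a full 50k-step run needs about 32 hours, which is why a single 24h allocation could not finish it.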
| |
| ## Results |
| - final step: ~35,000 / 50,000 |
| - checkpoints saved: 5k, 10k, 15k, 20k, 25k, 30k, 35k |
| - loss at step 1: ~1.12 |
| - loss_one_liner: Loss was decreasing steadily; run was cut short by walltime before completion. |
| |
| ## W&B |
| - synced: https://wandb.ai/pravsels/dit_block_tower/runs/pv8q64et |
| |
| ## Key Changes This Session |
| |
| 1. **Task-based DatasetSchema**: Replaced hardcoded key names with per-task YAML schema declaring keys, dimensions, and RPY→rot6d conversion. Supports asymmetric state/action dims. |
| 2. **Bulk parquet Ramen stats**: Reads numerical columns directly from `hf_dataset` (parquet), bypassing video decoding. ~440x speedup (9.5h → ~30s for 341k frames). |
| 3. **HF token passthrough**: Reads `~/.hf_token` on HPC and passes as `HF_TOKEN` env var to the container, avoiding 429 rate limits. |
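Changes 2 and 3 can be sketched as follows. This is an illustration, not the code from this session. `bulk_stats` takes a plain dict of per-frame rows as a stand-in for the numeric parquet columns, since how the columns are actually read from `hf_dataset` is not shown here. `env_with_hf_token` builds an environment mapping to hand to whatever launches the container. Both function names and the choice of statistics (mean, std, min, max) are assumptions.

```python
import os
from pathlib import Path

import numpy as np


def column_stats(rows):
    """Per-dimension statistics over a (num_frames, dim) array of numeric rows."""
    data = np.asarray(rows, dtype=np.float64)
    return {
        "mean": data.mean(axis=0),
        "std": data.std(axis=0),
        "min": data.min(axis=0),
        "max": data.max(axis=0),
    }


def bulk_stats(columns, keys):
    """Compute stats for the selected numeric columns in one vectorized pass each.

    `columns` maps a key such as "action" to its per-frame rows. No video frame
    is decoded, which is the point of reading the tabular columns directly.
    """
    return {key: column_stats(columns[key]) for key in keys}


def env_with_hf_token(token_path="~/.hf_token", base_env=None):
    """Return a copy of the environment with HF_TOKEN set from a token file, if present."""
    env = dict(os.environ if base_env is None else base_env)
    path = Path(token_path).expanduser()
    if path.is_file():
        token = path.read_text().strip()
        if token:
            env["HF_TOKEN"] = token
    return env


if __name__ == "__main__":
    toy_columns = {"action": [[0.0, 2.0], [2.0, 4.0]]}  # toy data, not from the run
    print(bulk_stats(toy_columns, ["action"])["action"]["mean"])
    print("HF_TOKEN" in env_with_hf_token())
```

The returned mapping could then be passed as the `env` argument of `subprocess.run` when starting the container, so authenticated requests are used instead of anonymous ones.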
|
|
| ## Next |
| - Resume from checkpoint_35000 for remaining 15k steps |
| - Evaluate checkpoints on `villekuosmanen/eval_build_block_tower_dino_test_set` (5 episodes) |
| |