| # Training Log — Block Tower Norm Fix |
|
|
| ## Overview |
| - run_type: replication |
| - objective: retrain block tower from scratch with per-timestep (H,D) RAMEN action stats and semantic cleanup (action chunk starts at current action) |
| |
| ## Config |
| - config: `config/train_block_tower.yaml` |
| - dataset: `villekuosmanen/build_block_tower` + DAgger rounds 1.0.0–1.4.0 |
| - key settings: batch_size=80 per GPU (320 global), train_steps=50000, optimizer_lr=3e-4, warmup=500, save_freq=1000, keep_freq=5000, num_workers=8, prefetch_factor=2, horizon=32, n_action_steps=32, DDIM, resize_shape=[224,224], crop_shape=null |
| - what changed vs prior run: |
| - `compute_ramen_stats` now emits (H=32, D=17) action stats instead of (1, 17) |
| - action chunk semantic cleanup: slot 0 = act[t] - obs[t] (first executable action), no look-back prefix |
| - config consolidated from `train_block_tower_bs320_lr3e4.yaml` into `train_block_tower.yaml` |
| - fresh training from step 0 (old checkpoints semantically incompatible) |
|
|
| ## Training |
| - hardware: 4x GH200 GPUs (1 node) |
| - start: 2026-04-17 17:48 UTC |
| - end: 2026-04-18 17:48 UTC (walltime limit) |
| - runtime: 1 day 0h 0m 29s |
|
|
| ## Results |
| - final step: ~29588/50000 |
| - start_train_loss: 1.04 |
| - end_train_loss: 0.0047 |
| - loss_one_liner: Loss dropped steadily from 1.04 to 0.0047 over ~29.5k steps; healthy progression, no sign of plateau or overfitting. |
|
|
| ## W\&B |
| - [Training dashboard](https://wandb.ai/pravsels/dit_block_tower_norm_fix/runs/ksuxe451) |
|
|
| ## Next |
| - resume from checkpoint_29000 to complete remaining ~21k steps |
| |