Training Log — Block Tower Norm Fix

Overview

run_type: replication
objective: retrain block tower from scratch with per-timestep (H,D) RAMEN action stats and semantic cleanup (action chunk starts at current action)

Config

config: config/train_block_tower.yaml
dataset: villekuosmanen/build_block_tower + DAgger rounds 1.0.0–1.4.0
key settings:
- batch_size=80 per GPU (320 global)
- train_steps=50000
- optimizer_lr=3e-4
- warmup=500
- save_freq=1000
- keep_freq=5000
- num_workers=8
- prefetch_factor=2
- horizon=32
- n_action_steps=32
- DDIM
- resize_shape=[224,224]
- crop_shape=null
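The batch and checkpoint settings imply a few derived quantities. The following is a minimal sketch of that arithmetic, assuming the GPU count of 4 listed under Training and assuming checkpoints are written at exact multiples of save_freq. The helper names are hypothetical and are not taken from the training code.

```python
def global_batch(per_gpu_batch: int, n_gpus: int) -> int:
    """Effective batch size when every GPU processes its own mini-batch."""
    return per_gpu_batch * n_gpus


def samples_seen(step: int, batch: int) -> int:
    """Number of training samples consumed after `step` optimizer steps."""
    return step * batch


def latest_checkpoint_step(step: int, save_freq: int) -> int:
    """Most recent multiple of save_freq at or below `step` (assumed save rule)."""
    return (step // save_freq) * save_freq


batch = global_batch(80, 4)                       # 80 per GPU on 4 GPUs
full_run_samples = samples_seen(50_000, batch)    # samples for the full train_steps budget
last_ckpt = latest_checkpoint_step(29_588, 1000)  # 29588 is the final step reported under Results

print(batch, full_run_samples, last_ckpt)
```

Under these assumptions the last checkpoint written before the walltime limit is the one named in the Next section.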
what changed vs prior run:
- compute_ramen_stats now emits (H=32, D=17) action stats instead of (1, 17)
- action chunk semantic cleanup: slot 0 = act[t] - obs[t] (first executable action), no look-back prefix
- config consolidated from train_block_tower_bs320_lr3e4.yaml into train_block_tower.yaml
- fresh training from step 0 (old checkpoints semantically incompatible)
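The first two changes can be illustrated with a small NumPy sketch. This is not the repository's compute_ramen_stats. It is a hedged illustration that assumes every slot k of a chunk is expressed relative to the current observation, i.e. chunk[k] = act[t+k] - obs[t] (the log only confirms slot 0), and it uses a synthetic random walk in place of the real dataset. The function names relative_chunks, per_timestep_stats and pooled_stats are hypothetical.

```python
import numpy as np


def relative_chunks(actions: np.ndarray, states: np.ndarray, horizon: int) -> np.ndarray:
    """Build chunks of shape (N, H, D) with chunk[i, k] = actions[i + k] - states[i].

    Slot 0 is act[t] - obs[t], the first executable action; no look-back prefix.
    """
    n = actions.shape[0] - horizon + 1
    idx = np.arange(n)[:, None] + np.arange(horizon)[None, :]  # (N, H) time indices
    return actions[idx] - states[:n, None, :]


def per_timestep_stats(chunks: np.ndarray, eps: float = 1e-6):
    """Mean and std per (slot, dimension): both have shape (H, D)."""
    return chunks.mean(axis=0), chunks.std(axis=0) + eps


def pooled_stats(chunks: np.ndarray, eps: float = 1e-6):
    """Mean and std pooled over all slots: both have shape (1, D)."""
    flat = chunks.reshape(-1, chunks.shape[-1])
    return flat.mean(axis=0, keepdims=True), flat.std(axis=0, keepdims=True) + eps


# Synthetic stand-in for the dataset: a random walk with D=17 dims (assumption).
rng = np.random.default_rng(0)
H, D, T = 32, 17, 2000
traj = np.cumsum(rng.normal(size=(T + 1, D)), axis=0)
states, actions = traj[:-1], traj[1:]  # action at t targets the next state

chunks = relative_chunks(actions, states, H)
mean_hd, std_hd = per_timestep_stats(chunks)
mean_1d, std_1d = pooled_stats(chunks)

# Normalizing with (H, D) stats gives every slot roughly unit scale.
normalized = (chunks - mean_hd) / std_hd

print(mean_hd.shape, mean_1d.shape)
print(std_hd[0].mean(), std_hd[-1].mean())  # spread grows with the slot index
```

In this toy setting the spread of later slots is larger than that of slot 0, so a single (1, D) statistic cannot scale all slots equally well, which is the kind of mismatch per-timestep stats are meant to address.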

Training

hardware: 4x GH200 GPUs (1 node)
start: 2026-04-17 17:48 UTC
end: 2026-04-18 17:48 UTC (walltime limit)
runtime: 1 day 0h 0m 29s

Results

final step: ~29588/50000
start_train_loss: 1.04
end_train_loss: 0.0047
loss_one_liner: Training loss dropped steadily from 1.04 to 0.0047 over ~29.6k steps (about 59% of the planned 50k); healthy progression, no sign of plateau or overfitting.

W&B

Training dashboard

Next

resume from checkpoint_29000 to complete the remaining ~21k steps (50000 - 29000)
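A rough walltime estimate for the resumed run can be derived from the numbers in this log. The sketch below assumes the resumed job sustains the same average throughput as the first run and ignores startup, data loading and checkpointing overhead. The function names are hypothetical.

```python
def steps_per_second(steps_done: int, runtime_s: float) -> float:
    """Average optimizer steps per second over a finished run."""
    return steps_done / runtime_s


def eta_hours(remaining_steps: int, rate: float) -> float:
    """Hours needed for `remaining_steps` at `rate` steps per second."""
    return remaining_steps / rate / 3600.0


runtime_s = 24 * 3600 + 29                 # runtime: 1 day 0h 0m 29s
rate = steps_per_second(29_588, runtime_s)  # final step reached in the first run
remaining = 50_000 - 29_000                 # resume point is checkpoint_29000
eta = eta_hours(remaining, rate)

print(f"{rate:.3f} steps/s, {remaining} steps left, ~{eta:.1f} h")
```

At the first run's pace this comes to roughly 17 hours, so under these assumptions the remaining steps would fit inside one more job with the same 24 hour walltime limit.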