dit_block_tower_baseline / TRAINING_LOG.md
pravsels's picture
Upload TRAINING_LOG.md with huggingface_hub
6292558 verified

Block Tower Baseline v1 β€” Training Log

Mode

  • run_type: replication
  • objective: Train DiT diffusion policy on block tower task with 6-dataset DAgger mix

Config

  • config: config/train_block_tower.yaml
  • branch: stage1-multimodal-abstraction
  • datasets: 6-dataset mix (1 base + 5 DAgger rounds), ~341k total frames
  • key settings: batch 64/GPU (256 global), lr 2e-5 cosine, 50k steps, horizon 100, action steps 50

Dataset Schema

State (16D) and action (17D) are asymmetric. Delta actions computed on the shared 16D prefix.

dataset_schema:
  state:
    - key: observation.state          # joint positions (7D)
      dim: 7
    - key: observation.eef_6d_pose    # xyz + rpy β†’ xyz + rot6d (6D β†’ 9D)
      dim: 6
      convert_rotation: true
  action:
    - key: action                     # joint commands (7D)
      dim: 7
    - key: action.eef_pose            # xyz + rpy + gripper β†’ xyz + rot6d + gripper (7D β†’ 10D)
      dim: 7
      convert_rotation: true
  rot6d_slice: [10, 16]

Job History

# Submitted Status Notes
1 ~Apr 8 Failed KeyError: 'observation.state.pos' β€” hardcoded key names didn't match block tower dataset schema
2 Apr 9 12:54 Cancelled Ramen stats stuck at 10 it/s over 341k samples (9.5h ETA), video decoding bottleneck
3 Apr 9 13:00 Cancelled Bulk parquet stats worked, but unauthenticated HF requests hit 429 rate limits
4 Apr 9 13:00 Timeout Clean run, trained 35k/50k steps before hitting 24h walltime

Job (run 4)

  • submitted: 2026-04-09T13:00:12Z
  • start_human: Wednesday, Apr 9th, 2026
  • end: 2026-04-10T13:00:24Z
  • end_human: Thursday, Apr 10th, 2026
  • runtime: 24:00:12 (walltime limit)
  • hardware: 1 node, 4x GH200 (Isambard AIP2)

Status

  • Apr 9 13:05 β€” running, loss ~1.12 at step 1
  • Apr 9 16:16 β€” checkpoint_5000 saved
  • Apr 9 19:30 β€” checkpoint_10000 saved
  • Apr 9 22:43 β€” checkpoint_15000 saved
  • Apr 10 01:56 β€” checkpoint_20000 saved
  • Apr 10 05:10 β€” checkpoint_25000 saved
  • Apr 10 08:23 β€” checkpoint_30000 saved
  • Apr 10 11:37 β€” checkpoint_35000 saved
  • Apr 10 13:00 β€” TIMEOUT, killed by scheduler at walltime limit

Results

  • final step: ~35,000 / 50,000
  • checkpoints saved: 5k, 10k, 15k, 20k, 25k, 30k, 35k
  • loss at step 1: ~1.12
  • loss_one_liner: Loss was decreasing steadily; run was cut short by walltime before completion.

W&B

Key Changes This Session

  1. Task-based DatasetSchema: Replaced hardcoded key names with per-task YAML schema declaring keys, dimensions, and RPY→rot6d conversion. Supports asymmetric state/action dims.
  2. Bulk parquet Ramen stats: Reads numerical columns directly from hf_dataset (parquet), bypassing video decoding. ~440x speedup (9.5h β†’ ~30s for 341k frames).
  3. HF token passthrough: Reads ~/.hf_token on HPC and passes as HF_TOKEN env var to the container, avoiding 429 rate limits.

Next

  • Resume from checkpoint_35000 for remaining 15k steps
  • Evaluate checkpoints on villekuosmanen/eval_build_block_tower_dino_test_set (5 episodes)