# DiT Block Tower Baseline v1

Diffusion Transformer policy for the **build block tower** task, trained on 6 datasets (1 base + 5 DAgger rounds, ~341k human-control frames).

**Status:** Partial run. 35,000 of 50,000 steps completed before the job hit the 24h walltime limit. Loss was still decreasing at cutoff.

## Model

| Component | Value |
|---|---|
| Architecture | Diffusion Transformer (DiT) |
| Vision encoder | CLIP ViT-B/16 (per-camera, lr_mult=0.1) |
| Text encoder | CLIP ViT-B/16 |
| Transformer | 512 hidden, 6 layers, 8 heads |
| Diffusion | DDPM, 100 steps, squaredcos_cap_v2 |
| State dim | 16 (7 joint pos + 9 eef rot6d) |
| Action dim | 17 (7 joint cmd + 9 eef rot6d + 1 gripper) |
| Cameras | front (480x640), wrist (480x640) |

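In Hugging Face `diffusers`, `squaredcos_cap_v2` names the cosine noise schedule with betas capped at 0.999. The sketch below recomputes that schedule for 100 steps in plain Python, assuming this checkpoint follows the `diffusers` definition. The function name `squaredcos_cap_v2_betas` is illustrative and not taken from the training code.

```python
import math


def squaredcos_cap_v2_betas(num_steps: int = 100, max_beta: float = 0.999) -> list:
    """Cosine beta schedule (assumed to match the diffusers definition)."""

    def alpha_bar(t: float) -> float:
        # Cumulative signal level at normalized time t in [0, 1].
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_steps):
        t1 = i / num_steps
        t2 = (i + 1) / num_steps
        # Per-step noise variance, capped to avoid a singular final step.
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


betas = squaredcos_cap_v2_betas(100)

# Cumulative product of (1 - beta): how much of the clean action signal
# survives after each forward-diffusion step.
alphas_cumprod = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alphas_cumprod.append(prod)
```

The betas start small and grow toward the cap, so most of the action signal is destroyed only in the last few of the 100 steps.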
## Training

| Parameter | Value |
|-----------|-------|
| Batch size | 64 per GPU (256 global, 4x GH200) |
| Train steps | 50,000 (35,000 completed) |
| Learning rate | 2e-5, cosine schedule |
| Warmup | 500 steps |
| Horizon | 100 |
| Action steps | 50 |
| Obs steps | 2 |
| AMP | enabled |

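To show what the partial run means for the optimizer, here is a minimal sketch of the learning-rate schedule implied by the table. It assumes linear warmup from zero and cosine decay to zero over the full 50,000 planned steps. The exact warmup shape and final learning rate used in training are not stated here, so treat both as assumptions, along with the helper name `lr_at`.

```python
import math

PEAK_LR = 2e-5
WARMUP_STEPS = 500
TOTAL_STEPS = 50_000        # planned steps; the run stopped at 35,000
GLOBAL_BATCH = 64 * 4       # 64 per GPU on 4 GPUs


def lr_at(step: int) -> float:
    """Learning rate at a given step (assumed: linear warmup, cosine to 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# Training samples drawn by the time the run was cut off.
samples_seen = 35_000 * GLOBAL_BATCH
```

Under these assumptions, `lr_at(35_000)` is roughly a fifth of the peak rate, so the run stopped well before the schedule had fully annealed.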
## Datasets

| Dataset | Role |
|---------|------|
| `villekuosmanen/build_block_tower` | Base demonstrations |
| `villekuosmanen/dAgger_build_block_tower_1.0.0` | DAgger round 1 |
| `villekuosmanen/dAgger_build_block_tower_1.1.0` | DAgger round 2 |
| `villekuosmanen/dAgger_build_block_tower_1.2.0` | DAgger round 3 |
| `villekuosmanen/dAgger_build_block_tower_1.3.0` | DAgger round 4 |
| `villekuosmanen/dAgger_build_block_tower_1.4.0` | DAgger round 5 |

Frames recorded under policy control during the DAgger rounds are filtered out via `ControlModePlugin`; only human-control frames are used for training.

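The filtering idea can be sketched as follows. This is not the `ControlModePlugin` API, which is not documented here. It is a hypothetical stand-in that assumes each frame carries a control-mode label and that human-controlled frames are labeled `"human"`.

```python
def human_control_indices(control_modes, human_label="human"):
    """Return indices of frames recorded under human control.

    `control_modes` is a hypothetical per-frame list of labels; the real
    dataset field name and label values may differ.
    """
    return [i for i, mode in enumerate(control_modes) if mode == human_label]


# Toy episode: the policy drives, then a human takes over to correct it.
modes = ["policy", "policy", "human", "human", "policy", "human"]
keep = human_control_indices(modes)
```

Here `keep` holds the positions of the human-correction frames, which is the kind of per-dataset index list that `valid_indices.json` is described as storing.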
## Files

```
README.md
TRAINING_LOG.md
assets/
  ramen_stats.pt          # Normalization statistics
  valid_indices.json      # Per-dataset valid frame indices after DAgger filtering
checkpoints/
  35000/
    model.safetensors     # Model weights (inference + fine-tuning)
    config.json           # Resolved model config
```

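Before loading anything, it can help to confirm that a download contains the files listed above. The sketch below is a small standalone check. The relative paths assume that `ramen_stats.pt` and `valid_indices.json` sit under `assets/` and that the weights and config sit under `checkpoints/35000/`. The helper `missing_files` is illustrative and not part of the repo.

```python
from pathlib import Path

# Paths relative to the checkpoint root (assumed layout).
EXPECTED_FILES = [
    "assets/ramen_stats.pt",
    "assets/valid_indices.json",
    "checkpoints/35000/model.safetensors",
    "checkpoints/35000/config.json",
]


def missing_files(root: Path) -> list:
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files(Path("."))` from the checkpoint root should return an empty list for a complete download.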
## Checkpoint Integrity

```
sha256 (checkpoint files):
6192188a  config.json
8f00265f  model.safetensors
df43463f  ramen_stats.pt
```

Full hashes:
```
6192188a6a705cb6ab1632234a1b4724935d42b311c1d01fff16b0eee5c00e4a  config.json
8f00265f043db4bf520441bf8eec07b6ccdcbff41f6db7a4852dea25218d2ac0  model.safetensors
df43463ff96e90b952fb3e7bc971cd7c584308acfab82ba29d0560318e2b9d2d  ramen_stats.pt
```

Reproduce with (run from the repository root):
```bash
# Subshells keep each cd local, so the second command still starts at the root.
(cd checkpoints/35000 && sha256sum config.json model.safetensors)
(cd assets && sha256sum ramen_stats.pt)
```

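Where `sha256sum` is unavailable, the same check can be done with Python's standard library. This is a minimal sketch using the full hashes listed above. The relative paths assume the directory layout from the Files section, and the helper names `sha256_of` and `verify` are illustrative.

```python
import hashlib
from pathlib import Path

# Full SHA-256 hashes from this README, keyed by path relative to the root.
EXPECTED_SHA256 = {
    "checkpoints/35000/config.json":
        "6192188a6a705cb6ab1632234a1b4724935d42b311c1d01fff16b0eee5c00e4a",
    "checkpoints/35000/model.safetensors":
        "8f00265f043db4bf520441bf8eec07b6ccdcbff41f6db7a4852dea25218d2ac0",
    "assets/ramen_stats.pt":
        "df43463ff96e90b952fb3e7bc971cd7c584308acfab82ba29d0560318e2b9d2d",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(root: Path, expected: dict = EXPECTED_SHA256) -> dict:
    """Map each expected file to True if present and its hash matches."""
    results = {}
    for rel, want in expected.items():
        path = Path(root) / rel
        results[rel] = path.is_file() and sha256_of(path) == want
    return results
```

Running `verify(Path("."))` from the repository root should report `True` for all three files if the download is intact.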
## W&B

Training curves: https://wandb.ai/pravsels/dit_block_tower/runs/pv8q64et

## Usage

This checkpoint was produced with the [multitask_dit_policy](https://github.com/pravsels/multitask_dit_policy) repo, branch `stage1-multimodal-abstraction`.