# DiT Block Tower Baseline v1

Diffusion Transformer policy for the **build block tower** task, trained on 6 datasets (1 base + 5 DAgger rounds, ~341k human-control frames).

**Status:** Partial run. 35,000 of 50,000 steps completed before the job hit its 24h walltime limit. Loss was still decreasing at cutoff.

## Model

| | |
|---|---|
| Architecture | Diffusion Transformer (DiT) |
| Vision encoder | CLIP ViT-B/16 (per-camera, lr_mult=0.1) |
| Text encoder | CLIP ViT-B/16 |
| Transformer | 512 hidden, 6 layers, 8 heads |
| Diffusion | DDPM, 100 steps, squaredcos_cap_v2 |
| State dim | 16 (7 joint pos + 9 eef rot6d) |
| Action dim | 17 (7 joint cmd + 9 eef rot6d + 1 gripper) |
| Cameras | front (480x640), wrist (480x640) |
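For readers unfamiliar with the `squaredcos_cap_v2` name in the Diffusion row, the following is a minimal sketch of the squared-cosine beta schedule as it is commonly defined (for example in Hugging Face `diffusers`). It assumes this repo uses that same definition with a 0.999 cap; the function name `squaredcos_cap_v2_betas` is illustrative and not taken from the repo.

```python
import math


def squaredcos_cap_v2_betas(num_steps=100, max_beta=0.999):
    """Squared-cosine noise schedule, with each beta capped at max_beta.

    Assumption: this mirrors the schedule usually called
    'squaredcos_cap_v2'; the repo's own implementation may differ.
    """

    def alpha_bar(t):
        # Cumulative signal level at normalized time t in [0, 1].
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_steps):
        t1 = i / num_steps
        t2 = (i + 1) / num_steps
        # beta_i = 1 - alpha_bar(t2) / alpha_bar(t1), capped for stability.
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


betas = squaredcos_cap_v2_betas(100)
```

The cap matters only near the final step, where the uncapped value would approach 1.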

## Training

| Parameter | Value |
|-----------|-------|
| Batch size | 64 per GPU (256 global, 4x GH200) |
| Train steps | 50,000 (35,000 completed) |
| Learning rate | 2e-5, cosine schedule |
| Warmup | 500 steps |
| Horizon | 100 |
| Action steps | 50 |
| Obs steps | 2 |
| AMP | enabled |
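
Because the run stopped at step 35,000, it can help to estimate where the learning rate stood at cutoff. The sketch below assumes linear warmup from zero over 500 steps followed by cosine decay to zero at step 50,000; the table does not state the warmup shape or the final learning rate, so both are assumptions, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step, peak_lr=2e-5, warmup_steps=500, total_steps=50_000):
    """Learning rate under linear warmup plus cosine decay to zero.

    Assumptions: warmup is linear from 0 and the cosine decays to 0;
    the actual scheduler in the training code may differ.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


lr_at_cutoff = lr_at_step(35_000)
```

Under these assumptions the learning rate at the cutoff is roughly a fifth of the peak value, so the checkpoint was saved partway through the decay rather than at its end.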

## Datasets

| Dataset | Role |
|---------|------|
| `villekuosmanen/build_block_tower` | Base demonstrations |
| `villekuosmanen/dAgger_build_block_tower_1.0.0` | DAgger round 1 |
| `villekuosmanen/dAgger_build_block_tower_1.1.0` | DAgger round 2 |
| `villekuosmanen/dAgger_build_block_tower_1.2.0` | DAgger round 3 |
| `villekuosmanen/dAgger_build_block_tower_1.3.0` | DAgger round 4 |
| `villekuosmanen/dAgger_build_block_tower_1.4.0` | DAgger round 5 |

Frames recorded while the DAgger policy was in control are filtered out via `ControlModePlugin`; only human-control frames are used for training.
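
The filtering idea can be illustrated with a small sketch. It does not reproduce `ControlModePlugin`: the per-frame labels `"human"` and `"policy"` and the helper `human_control_indices` are hypothetical, chosen only to show how a list of valid frame indices could be derived from control-mode annotations.

```python
def human_control_indices(control_modes, human_label="human"):
    """Return indices of frames recorded under human control.

    Hypothetical: the label values and the flat list layout are
    assumptions, not the format used by ControlModePlugin.
    """
    return [i for i, mode in enumerate(control_modes) if mode == human_label]


# Toy episode: the policy drives frames 1-2, a human drives the rest.
modes = ["human", "policy", "policy", "human", "human"]
valid = human_control_indices(modes)
```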

## Files

```
README.md
TRAINING_LOG.md
assets/
  ramen_stats.pt          # Normalization statistics
  valid_indices.json      # Per-dataset valid frame indices after DAgger filtering
checkpoints/
  35000/
    model.safetensors     # Model weights (inference + fine-tuning)
    config.json           # Resolved model config
```

## Checkpoint Integrity

```
sha256 (checkpoint files):
6192188a  config.json
8f00265f  model.safetensors
df43463f  ramen_stats.pt
```

Full hashes:
```
6192188a6a705cb6ab1632234a1b4724935d42b311c1d01fff16b0eee5c00e4a  config.json
8f00265f043db4bf520441bf8eec07b6ccdcbff41f6db7a4852dea25218d2ac0  model.safetensors
df43463ff96e90b952fb3e7bc971cd7c584308acfab82ba29d0560318e2b9d2d  ramen_stats.pt
```

Reproduce with:
```bash
# Run from the repository root; the subshells keep the working directory unchanged.
(cd checkpoints/35000 && sha256sum config.json model.safetensors)
(cd assets && sha256sum ramen_stats.pt)
```
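
On systems without `sha256sum`, the same check can be done with the Python standard library. This is a minimal sketch that assumes the directory layout shown under Files and is run from the repository root; `sha256_of` and `verify` are illustrative helper names.

```python
import hashlib
from pathlib import Path

# Full hashes copied from the list above, keyed by path relative to the repo root.
EXPECTED = {
    "checkpoints/35000/config.json":
        "6192188a6a705cb6ab1632234a1b4724935d42b311c1d01fff16b0eee5c00e4a",
    "checkpoints/35000/model.safetensors":
        "8f00265f043db4bf520441bf8eec07b6ccdcbff41f6db7a4852dea25218d2ac0",
    "assets/ramen_stats.pt":
        "df43463ff96e90b952fb3e7bc971cd7c584308acfab82ba29d0560318e2b9d2d",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(root=".", expected=EXPECTED):
    """Map each expected path to True if it exists and its hash matches."""
    results = {}
    for rel, want in expected.items():
        path = Path(root) / rel
        results[rel] = path.is_file() and sha256_of(path) == want
    return results


if __name__ == "__main__":
    for rel, ok in verify().items():
        print("OK      " if ok else "MISMATCH", rel)
```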

## W&B

Training curves: https://wandb.ai/pravsels/dit_block_tower/runs/pv8q64et

## Usage

This checkpoint is from the [multitask_dit_policy](https://github.com/pravsels/multitask_dit_policy) repo, branch `stage1-multimodal-abstraction`.
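
Before loading the weights with the repo's own code, the checkpoint can be inspected using only the standard library. The sketch below relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header); the helper names `read_safetensors_header` and `count_parameters` are illustrative and not part of the repo.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the tensor table of a .safetensors file without loading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Optional free-form metadata is not a tensor entry.
    header.pop("__metadata__", None)
    return header


def count_parameters(header):
    """Sum the element counts of all tensors listed in the header."""
    total = 0
    for info in header.values():
        n = 1
        for dim in info["shape"]:
            n *= dim
        total += n
    return total
```

For example, `count_parameters(read_safetensors_header("checkpoints/35000/model.safetensors"))` gives the total number of stored tensor elements, which is a quick sanity check that the download is the expected model.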