# Training Report: 2x2 Rubik's Cube Solver via Imitation Learning

## Key Facts

| Property | Value |
|---|---|
| Model | Transformer (GPT-style), D=8, dim=512, 8 heads, 25.4M params |
| Task | 2x2 Rubik's Cube solving via imitation learning |
| Input | 24 sticker colors (flat encoding) + last 3 moves as history |
| Output | Single token from 19 classes (18 MOVE\_face\_turn + DONE) |
| Training | Supervised (cross-entropy) + auxiliary value head (MSE, weight 0.5) predicting distance-to-goal |
| DAgger | Mid-training on-policy collection at 50% of training: rollout on 200 random cubes, query teacher for corrections, adds ~2500 on-policy examples |
| Teacher | dwalton76/rubiks-cube-NxNxN-solver (optimal solver) |
| Evaluation | Rollout on 256 held-out scrambled cubes with hybrid search (model score + residual heuristic + state avoidance + no-inverse rule) |
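
The 19-class output space factors as 6 faces x 3 turn amounts plus a DONE token. A minimal sketch of such a vocabulary (the concrete token names and turn labels are assumptions; the report only specifies the 18+1 structure):

```python
# Sketch of the 19-class action vocabulary: 6 faces x 3 turns + DONE.
# The token naming scheme MOVE_{face}_{turn} is an assumption.
FACES = ["U", "D", "L", "R", "F", "B"]
TURNS = ["CW", "CCW", "180"]  # quarter turn, inverse quarter turn, half turn

ACTIONS = [f"MOVE_{f}_{t}" for f in FACES for t in TURNS] + ["DONE"]
ACTION_TO_ID = {a: i for i, a in enumerate(ACTIONS)}

assert len(ACTIONS) == 19
```

With a joint vocabulary like this, the model emits one token per step instead of separate face and turn tokens, which is the change exp10 (below) credits for its move-accuracy jump.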

## Results Progression

### Phase 1: Early Experiments (0% solve rate)

Early runs used structured multi-token output formats with unconstrained, then constrained, decoding. The model learned to predict individual moves with steadily rising accuracy but could not solve any cubes end-to-end.

| Experiment | Description | Move Acc | Solve Rate | Notes |
|---|---|---|---|---|
| 140918 | D=4 dim=256, structured 14-token, unconstrained | 21.8% | 0/256 | Baseline |
| 142741 | Constrained decoding, 256 episodes | 33.7% | 0/256 | |
| 143626 | +4096 episodes (38K examples) | 37.0% | 0/256 | |
| 144956 | Compact 3-token answer format | 43.9% | 0/256 | |
| 150805 | Flat state (no XML markers) | 44.2% | 0/256 | |

### Phase 2: Key Breakthroughs

| Experiment | Description | Solve Rate | Key Insight |
|---|---|---|---|
| exp6 | +action history(3) +no-inverse rule | 4/256 (1.6%) | First solves ever |
| exp10 | Joint MOVE\_face\_turn tokens (19-class) | 0/256 | Higher move acc (53%) but no solves without search |
| exp13 | Hybrid search + 2048 episodes | 12/256 (4.7%) | Search unlocks generalization |
| exp15 | 4096 episodes + hybrid search | 20/256 (7.8%) | More data helps |
| dagger1 | DAgger mid-training (2995 on-policy examples) | 40/256 (15.6%) | Biggest single lever |
| best | DAgger + auxiliary value head + residual search | 41/256 (16.0%) | Combined best techniques |
| rs100 | ROLLOUT\_MIN\_STEPS=100 | 56/256 (21.9%) | Model needed more eval steps |
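
The DAgger step behind dagger1 can be sketched as a short collection loop: roll the current policy out on fresh cubes, but record the teacher's move for every state visited, so the model learns corrections on its own state distribution. All callables below are hypothetical stand-ins for the project's actual functions:

```python
def dagger_collect(policy_move, teacher_move, random_cube, step_env,
                   n_cubes=200, max_steps=20):
    """Hedged sketch of a mid-training DAgger round. The policy chooses
    which states get visited, but every training label comes from the
    teacher. All four callables are assumed stand-ins, not real API."""
    dataset = []
    for _ in range(n_cubes):
        state = random_cube()
        for _ in range(max_steps):
            dataset.append((state, teacher_move(state)))  # teacher's correction
            action = policy_move(state)                    # but follow the policy
            if action == "DONE":
                break
            state = step_env(state, action)
    return dataset
```

With ~200 cubes and a dozen or so states per rollout, this yields on the order of the ~2500 on-policy examples the report describes, which are then mixed into the remaining training steps.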

### Phase 3: Scaling on MPS (MacBook)

| Experiment | Episodes | Time | Solve Rate | Notes |
|---|---|---|---|---|
| t1200e8k | 8192 | 20 min | 103/256 (40.2%) | More data + time wins |

### Phase 4: GPU Scaling (RTX 5090)

| Experiment | Model | Episodes | Time | Solve Rate | Notes |
|---|---|---|---|---|---|
| d8gpu | D=8, 25.4M params | 32K | 60 min | 239/256 (93.4%) | Larger model + more data |
| d8e64k | D=8, 25.4M params | 64K | 60 min | 256/256 (100%) | +ROLLOUT\_MIN\_STEPS=200 |

## What Worked

| Technique | Evidence | Impact |
|---|---|---|
| DAgger (mid-training on-policy data) | 20/256 -> 40/256 | ~2x, biggest single lever |
| Auxiliary value loss (weight 0.5) | 41/256 vs 9/256 ablation (noval) | ~4.5x multiplier |
| Joint MOVE tokens (single-token prediction) | Move acc 44% -> 53% | Simplified output space |
| Flat state encoding (no XML markers) | Shorter sequences, same accuracy | Faster training |
| Hybrid search (residual delta=2) | Unlocked first solves beyond greedy | Essential for generalization |
| Scaling data + compute together | 4K eps/10min -> 8K/20min -> 64K/60min | Consistent gains at every scale |
| Larger model on GPU (D=8, 25.4M params) | 40.2% -> 93.4% | Capacity was the bottleneck |
| ROLLOUT\_MIN\_STEPS=200 | 93.4% -> 100% | Model could solve more given time |
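
The auxiliary value loss in the table combines cross-entropy on the 19-class move head with a 0.5-weighted MSE on the value head's distance-to-goal prediction. A minimal PyTorch-style sketch, assuming tensor shapes and argument names the report does not specify:

```python
import torch
import torch.nn.functional as F

def combined_loss(move_logits, value_pred, move_target, dist_target,
                  value_weight=0.5):
    """Sketch of the report's training objective: policy cross-entropy plus
    a 0.5-weighted MSE value loss. Shapes are assumed: move_logits (B, 19),
    value_pred (B, 1), move_target (B,), dist_target (B,)."""
    ce = F.cross_entropy(move_logits, move_target)         # policy loss
    mse = F.mse_loss(value_pred.squeeze(-1), dist_target)  # value loss
    return ce + value_weight * mse
```

The vprimr ablation below suggests the weighting matters: pushing the value term above the cross-entropy term degraded the policy.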

## What Didn't Work

| Technique | Experiment | Result |
|---|---|---|
| Bigger models on MPS | exp 150142 (D=6) | Too slow, fewer epochs, no gain |
| Value-guided search on MPS | valsrch, mlpsrch | Per-call overhead made eval take >1h |
| Value-primary loss (weight > 0.5) | vprimr (0.2 CE + 1.0 MSE) | Policy degraded, solve rate dropped |
| More DAgger data or rounds | dagger2a (300 eps), 2rnd (two rounds) | Over-diluted training distribution |
| Curriculum learning | curr1, curr3 vs nocurr | No clear benefit over uniform sampling |
| Weight decay | wd01 (0.1) | Regularization hurt, dropped to 11/256 |
| SEARCH\_RESIDUAL\_DELTA=1 | rd1 | 0/256 -- too restrictive, blocks sacrifice moves |
| SEARCH\_RESIDUAL\_DELTA=3 | rd3 | No gain over delta=2, too permissive |
| Longer training on same data | t1200 (20min, 4K eps) | Overfitted (loss=0.27 but no generalization gain) |

## Architecture Details

| Component | Value |
|---|---|
| Architecture | GPT-style Transformer |
| Depth | 8 layers |
| Model dim | 512 |
| Heads | 8 (GQA) |
| Parameters | 25.4M |
| Optimizer | MuonAdamW (Muon for matrix params, AdamW for embeddings/scalars) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Value embeddings | ResFormer-style |
| Logit capping | Soft-capping at 15 |
| Compilation | torch.compile enabled on CUDA |

## Final Model Stats

| Metric | Value |
|---|---|
| Parameters | 25.4M |
| Training episodes | 64K (615K training examples + ~2500 DAgger examples) |
| Training steps | 51,547 |
| Training time | 60 minutes |
| Solve rate | 100% (256/256 held-out cubes) |
| Peak VRAM | 3.2 GB (RTX 5090) |
| MFU | 4.1% |
| Hardware | NVIDIA RTX 5090 |

## Full Experiment Log

Notable experiments from `results.tsv` (18 of the 28 total runs; the remaining ablations are covered in the tables above), ordered chronologically:

| # | Experiment | Solve Rate | Move Acc | Mean Residual | Steps | Status | Description |
|---|---|---|---|---|---|---|---|
| 1 | 140918 | 0.0% | 21.8% | 20.25 | 247 | keep | Baseline: D=4 dim=256, structured 14-tok, unconstrained |
| 2 | 142741 | 0.0% | 33.7% | 20.33 | 3596 | keep | Constrained decoding, 256 episodes |
| 3 | 143626 | 0.0% | 37.0% | 20.00 | 3632 | keep | +4096 episodes (38K examples) |
| 4 | 144956 | 0.0% | 43.9% | 19.95 | 5128 | keep | Compact 3-token format |
| 5 | 150142 | 0.0% | 34.9% | 19.92 | 3259 | discard | D=6 dim=384 10.7M -- too slow |
| 6 | 150805 | 0.0% | 44.2% | 19.48 | 4986 | keep | Flat state (no markers) |
| 7 | exp6 | 1.6% | 40.2% | 19.50 | 5326 | keep | +action history + no-inverse rule |
| 8 | exp10 | 0.0% | 53.0% | -- | 5913 | keep | Joint MOVE tokens (19-class) |
| 9 | exp13 | 4.7% | 52.0% | 10.62 | 6158 | keep | Hybrid search + 2048 episodes |
| 10 | exp15 | 7.8% | 50.0% | 10.52 | 4774 | keep | 4096 episodes + hybrid search |
| 11 | dagger1 | 15.6% | 61.5% | 9.15 | 5410 | keep | DAgger mid-training (2995 on-policy examples) |
| 12 | valaux | 11.3% | 57.1% | 9.54 | 4879 | keep | Value head as auxiliary loss |
| 13 | best | 16.0% | 62.6% | 8.97 | 5196 | keep | DAgger + aux value + residual search |
| 14 | lr10 | 16.0% | 63.0% | 8.21 | 5004 | keep | MATRIX\_LR=0.10 (best mean residual) |
| 15 | rs100 | 21.9% | 61.4% | 7.80 | 4158 | keep | ROLLOUT\_MIN\_STEPS=100 |
| 16 | t1200e8k | 40.2% | 67.9% | 5.95 | 19660 | keep | 8K eps + 20min (MPS scaling) |
| 17 | d8gpu | 93.4% | 82.6% | -- | 50078 | keep | D=8 + 32K eps + 60min (RTX 5090) |
| 18 | d8e64k | 100.0% | 84.0% | -- | 51547 | keep | D=8 + 64K eps + ROLLOUT\_MIN\_STEPS=200 |