# Training Report: 2x2 Rubik's Cube Solver via Imitation Learning

## Key Facts

| Property | Value |
|---|---|
| Model | Transformer (GPT-style), D=8, dim=512, 8 heads, 25.4M params |
| Task | 2x2 Rubik's Cube solving via imitation learning |
| Input | 24 sticker colors (flat encoding) + last 3 moves as history |
| Output | Single token from 19 classes (18 MOVE\_face\_turn + DONE) |
| Training | Supervised (cross-entropy) + auxiliary value head (MSE, weight 0.5) predicting distance-to-goal |
| DAgger | Mid-training on-policy collection at 50% of training: rollout on 200 random cubes, query teacher for corrections, adds ~2500 on-policy examples |
| Teacher | dwalton76/rubiks-cube-NxNxN-solver (optimal solver) |
| Evaluation | Rollout on 256 held-out scrambled cubes with hybrid search (model score + residual heuristic + state avoidance + no-inverse rule) |

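The input/output interface above can be sketched concretely. Everything below (`FACES`, `TURNS`, the `C{n}` color tokens, `encode_example`) is an illustrative assumption about the encoding, not the repo's actual code:

```python
# Hypothetical sketch of the flat input encoding and the 19-class output
# space. Token names and vocabulary layout are illustrative assumptions.

FACES = ["U", "D", "L", "R", "F", "B"]
TURNS = ["CW", "CCW", "180"]

# 18 MOVE_face_turn tokens plus a terminal DONE token -> 19 output classes.
MOVE_TOKENS = [f"MOVE_{f}_{t}" for f in FACES for t in TURNS]
OUTPUT_VOCAB = MOVE_TOKENS + ["DONE"]

def encode_example(stickers, history):
    """Flatten 24 sticker colors (ints 0-5) plus the last 3 moves as history."""
    assert len(stickers) == 24
    hist = (["PAD"] * 3 + history)[-3:]        # keep exactly the last 3 moves
    return [f"C{c}" for c in stickers] + hist  # 24 state tokens + 3 history tokens
```

With this layout every training example is a fixed 27-token input and a single-token target, which is what makes the joint 19-class prediction head possible.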
## Results Progression

### Phase 1: Early Experiments (0% solve rate)

These runs used structured token formats with unconstrained or constrained decoding. The model learned to predict individual moves with increasing accuracy but could not solve any cube end-to-end.

| Experiment | Description | Move Acc | Solve Rate | Notes |
|---|---|---|---|---|
| 140918 | D=4 dim=256, structured 14-token, unconstrained | 21.8% | 0/256 | Baseline |
| 142741 | Constrained decoding, 256 episodes | 33.7% | 0/256 | |
| 143626 | +4096 episodes (38K examples) | 37.0% | 0/256 | |
| 144956 | Compact 3-token answer format | 43.9% | 0/256 | |
| 150805 | Flat state (no XML markers) | 44.2% | 0/256 | |

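Constrained decoding (experiment 142741 onward) means masking the output distribution so only legal answer tokens can be emitted. A minimal sketch, assuming a flat logit list and a precomputed set of legal token ids:

```python
import math

def constrain_logits(logits, legal_ids):
    """Set every illegal token's logit to -inf so softmax assigns it zero mass."""
    masked = [-math.inf] * len(logits)
    for i in legal_ids:
        masked[i] = logits[i]
    return masked

def greedy_pick(logits, legal_ids):
    """Greedy decoding over the constrained distribution."""
    masked = constrain_logits(logits, legal_ids)
    return max(range(len(masked)), key=lambda i: masked[i])
```

The jump from 21.8% to 33.7% move accuracy is consistent with this: the model no longer wastes probability mass on tokens that can never be a valid answer.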
### Phase 2: Key Breakthroughs

| Experiment | Description | Solve Rate | Key Insight |
|---|---|---|---|
| exp6 | + action history (3) + no-inverse rule | 4/256 (1.6%) | First solves ever |
| exp10 | Joint MOVE\_face\_turn tokens (19-class) | 0/256 | Higher move acc (53%) but no solves without search |
| exp13 | Hybrid search + 2048 episodes | 12/256 (4.7%) | Search unlocks generalization |
| exp15 | 4096 episodes + hybrid search | 20/256 (7.8%) | More data helps |
| dagger1 | DAgger mid-training (2995 on-policy examples) | 40/256 (15.6%) | Biggest single lever |
| best | DAgger + auxiliary value head + residual search | 41/256 (16.0%) | Combined best techniques |
| rs100 | ROLLOUT\_MIN\_STEPS=100 | 56/256 (21.9%) | Model needed more eval steps |

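One hybrid-search rollout step combines the policy's score with the value head's residual (distance-to-goal) estimate, skips already-visited states, and forbids the move that inverts the previous one. All names below are illustrative, and the way `delta` gates the residual heuristic is an assumption inferred from the rd1/rd3 ablations later in this report (delta=1 too restrictive, delta=3 too permissive):

```python
# Hypothetical sketch of one hybrid-search step; not the repo's actual code.

def inverse_of(move):
    """CW and CCW turns of a face invert each other; a 180 turn inverts itself."""
    face, turn = move.rsplit("_", 1)           # "MOVE_U_CW" -> ("MOVE_U", "CW")
    flip = {"CW": "CCW", "CCW": "CW", "180": "180"}
    return f"{face}_{flip[turn]}"

def pick_move(state, candidates, last_move, visited, apply_move,
              policy_score, residual, delta=2):
    scored = []
    for move in candidates:
        if last_move is not None and move == inverse_of(last_move):
            continue                            # no-inverse rule
        nxt = apply_move(state, move)
        if nxt in visited:
            continue                            # state avoidance
        scored.append((move, policy_score(move), residual(nxt)))
    if not scored:
        return None
    best_res = min(r for _, _, r in scored)
    # Keep moves whose estimated distance-to-goal is within delta of the best,
    # then let the policy choose among them.
    viable = [(m, p) for m, p, r in scored if r <= best_res + delta]
    return max(viable, key=lambda mp: mp[1])[0]
```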
### Phase 3: Scaling on MPS (MacBook)

| Experiment | Episodes | Time | Solve Rate | Notes |
|---|---|---|---|---|
| t1200e8k | 8192 | 20 min | 103/256 (40.2%) | More data + time wins |

### Phase 4: GPU Scaling (RTX 5090)

| Experiment | Model | Episodes | Time | Solve Rate | Notes |
|---|---|---|---|---|---|
| d8gpu | D=8, 25.4M params | 32K | 60 min | 239/256 (93.4%) | Larger model + more data |
| d8e64k | D=8, 25.4M params | 64K | 60 min | 256/256 (100%) | +ROLLOUT\_MIN\_STEPS=200 |

## What Worked

| Technique | Evidence | Impact |
|---|---|---|
| DAgger (mid-training on-policy data) | 20/256 -> 40/256 | ~2x, biggest single lever |
| Auxiliary value loss (weight 0.5) | 41/256 vs 9/256 ablation (noval) | ~4.5x multiplier |
| Joint MOVE tokens (single-token prediction) | Move acc 44% -> 53% | Simplified output space |
| Flat state encoding (no XML markers) | Shorter sequences, same accuracy | Faster training |
| Hybrid search (residual delta=2) | Unlocked first solves beyond greedy | Essential for generalization |
| Scaling data + compute together | 4K eps/10min -> 8K eps/20min -> 64K eps/60min | Consistent gains at every scale |
| Larger model on GPU (D=8, 25.4M params) | 40.2% -> 93.4% | Capacity was the bottleneck |
| ROLLOUT\_MIN\_STEPS=200 | 93.4% -> 100% | Model could solve more given time |

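The DAgger pass, the biggest single lever above, can be sketched as follows; the `policy`, `teacher`, `apply_move`, and `scramble` interfaces are assumptions for illustration:

```python
# Hypothetical sketch of the mid-training DAgger pass: roll out the *current*
# policy, but label every visited state with the teacher's move.

def dagger_collect(policy, teacher, apply_move, scramble,
                   n_cubes=200, max_steps=20):
    """Return (state, expert_move) pairs gathered on the policy's own
    visitation distribution -- the states it actually reaches."""
    dataset = []
    for _ in range(n_cubes):
        state = scramble()
        for _ in range(max_steps):
            expert_move = teacher(state)            # optimal solver's correction
            dataset.append((state, expert_move))
            if expert_move == "DONE":
                break
            state = apply_move(state, policy(state))  # follow the policy, not the teacher
    return dataset
```

The key design point is the last line: transitions follow the (imperfect) policy, so the teacher's corrections cover exactly the off-distribution states where the model gets lost.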
## What Didn't Work

| Technique | Experiment | Result |
|---|---|---|
| Bigger models on MPS | exp 150142 (D=6) | Too slow, fewer epochs, no gain |
| Value-guided search on MPS | valsrch, mlpsrch | Per-call overhead made eval take >1h |
| Value-primary loss (weight > 0.5) | vprimr (0.2 CE + 1.0 MSE) | Policy degraded, solve rate dropped |
| More DAgger data or rounds | dagger2a (300 eps), 2rnd (two rounds) | Over-diluted the training distribution |
| Curriculum learning | curr1, curr3 vs nocurr | No clear benefit over uniform sampling |
| Weight decay | wd01 (0.1) | Regularization hurt; dropped to 11/256 |
| SEARCH\_RESIDUAL\_DELTA=1 | rd1 | 0/256 -- too restrictive, blocks sacrifice moves |
| SEARCH\_RESIDUAL\_DELTA=3 | rd3 | No gain over delta=2, too permissive |
| Longer training on same data | t1200 (20 min, 4K eps) | Overfit (loss=0.27 but no generalization gain) |

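The vprimr failure is about how the two losses are weighted. A sketch of the combined objective as stated in Key Facts (CE on the move head plus 0.5x MSE on the value head), written in plain Python for a single example:

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example (log-sum-exp minus target logit)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

def combined_loss(policy_logits, target_move, value_pred, distance_to_goal,
                  value_weight=0.5):
    """CE on the 19-class move head + weighted MSE on distance-to-goal.
    Per the vprimr ablation, pushing value_weight past 0.5 (or down-weighting
    CE) degrades the policy."""
    ce = cross_entropy(policy_logits, target_move)
    mse = (value_pred - distance_to_goal) ** 2
    return ce + value_weight * mse
```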
## Architecture Details

| Component | Value |
|---|---|
| Architecture | GPT-style Transformer |
| Depth | 8 layers |
| Model dim | 512 |
| Heads | 8 (GQA) |
| Parameters | 25.4M |
| Optimizer | MuonAdamW (Muon for matrix params, AdamW for embeddings/scalars) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Value embeddings | ResFormer-style |
| Logit capping | Soft-capping at 15 |
| Compilation | torch.compile enabled on CUDA |

## Final Model Stats

| Metric | Value |
|---|---|
| Parameters | 25.4M |
| Training episodes | 64K (615K training examples + ~2500 DAgger examples) |
| Training steps | 51,547 |
| Training time | 60 minutes |
| Solve rate | 100% (256/256 held-out cubes) |
| Peak VRAM | 3.2 GB (RTX 5090) |
| MFU | 4.1% |
| Hardware | NVIDIA RTX 5090 |

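MFU (model FLOPs utilization) compares achieved training FLOPs/s against the accelerator's peak. A sketch using the standard ~6N FLOPs-per-trained-token approximation; the throughput and peak-FLOPs numbers in the test are illustrative assumptions, not measurements from this run:

```python
def mfu(params, tokens_per_sec, peak_flops_per_sec):
    """Approximate model FLOPs utilization: ~6*N FLOPs per trained token
    (forward + backward), divided by the hardware's peak FLOPs/s."""
    achieved = 6 * params * tokens_per_sec
    return achieved / peak_flops_per_sec
```

A low MFU like 4.1% is typical for a small model with short sequences, where kernel launch and data-loading overheads dominate over matmul time.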
## Full Experiment Log

Experiments from `results.tsv`, ordered chronologically (18 of the 28 logged runs shown):

| # | Experiment | Solve Rate | Move Acc | Mean Residual | Steps | Status | Description |
|---|---|---|---|---|---|---|---|
| 1 | 140918 | 0.0% | 21.8% | 20.25 | 247 | keep | Baseline: D=4 dim=256, structured 14-tok, unconstrained |
| 2 | 142741 | 0.0% | 33.7% | 20.33 | 3596 | keep | Constrained decoding, 256 episodes |
| 3 | 143626 | 0.0% | 37.0% | 20.00 | 3632 | keep | +4096 episodes (38K examples) |
| 4 | 144956 | 0.0% | 43.9% | 19.95 | 5128 | keep | Compact 3-token format |
| 5 | 150142 | 0.0% | 34.9% | 19.92 | 3259 | discard | D=6 dim=384 10.7M -- too slow |
| 6 | 150805 | 0.0% | 44.2% | 19.48 | 4986 | keep | Flat state (no markers) |
| 7 | exp6 | 1.6% | 40.2% | 19.50 | 5326 | keep | +action history + no-inverse rule |
| 8 | exp10 | 0.0% | 53.0% | -- | 5913 | keep | Joint MOVE tokens (19-class) |
| 9 | exp13 | 4.7% | 52.0% | 10.62 | 6158 | keep | Hybrid search + 2048 episodes |
| 10 | exp15 | 7.8% | 50.0% | 10.52 | 4774 | keep | 4096 episodes + hybrid search |
| 11 | dagger1 | 15.6% | 61.5% | 9.15 | 5410 | keep | DAgger mid-training (2995 on-policy examples) |
| 12 | valaux | 11.3% | 57.1% | 9.54 | 4879 | keep | Value head as auxiliary loss |
| 13 | best | 16.0% | 62.6% | 8.97 | 5196 | keep | DAgger + aux value + residual search |
| 14 | lr10 | 16.0% | 63.0% | 8.21 | 5004 | keep | MATRIX\_LR=0.10 (best mean residual) |
| 15 | rs100 | 21.9% | 61.4% | 7.80 | 4158 | keep | ROLLOUT\_MIN\_STEPS=100 |
| 16 | t1200e8k | 40.2% | 67.9% | 5.95 | 19660 | keep | 8K eps + 20min (MPS scaling) |
| 17 | d8gpu | 93.4% | 82.6% | -- | 50078 | keep | D=8 + 32K eps + 60min (RTX 5090) |
| 18 | d8e64k | 100.0% | 84.0% | -- | 51547 | keep | D=8 + 64K eps + ROLLOUT\_MIN\_STEPS=200 |