Training Report: 2x2 Rubik's Cube Solver via Imitation Learning
Key Facts
| Property | Value |
|---|---|
| Model | Transformer (GPT-style), D=8, dim=512, 8 heads, 25.4M params |
| Task | 2x2 Rubik's Cube solving via imitation learning |
| Input | 24 sticker colors (flat encoding) + last 3 moves as history |
| Output | Single token from 19 classes (18 MOVE_face_turn + DONE) |
| Training | Supervised (cross-entropy) + auxiliary value head (MSE, weight 0.5) predicting distance-to-goal; see the loss sketch below |
| DAgger | Mid-training on-policy collection at the 50% mark: roll out on 200 random cubes, query the teacher for corrections, and add ~2500 on-policy examples |
| Teacher | dwalton76/rubiks-cube-NxNxN-solver (optimal solver) |
| Evaluation | Rollout on 256 held-out scrambled cubes with hybrid search (model score + residual heuristic + state avoidance + no-inverse rule) |
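The table above summarizes the objective but not the code. As a rough illustration, the 19-class move vocabulary and the combined policy/value loss (cross-entropy plus MSE weighted 0.5, per the rows above) could look like the sketch below. All identifiers (`MOVES`, `imitation_loss`, `policy_logits`, `value_pred`, `dist_to_goal`) are illustrative, not the project's actual names.

```python
import torch.nn.functional as F

# 19-class output vocabulary: 6 faces x 3 turn directions plus DONE.
# Illustrative reconstruction -- not copied from the project's code.
FACES = ["U", "D", "L", "R", "F", "B"]
TURNS = ["", "'", "2"]          # quarter turn, inverse quarter turn, half turn
MOVES = [f + t for f in FACES for t in TURNS] + ["DONE"]
assert len(MOVES) == 19

def imitation_loss(policy_logits, value_pred, target_move, dist_to_goal,
                   value_weight=0.5):
    """Cross-entropy on the teacher's next move plus an auxiliary MSE loss on
    the predicted distance-to-goal, weighted 0.5 as in the Key Facts table."""
    ce = F.cross_entropy(policy_logits, target_move)        # (B, 19) vs (B,)
    mse = F.mse_loss(value_pred.squeeze(-1), dist_to_goal.float())
    return ce + value_weight * mse
```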
Results Progression
Phase 1: Early Experiments (0% solve rate)
These runs used structured token formats with unconstrained or constrained decoding. The model learned to predict moves with increasing accuracy but could not solve any cubes end-to-end.
| Experiment | Description | Move Acc | Solve Rate | Notes |
|---|---|---|---|---|
| 140918 | D=4 dim=256, structured 14-token, unconstrained | 21.8% | 0/256 | Baseline |
| 142741 | Constrained decoding, 256 episodes | 33.7% | 0/256 | |
| 143626 | +4096 episodes (38K examples) | 37.0% | 0/256 | |
| 144956 | Compact 3-token answer format | 43.9% | 0/256 | |
| 150805 | Flat state (no XML markers) | 44.2% | 0/256 | |
Phase 2: Key Breakthroughs
| Experiment | Description | Solve Rate | Key Insight |
|---|---|---|---|
| exp6 | +action history(3) +no-inverse rule | 4/256 (1.6%) | First solves ever |
| exp10 | Joint MOVE_face_turn tokens (19-class) | 0/256 | Higher move acc (53%) but no solves without search |
| exp13 | Hybrid search + 2048 episodes | 12/256 (4.7%) | Search unlocks generalization |
| exp15 | 4096 episodes + hybrid search | 20/256 (7.8%) | More data helps |
| dagger1 | DAgger mid-training (2995 on-policy examples; collection sketch below) | 40/256 (15.6%) | Biggest single lever |
| best | DAgger + auxiliary value head + residual search | 41/256 (16.0%) | Combined best techniques |
| rs100 | ROLLOUT_MIN_STEPS=100 | 56/256 (21.9%) | Model needed more eval steps |
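The dagger1 row above is the mid-training DAgger step from Key Facts: roll the current policy out on fresh scrambles and label every visited state with the teacher's move, so corrections cover the states the model actually reaches. A minimal sketch, assuming hypothetical helpers `scramble`, `is_solved`, `apply`, `policy_move`, and `teacher_move` (the last wrapping the dwalton76 solver); `n_cubes=200` comes from Key Facts, `max_steps` is an assumed cap.

```python
def collect_dagger_examples(model, n_cubes=200, max_steps=30):
    """On-policy DAgger collection: act with the current model, but record the
    teacher's correction for every state visited along the way.
    scramble / is_solved / apply / policy_move / teacher_move are hypothetical
    helpers standing in for the project's cube utilities."""
    examples = []
    for _ in range(n_cubes):
        state, history = scramble(), []
        for _ in range(max_steps):
            if is_solved(state):
                break
            expert = teacher_move(state)                     # optimal solver's label
            examples.append((state, tuple(history[-3:]), expert))
            move = policy_move(model, state, history[-3:])   # but follow the model's own move
            state = apply(state, move)
            history.append(move)
    return examples   # ~2500 examples at these settings, mixed back into training
```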
Phase 3: Scaling on MPS (MacBook)
| Experiment | Episodes | Time | Solve Rate | Notes |
|---|---|---|---|---|
| t1200e8k | 8192 | 20 min | 103/256 (40.2%) | More data + time wins |
Phase 4: GPU Scaling (RTX 5090)
| Experiment | Model | Episodes | Time | Solve Rate | Notes |
|---|---|---|---|---|---|
| d8gpu | D=8, 25.4M params | 32K | 60 min | 239/256 (93.4%) | Larger model + more data |
| d8e64k | D=8, 25.4M params | 64K | 60 min | 256/256 (100%) | +ROLLOUT_MIN_STEPS=200 |
What Worked
| Technique | Evidence | Impact |
|---|---|---|
| DAgger (mid-training on-policy data) | 20/256 -> 40/256 | ~2x, biggest single lever |
| Auxiliary value loss (weight 0.5) | 41/256 vs 9/256 ablation (noval) | ~4.5x over the no-value ablation |
| Joint MOVE tokens (single token prediction) | Move acc 44% -> 53% | Simplified output space |
| Flat state encoding (no XML markers) | Shorter sequences, same accuracy | Faster training |
| Hybrid search (residual delta=2) | Unlocked first solves beyond greedy | Essential for generalization; selection sketch below |
| Scaling data + compute together | 4K eps/10min -> 8K/20min -> 64K/60min | Consistent gains at every scale |
| Larger model on GPU (D=8, 25.4M params) | 40.2% -> 93.4% | Capacity was the bottleneck |
| ROLLOUT_MIN_STEPS=200 | 93.4% -> 100% | Model could solve more given time |
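The hybrid search credited above is described in Key Facts as model score + residual heuristic + state avoidance + no-inverse rule; the report does not show its code. A hedged sketch of how those pieces could compose at each rollout step, with hypothetical helpers `score_moves`, `predict_residual`, `apply`, and `inverse_of`:

```python
def pick_move(model, state, last_move, visited, residual_delta=2):
    """Hedged sketch of hybrid move selection at eval time:
      - no-inverse rule: never immediately undo the previous move,
      - state avoidance: skip moves that revisit an already-seen state,
      - residual filter: keep only moves whose predicted distance-to-goal is
        within residual_delta of the best candidate,
      - among survivors, take the model's highest-probability move.
    score_moves / predict_residual / apply / inverse_of are hypothetical
    helpers, not the project's actual API."""
    probs = score_moves(model, state)                        # dict: move -> probability
    candidates = [m for m in probs if m != inverse_of(last_move)]
    candidates = [m for m in candidates if apply(state, m) not in visited]
    if not candidates:
        return None
    residuals = {m: predict_residual(model, apply(state, m)) for m in candidates}
    best = min(residuals.values())
    allowed = [m for m in candidates if residuals[m] <= best + residual_delta]
    return max(allowed, key=lambda m: probs[m])
```

Read this way, SEARCH_RESIDUAL_DELTA=1 prunes any move whose predicted distance-to-goal is more than one worse than the best candidate, which matches the rd1 note below that it blocks sacrifice moves, while delta=3 admits almost everything and gave no gain over delta=2.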
What Didn't Work
| Technique | Experiment | Result |
|---|---|---|
| Bigger models on MPS | exp 150142 (D=6) | Too slow, fewer epochs, no gain |
| Value-guided search on MPS | valsrch, mlpsrch | Per-call overhead made eval take >1h |
| Value-primary loss (weight > 0.5) | vprimr (0.2 CE + 1.0 MSE) | Policy degraded, solve rate dropped |
| More DAgger data or rounds | dagger2a (300 eps), 2rnd (two rounds) | Over-diluted training distribution |
| Curriculum learning | curr1, curr3 vs nocurr | No clear benefit over uniform sampling |
| Weight decay | wd01 (0.1) | Regularization hurt, dropped to 11/256 |
| SEARCH_RESIDUAL_DELTA=1 | rd1 | 0/256 -- too restrictive, blocks sacrifice moves |
| SEARCH_RESIDUAL_DELTA=3 | rd3 | No gain over delta=2, too permissive |
| Longer training on same data | t1200 (20min, 4K eps) | Overfitted (loss=0.27 but no generalization gain) |
Architecture Details
| Component | Value |
|---|---|
| Architecture | GPT-style Transformer |
| Depth | 8 layers |
| Model dim | 512 |
| Heads | 8 (GQA) |
| Parameters | 25.4M |
| Optimizer | MuonAdamW (Muon for matrix params, AdamW for embeddings/scalars) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Value embeddings | ResFormer-style |
| Logit capping | Soft-capping at 15 (sketch below) |
| Compilation | torch.compile enabled on CUDA |
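Logit soft-capping smoothly bounds the output logits instead of hard-clipping them. The usual scaled-tanh form with the table's cap of 15 is sketched below; the exact formula is an assumption, since the report only states the cap value.

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    """Smoothly bound logits to (-cap, cap) with a scaled tanh; unlike a hard
    clip, gradients stay nonzero everywhere."""
    return cap * torch.tanh(logits / cap)
```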
Final Model Stats
| Metric | Value |
|---|---|
| Parameters | 25.4M |
| Training episodes | 64K (615K training examples + ~2500 DAgger examples) |
| Training steps | 51,547 |
| Training time | 60 minutes |
| Solve rate | 100% (256/256 held-out cubes) |
| Peak VRAM | 3.2 GB (RTX 5090) |
| MFU | 4.1% (see estimate below) |
| Hardware | NVIDIA RTX 5090 |
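MFU here is presumably the standard model-FLOPs-utilization estimate: achieved training FLOPs divided by the accelerator's peak throughput. A sketch of that arithmetic; `tokens_per_sec` and `peak_flops_per_sec` are not reported, so the function is illustrative only.

```python
def model_flops_utilization(n_params, tokens_per_sec, peak_flops_per_sec):
    """Standard MFU estimate: ~6 FLOPs per parameter per token for a combined
    forward+backward pass, divided by the hardware's peak FLOPs/s."""
    return 6.0 * n_params * tokens_per_sec / peak_flops_per_sec
```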
Full Experiment Log
Key experiments from results.tsv (18 of the 28 runs), ordered chronologically:
| # | Experiment | Solve Rate | Move Acc | Mean Residual | Steps | Status | Description |
|---|---|---|---|---|---|---|---|
| 1 | 140918 | 0.0% | 21.8% | 20.25 | 247 | keep | Baseline: D=4 dim=256, structured 14-tok, unconstrained |
| 2 | 142741 | 0.0% | 33.7% | 20.33 | 3596 | keep | Constrained decoding, 256 episodes |
| 3 | 143626 | 0.0% | 37.0% | 20.00 | 3632 | keep | +4096 episodes (38K examples) |
| 4 | 144956 | 0.0% | 43.9% | 19.95 | 5128 | keep | Compact 3-token format |
| 5 | 150142 | 0.0% | 34.9% | 19.92 | 3259 | discard | D=6 dim=384 10.7M -- too slow |
| 6 | 150805 | 0.0% | 44.2% | 19.48 | 4986 | keep | Flat state (no markers) |
| 7 | exp6 | 1.6% | 40.2% | 19.50 | 5326 | keep | +action history + no-inverse rule |
| 8 | exp10 | 0.0% | 53.0% | -- | 5913 | keep | Joint MOVE tokens (19-class) |
| 9 | exp13 | 4.7% | 52.0% | 10.62 | 6158 | keep | Hybrid search + 2048 episodes |
| 10 | exp15 | 7.8% | 50.0% | 10.52 | 4774 | keep | 4096 episodes + hybrid search |
| 11 | dagger1 | 15.6% | 61.5% | 9.15 | 5410 | keep | DAgger mid-training (2995 on-policy examples) |
| 12 | valaux | 11.3% | 57.1% | 9.54 | 4879 | keep | Value head as auxiliary loss |
| 13 | best | 16.0% | 62.6% | 8.97 | 5196 | keep | DAgger + aux value + residual search |
| 14 | lr10 | 16.0% | 63.0% | 8.21 | 5004 | keep | MATRIX_LR=0.10 (best mean residual) |
| 15 | rs100 | 21.9% | 61.4% | 7.80 | 4158 | keep | ROLLOUT_MIN_STEPS=100 |
| 16 | t1200e8k | 40.2% | 67.9% | 5.95 | 19660 | keep | 8K eps + 20min (MPS scaling) |
| 17 | d8gpu | 93.4% | 82.6% | -- | 50078 | keep | D=8 + 32K eps + 60min (RTX 5090) |
| 18 | d8e64k | 100.0% | 84.0% | -- | 51547 | keep | D=8 + 64K eps + ROLLOUT_MIN_STEPS=200 |