juiceb0xc0de posted an update 10 days ago
Okay, I may have been talking out of my ass about my scheduler using less VRAM than a full fine-tune (FFT). What I did find, though: training only ~30% of the model's weights per step consistently beat dense SFT on Hendrycks MATH across 3 different seeds.

What makes it interesting isn't just the sparsity: no two consecutive windows share the same active layers, so the model never has a stable path from input to output. Adjacent layers are rarely both alive at the same time either, which keeps the model from building shortcuts between them. I originally started this to reduce semantic redundancy across layers and stumbled onto something I didn't expect. A rough sketch of the windowing logic is below.
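
Here's a minimal sketch of the windowed layer picking as I read it. This is my own illustration, not code from the repo: the function names are hypothetical, and I'm treating the adjacency rule as a soft constraint (a window can come up a layer short rather than violate it).

```python
import random

def pick_active_layers(num_layers, frac=0.3, prev_active=None, rng=None):
    """Pick ~frac of the transformer blocks to train in the next window.

    Two constraints, per the post:
      1. no overlap with the previous window's active set, and
      2. avoid waking two adjacent layers in the same window (soft: if the
         constraints leave too few candidates, the window just trains fewer).
    """
    rng = rng or random.Random()
    k = max(1, round(num_layers * frac))
    blocked = set(prev_active or ())       # rule 1: previous window is off-limits
    candidates = [i for i in range(num_layers) if i not in blocked]
    rng.shuffle(candidates)
    active = []
    for i in candidates:
        if i in blocked:                   # neighbor of an earlier pick this window
            continue
        active.append(i)
        blocked.update({i - 1, i, i + 1})  # rule 2: block both neighbors
        if len(active) == k:
            break
    return sorted(active)

def apply_window(model_layers, active):
    """Freeze every block, then unfreeze only the chosen ones."""
    active = set(active)
    for idx, layer in enumerate(model_layers):
        for p in layer.parameters():
            p.requires_grad = idx in active

# Per window (say, every N optimizer steps):
#   blocks = model.model.layers   # Qwen2-style block list in transformers
#   active = pick_active_layers(len(blocks), prev_active=prev)
#   apply_window(blocks, active)
#   prev = active
```

With frac=0.3, roughly a third of the blocks train per window; since consecutive windows must be disjoint and neighbors get blocked, a window occasionally trains slightly fewer layers, which matches "rarely" rather than "never" above.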

Results (0-shot, hendrycks_math exact match):

Dense SFT baseline: 0.0098
DeepChaos seed 1: 0.0142 (+45%)
DeepChaos seed 2: 0.0156 (+59%)
DeepChaos seed 3: 0.0138 (+41%)

Setup: Qwen2.5-3B-Instruct, simplescaling/s1K (1k reasoning traces), 5 epochs, LR 1e-5, adamw_torch_fused optimizer, cosine LR schedule combined with my lucky pick scheduler, on an AMD MI300X (192 GB).
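
If you want to line those hyperparameters up against a standard transformers run, they map onto TrainingArguments roughly like this. This is my sketch of the dense-baseline config, not the repo's actual training script; output_dir and bf16 are assumed defaults, not from the post.

```python
from transformers import TrainingArguments

# Hyperparameters taken from the post; output_dir and bf16 are my own
# plausible defaults, not from the repo.
args = TrainingArguments(
    output_dir="qwen25-3b-s1k-sft",  # hypothetical
    num_train_epochs=5,
    learning_rate=1e-5,
    optim="adamw_torch_fused",
    lr_scheduler_type="cosine",
    bf16=True,                       # assumption: MI300X handles bf16 fine
)
```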

The scheduler is still a work in progress, but the current version is fully operational. You can check it out at:
https://github.com/JuiceB0xC0de/lucky-pick-scheduler

I would love to hear your experiences with sparsity training!