---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM-135M
datasets:
- openai/gsm8k
- meta-math/MetaMathQA
- AI-MO/NuminaMath-1.5
tags:
- math
- reasoning
- efficient-training
- cggr
- sparse-gradients
model_name: SmolLM-135M-CGGR-Math
---

# SmolLM-135M-CGGR-Math

This model is a specialized version of **HuggingFaceTB/SmolLM-135M**, fine-tuned for mathematical reasoning using **Confidence-Gated Gradient Routing (CGGR)**.

## The CGGR Breakthrough

This model was trained with a novel strategy that selects only the "hardest" tokens in each batch for gradient updates (sketched below), allowing for:

- **4.08x Higher Throughput:** Processing 4x more data in the same wall-clock time compared to standard training.
- **66% VRAM Savings:** Fitting large-batch training on consumer hardware (RTX 3060).
- **Superior Convergence:** Achieving a **+19% relative accuracy improvement** on math reasoning tasks (AIME 2024) compared to standard fine-tuning.
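
The full implementation lives in the CGGR repository linked under Citation; the following is only a minimal PyTorch sketch of the fixed-quota routing idea, with hypothetical names: score each token by its per-token cross-entropy, keep the top 25% highest-loss (least confident) tokens, and backpropagate only their loss.

```python
import torch
import torch.nn.functional as F

def cggr_loss(logits: torch.Tensor, labels: torch.Tensor, quota: float = 0.25) -> torch.Tensor:
    """Fixed-quota confidence gating: backprop only the hardest tokens.

    Illustrative sketch only; not the repository's actual API.
    """
    # Causal LM shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    # Per-token cross-entropy with no reduction; -100 marks padding.
    per_token = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
        reduction="none",
    )

    # Drop padding positions before ranking tokens by difficulty.
    per_token = per_token[shift_labels.view(-1) != -100]

    # Fixed quota: keep the top 25% highest-loss ("hardest") tokens.
    k = max(1, int(quota * per_token.numel()))
    hardest, _ = torch.topk(per_token, k)

    # Only the selected tokens contribute gradients.
    return hardest.mean()
```

Note that masking the loss this way zeroes the easy tokens' gradient contributions but still stores their activations; the VRAM savings reported above would additionally require skipping activation storage for unselected tokens, which this sketch does not attempt.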

### Benchmark Results (6-Hour Training Race)

| Metric | Standard (Baseline) | CGGR (Ours) | Relative Gain |
| :-------------------------------------- | :------------------ | :---------- | :------------ |
| **Solving Accuracy (AIME 2024)** | 8.00% | **9.50%** | **+18.75%** |
| **Training Throughput (samples / 6 h)** | 14,368 | **58,716** | **+308%** |
| **Final Loss (lower is better)** | 0.3610 | **0.0980** | **-73%** |
| **Max Batch Size (12 GB VRAM)** | 18 | **69** | **3.8x** |

## Performance Visuals



## Model Details

- **Architecture:** Transformer decoder (SmolLM-135M)
- **Training Method:** Confidence-Gated Gradient Routing (CGGR)
- **Selection Strategy:** Fixed quota (top 25% hardest tokens)
- **Compute:** A single NVIDIA RTX 3060 (12 GB)
- **Duration:** 6 hours total

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint and its tokenizer from the Hub.
model_name = "MinimaML/SmolLM-135M-CGGR-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Math problems are prompted in Question/Answer format.
prompt = "Question: If x + y = 10 and x - y = 2, what is the value of x?\n\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
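
For the prompt above, the expected completion is x = 6: adding the two equations gives 2x = 12, so x = 6.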

## Citation

If you use this model or the CGGR technique in your research, please cite:

```bibtex
@software{cggr2026,
  title={CGGR: Confidence-Gated Gradient Routing},
  author={MinimaML},
  year={2026},
  url={https://github.com/MinimaML/CGGR}
}
```