---

license: apache-2.0
base_model: HuggingFaceTB/SmolLM-135M
datasets:
- openai/gsm8k
- meta-math/MetaMathQA
- AI-MO/NuminaMath-1.5
tags:
- math
- reasoning
- efficient-training
- cggr
- sparse-gradients
model_name: SmolLM-135M-CGGR-Math
---


# SmolLM-135M-CGGR-Math

This model is a specialized version of **HuggingFaceTB/SmolLM-135M**, fine-tuned for mathematical reasoning using **Confidence-Gated Gradient Routing (CGGR)**.

## 🚀 The CGGR Breakthrough

This model was trained with a routing strategy that selects only the "hardest" (lowest-confidence) tokens in each batch for gradient updates, enabling:
- **4.08x Higher Throughput:** Roughly 4x more samples processed in the same wall-clock time as standard fine-tuning.
- **66% VRAM Savings:** Large-batch training fits on consumer hardware (a single RTX 3060, 12GB).
- **Superior Convergence:** A **+19% relative accuracy improvement** on math reasoning (AIME 2024) over standard fine-tuning.
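
To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of fixed-quota token selection. The function name `cggr_loss`, the `quota` parameter, and the use of per-token cross-entropy as the difficulty signal are illustrative assumptions, not the released implementation; see the [CGGR repository](https://github.com/MinimaML/CGGR) for the actual code.

```python
import torch
import torch.nn.functional as F

def cggr_loss(logits: torch.Tensor, labels: torch.Tensor, quota: float = 0.25) -> torch.Tensor:
    """Sketch: backpropagate only through the hardest `quota` fraction of tokens."""
    # Shift for causal LM: position t predicts token t+1.
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()

    # Unreduced per-token cross-entropy; padding labeled -100 contributes 0.
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
        reduction="none",
    )

    valid = labels.view(-1) != -100
    k = max(1, int(quota * valid.sum().item()))

    # Fixed quota: keep the k highest-loss (lowest-confidence) tokens;
    # all other positions are excluded from the loss.
    hard_idx = per_token.masked_fill(~valid, float("-inf")).topk(k).indices
    return per_token[hard_idx].mean()
```

This sketch only changes which tokens contribute to the loss; the throughput and VRAM figures reported below come from the full CGGR implementation.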

### Benchmark Results (6-Hour Training Race)

| Metric                           | Standard (Baseline) | CGGR (Ours)        | Relative Gain     |
| :------------------------------- | :------------------ | :----------------- | :---------------- |
| **Solving Accuracy (AIME 2024)** | 8.00%               | **9.50%**          | **+18.75%**       |
| **Samples Processed (6 h)**      | 14,368              | **58,716**         | **+308%**         |
| **Final Training Loss**          | 0.3610              | **0.0980**         | **-73%**          |
| **Max Batch Size (12GB VRAM)**   | 18                  | **69**             | **3.8x**          |

## 📈 Performance Visuals

![Benchmark Dashboard](https://huggingface.co/MinimaML/SmolLM-135M-CGGR-Math/resolve/main/benchmark_dashboard.png)

## Model Details

- **Architecture:** Transformer Decoder (SmolLM-135M)
- **Training Method:** CGGR (Confidence-Gated Gradient Routing)
- **Selection Strategy:** Fixed Quota (Top 25% hardest tokens)
- **Compute:** Single NVIDIA RTX 3060 (12GB)
- **Duration:** 6 hours total

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint and its tokenizer.
model_name = "MinimaML/SmolLM-135M-CGGR-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Math prompt in the "Question: ... Answer:" format.
prompt = "Question: If x + y = 10 and x - y = 2, what is the value of x?\n\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
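
The call above uses greedy decoding (the `generate` default when `do_sample` is not set); for longer multi-step problems, consider raising `max_new_tokens`.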

## Citation

If you use this model or the CGGR technique in your research, please cite:

```bibtex
@software{cggr2026,
  title={CGGR: Confidence-Gated Gradient Routing},
  author={MinimaML},
  year={2026},
  url={https://github.com/MinimaML/CGGR}
}
```