---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM-135M
datasets:
- openai/gsm8k
- meta-math/MetaMathQA
- AI-MO/NuminaMath-1.5
tags:
- math
- reasoning
- efficient-training
- cggr
- sparse-gradients
model_name: SmolLM-135M-CGGR-Math
---

# SmolLM-135M-CGGR-Math

This model is a specialized version of **HuggingFaceTB/SmolLM-135M**, fine-tuned for mathematical reasoning using **Confidence-Gated Gradient Routing (CGGR)**.

## The CGGR Breakthrough

This model was trained with a novel strategy that selects only the "hardest" tokens in each batch for gradient updates (sketched below), allowing for:

- **4.08x Higher Throughput:** Processing 4x more data in the same wall-clock time compared to standard training.
- **66% VRAM Savings:** Fitting large-batch training on consumer hardware (RTX 3060).
- **Superior Convergence:** Achieving a **+19% relative accuracy improvement** on math reasoning tasks (AIME 2024) compared to standard fine-tuning.
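
The full implementation lives in the CGGR repository linked under Citation; the following is only a minimal PyTorch sketch of the fixed-quota routing idea, with hypothetical names: score each token by its per-token cross-entropy, keep the top 25% highest-loss (least confident) tokens, and backpropagate only their loss.

```python
import torch
import torch.nn.functional as F

def cggr_loss(logits: torch.Tensor, labels: torch.Tensor, quota: float = 0.25) -> torch.Tensor:
    """Fixed-quota confidence gating: backprop only the hardest tokens.

    Illustrative sketch only; not the repository's actual API.
    """
    # Causal LM shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    # Per-token cross-entropy with no reduction; -100 marks padding.
    per_token = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
        reduction="none",
    )

    # Drop padding positions before ranking tokens by difficulty.
    per_token = per_token[shift_labels.view(-1) != -100]

    # Fixed quota: keep the top 25% highest-loss ("hardest") tokens.
    k = max(1, int(quota * per_token.numel()))
    hardest, _ = torch.topk(per_token, k)

    # Only the selected tokens contribute gradients.
    return hardest.mean()
```

Note that masking the loss this way zeroes the easy tokens' gradient contributions but still stores their activations; the VRAM savings reported above would additionally require skipping activation storage for unselected tokens, which this sketch does not attempt.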

### Benchmark Results (6-Hour Training Race)

| Metric | Standard (Baseline) | CGGR (Ours) | Relative Gain |
| :-------------------------------------- | :------------------ | :---------- | :------------ |
| **Solving Accuracy (AIME 2024)** | 8.00% | **9.50%** | **+18.75%** |
| **Training Throughput (samples / 6 h)** | 14,368 | **58,716** | **+308%** |
| **Final Loss (lower is better)** | 0.3610 | **0.0980** | **-73%** |
| **Max Batch Size (12 GB VRAM)** | 18 | **69** | **3.8x** |

## Performance Visuals



## Model Details

- **Architecture:** Transformer decoder (SmolLM-135M)
- **Training Method:** Confidence-Gated Gradient Routing (CGGR)
- **Selection Strategy:** Fixed quota (top 25% hardest tokens)
- **Compute:** A single NVIDIA RTX 3060 (12 GB)
- **Duration:** 6 hours total

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint and its tokenizer from the Hub.
model_name = "MinimaML/SmolLM-135M-CGGR-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Math problems are prompted in Question/Answer format.
prompt = "Question: If x + y = 10 and x - y = 2, what is the value of x?\n\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
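
For the prompt above, the expected completion is x = 6: adding the two equations gives 2x = 12, so x = 6.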

## Citation

If you use this model or the CGGR technique in your research, please cite:

```bibtex
@software{cggr2026,
  title={CGGR: Confidence-Gated Gradient Routing},
  author={MinimaML},
  year={2026},
  url={https://github.com/MinimaML/CGGR}
}
```