GPT-OSS-20B NuminaMath
Overview
This repository provides the GPT-OSS-20B model fine-tuned on the NuminaMath-TIR dataset, which consists of roughly 70k examples, to improve mathematical olympiad reasoning and structured problem solving.
The adapters are designed to be used with the base model gpt-oss-20b, a Mixture-of-Experts (MoE) transformer architecture. Fine-tuning focuses on improving the model’s ability to generate step-by-step reasoning, symbolic manipulation, and detailed mathematical explanations when solving math problems.
Instead of updating the full model weights, parameter-efficient fine-tuning (PEFT) was used to modify only a small number of parameters in the attention layers. This allows the adapters to significantly improve reasoning ability while keeping training compute requirements relatively low.
The resulting LoRA adapters can be loaded on top of the base model to enhance its performance on mathematical olympiad reasoning tasks such as algebra, arithmetic, and problem-solving explanations.
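A minimal loading sketch with the PEFT library is shown below. The adapter repo id is a placeholder and the base-model id is an assumption; substitute the actual identifiers.

```python
# Sketch: load the base model in BF16 and attach the LoRA adapters.
# "openai/gpt-oss-20b" and the adapter path are assumed/placeholder ids.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "openai/gpt-oss-20b"          # assumed base-model repo id
adapter_id = "path/to/these-adapters"   # placeholder adapter location

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,  # BF16, matching the fine-tuning precision
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()
```

`PeftModel.from_pretrained` leaves the base weights untouched and injects the low-rank adapter matrices at inference time.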
Model Details
| Field | Value |
|---|---|
| Base Model | gpt-oss-20b |
| Architecture | Mixture-of-Experts Transformer |
| Fine-Tuning Method | LoRA (PEFT) |
| Precision | BF16 |
| Context Length | 8192 tokens |
| Training Hardware | NVIDIA H100 |
| Framework | PyTorch + Transformers + PEFT |
Training Data
Dataset
The model was fine-tuned using the NuminaMath-TIR dataset, which contains mathematical problems paired with structured reasoning traces and final answers.
Dataset link: https://huggingface.co/datasets/AI-MO/NuminaMath-TIR
The dataset covers several mathematical domains, including:
- arithmetic
- algebra
- number theory
- geometry
- calculus
- reasoning-based problem solving
The dataset emphasizes step-by-step explanations, allowing the model to learn how to produce reasoning chains rather than only final answers.
Dataset Processing
The dataset was originally provided as a CSV file and processed prior to training.
Processing pipeline:
- Loaded using pandas
- Columns normalized to `prompt` and `response`
- Empty rows removed
- Converted to Hugging Face Dataset format
- Randomized train/validation split
Dataset split:
| Split | Percentage |
|---|---|
| Train | 95% |
| Validation | 5% |
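The pipeline above can be sketched with pandas on a tiny in-memory stand-in for the CSV. The raw column names (`problem`, `solution`) are assumptions; only the normalized `prompt`/`response` names come from this document. In the real pipeline the frame was then converted with `datasets.Dataset.from_pandas`.

```python
# Minimal sketch of the described preprocessing (assumed raw column names).
import io
import pandas as pd

csv_text = (
    "problem,solution\n"
    "What is 2+2?,4\n"
    ",\n"                      # an empty row, to be dropped
    "Solve x+1=3.,x=2\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Normalize the columns to prompt/response
df = df.rename(columns={"problem": "prompt", "solution": "response"})

# Remove empty rows
df = df.dropna(subset=["prompt", "response"]).reset_index(drop=True)

# Randomized 95/5 train/validation split
# (the real pipeline converted df with datasets.Dataset.from_pandas first)
df = df.sample(frac=1.0, random_state=42)
n_val = max(1, round(0.05 * len(df)))
val_df, train_df = df.iloc[:n_val], df.iloc[n_val:]
```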
Instruction Format
Training samples were converted into the following chat-style instruction format compatible with the GPT-OSS tokenizer.
```
<|im_start|>user
{prompt}
<|im_end|>
<|im_start|>assistant
{response}
<|im_end|>
```
This format enables the model to learn structured conversational reasoning and aligns with the instruction format used in many modern LLMs.
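The formatting step can be sketched as a small function that wraps each `prompt`/`response` pair in the markup above (the function name is illustrative):

```python
# Sketch: render one dataset row into the chat-style training format.
def format_example(example: dict) -> str:
    return (
        "<|im_start|>user\n"
        f"{example['prompt']}\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
        f"{example['response']}\n"
        "<|im_end|>"
    )

sample = {"prompt": "What is 2+2?", "response": "4"}
text = format_example(sample)
```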
Training Procedure
The model was fine-tuned using LoRA adapters applied only to attention layers.
Because gpt-oss-20b is a Mixture-of-Experts (MoE) architecture, LoRA was intentionally not applied to expert layers in order to preserve the routing structure and maintain training stability.
LoRA Target Modules
Adapters were applied to the following projection layers:
- `q_proj`
- `k_proj`
- `v_proj`
- `o_proj`
These correspond to the query, key, value, and output projections within the attention mechanism.
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Bias | none |
Only attention projections were modified, ensuring minimal disruption to the base model while still enabling meaningful behavioral improvements.
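The configuration above maps directly onto `peft.LoraConfig`; a sketch (argument names follow the PEFT API, and attaching to a loaded base model is shown only as a comment):

```python
# Sketch: the LoRA setup from the tables above, expressed with PEFT.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                 # rank
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # attention projections only; expert/MoE layers are left untouched
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# model = get_peft_model(base_model, lora_config)
```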
Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 2 |
| Learning Rate | 2e-4 |
| Optimizer | AdamW (fused) |
| Adam β1 | 0.9 |
| Adam β2 | 0.95 |
| Weight Decay | 0.01 |
| Warmup Ratio | 0.03 |
| Max Gradient Norm | 1.0 |
Batch configuration:
| Parameter | Value |
|---|---|
| Per Device Batch Size | 4 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 16 |
Maximum sequence length:
8192 tokens
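The hyperparameters above can be collected into a `transformers.TrainingArguments` sketch (`output_dir` is a placeholder; the remaining values come from the tables):

```python
# Sketch: training hyperparameters as a TrainingArguments config fragment.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gpt-oss-20b-numinamath-lora",  # placeholder
    num_train_epochs=2,
    learning_rate=2e-4,
    optim="adamw_torch_fused",                 # fused AdamW
    adam_beta1=0.9,
    adam_beta2=0.95,
    weight_decay=0.01,
    warmup_ratio=0.03,
    max_grad_norm=1.0,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,             # effective batch size 16
    bf16=True,
)
```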
Training Infrastructure
Training was performed on the following hardware:
1× NVIDIA H100 GPU
Training optimizations included:
- Flash Attention 2
- BF16 mixed precision
- TF32 enabled
- Gradient checkpointing
- Memory-optimized LoRA configuration
MoE compatibility adjustments included:
- LoRA applied only to attention layers
- CPU offloading disabled
- Gradient checkpointing configured with `use_reentrant=False`
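The optimization and compatibility flags above can be sketched as follows (assumes a CUDA GPU with Flash Attention 2 installed; the base-model id is an assumption):

```python
# Sketch: TF32, BF16, Flash Attention 2, and non-reentrant checkpointing.
import torch
from transformers import AutoModelForCausalLM

torch.backends.cuda.matmul.allow_tf32 = True   # enable TF32 matmuls
torch.backends.cudnn.allow_tf32 = True

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",                      # assumed base-model id
    torch_dtype=torch.bfloat16,                # BF16 mixed precision
    attn_implementation="flash_attention_2",
    device_map="auto",                         # no CPU offloading
)
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
```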
Training frameworks used:
- PyTorch
- Hugging Face Transformers
- PEFT
- Hugging Face Datasets
Evaluation
Validation loss was computed periodically on the held-out split during training.
Metrics monitored:
- training loss
- validation loss
The model was trained for exactly 2 epochs on the entire dataset without automated checkpoint selection. The final validation loss after 2 full epochs was 0.4039.
Intended Use
This model is intended for:
- mathematical reasoning research
- educational demonstrations
- experimentation with reasoning-focused fine-tuning
- evaluation of math-capable language models
It is not intended for high-stakes mathematical or scientific applications.
Limitations
Despite improvements from fine-tuning, the model still has several limitations:
- The model may generate incorrect reasoning steps.
- Mathematical derivations may lack formal rigor.
- Some areas of mathematics may be underrepresented in the dataset.
- Performance depends strongly on the capabilities of the base model.
Users should treat model outputs as assistive suggestions rather than authoritative answers.
Ethical Considerations
Language models trained for reasoning may produce confident but incorrect explanations.
For educational or academic use:
- outputs should be verified independently
- the model should not be treated as an authoritative mathematical source
Acknowledgements
This work builds upon the open-source ecosystem including:
- Hugging Face Transformers
- the PEFT library for parameter-efficient fine-tuning
- the NuminaMath dataset
- research on Mixture-of-Experts transformer architectures
Citation
Dataset:
https://huggingface.co/datasets/AI-MO/NuminaMath-TIR
Training Notebook:
https://www.kaggle.com/code/tensorhydra/gpt-oss-20b-finetune-numinamath