QED-Nano SFT
Model Summary
QED-Nano SFT is a 4B parameter mathematical reasoning model fine-tuned on olympiad-level proof problems. This is the supervised fine-tuning (SFT) checkpoint, trained via knowledge distillation from deepseek-ai/DeepSeek-Math-V2 (685B parameters). The model generates mathematical proofs with explicit chain-of-thought reasoning enclosed in `<think>` tags.
This model serves as the foundation for the full QED-Nano pipeline, which includes subsequent reinforcement learning and reasoning cache extensions. For the complete trained model, see lm-provers/QED-Nano.
Key features
- Reasoning model optimized for theorem proving
- Supervised fine-tuning checkpoint: Foundation for the full QED-Nano model (SFT + RL + Reasoning Cache)
- Distilled from DeepSeek-Math-V2 (685B): Knowledge distillation to 4B parameters
- Fully open: open weights + full training details including public data mixture and training configs
For more details refer to our blog post: https://huggingface.co/spaces/lm-provers/qed-nano-blogpost
How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lm-provers/QED-Nano-SFT"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Prepare the model input (note the escaped backslash in the LaTeX command)
prompt = "Generate a rigorous proof to the following question: is $\\sqrt{2}$ rational or irrational?"
messages_think = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the output
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)

# Decode only the newly generated tokens
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
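As noted in the summary, the model encloses its chain-of-thought in `<think>` tags. A minimal, hypothetical helper for separating the reasoning trace from the final proof — the tag format is the only detail taken from this card; the helper itself is not part of the released code:

```python
def split_reasoning(output_text: str) -> tuple[str, str]:
    """Split a completion into (reasoning, final_answer) using <think> tags.

    If the closing tag is missing (e.g. generation was truncated),
    the whole tagged text is treated as reasoning.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = output_text.find(open_tag)
    end = output_text.find(close_tag)
    if start == -1:
        # No reasoning block: everything is the answer
        return "", output_text.strip()
    if end == -1:
        # Opened but never closed: truncated reasoning, no answer
        return output_text[start + len(open_tag):].strip(), ""
    reasoning = output_text[start + len(open_tag):end].strip()
    answer = output_text[end + len(close_tag):].strip()
    return reasoning, answer

reasoning, proof = split_reasoning(
    "<think>Assume sqrt(2) = p/q in lowest terms...</think>Therefore sqrt(2) is irrational."
)
```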
We recommend setting `temperature=0.6` and `top_p=0.95` in the sampling parameters.
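With `transformers`, these sampling parameters only take effect when sampling is enabled, so pass `do_sample=True` alongside them. A sketch of the resulting generation arguments:

```python
# Recommended sampling configuration for QED-Nano-SFT
gen_kwargs = {
    "do_sample": True,     # required for temperature/top_p to take effect
    "temperature": 0.6,
    "top_p": 0.95,
    "max_new_tokens": 32768,
}
# usage: generated_ids = model.generate(**model_inputs, **gen_kwargs)
```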
vLLM and SGLang
You can deploy the model with vLLM or SGLang to serve an OpenAI-compatible API.
SGLang
```shell
python -m sglang.launch_server --model-path lm-provers/QED-Nano-SFT
```
vLLM
```shell
vllm serve lm-provers/QED-Nano-SFT
```
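Once a server is running, requests follow the OpenAI chat-completions format. A sketch of the request body — the endpoint assumes vLLM's default local port (SGLang uses a different default), and the example prompt is illustrative:

```python
import json

# Assumption: vLLM serving locally on its default port
endpoint = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "lm-provers/QED-Nano-SFT",
    "messages": [
        {"role": "user", "content": "Prove that there are infinitely many primes."}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 32768,
}
body = json.dumps(payload)
# Send with any HTTP client, e.g.:
#   curl -X POST <endpoint> -H "Content-Type: application/json" -d '<body>'
```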
Evaluation
In this section, we report the evaluation results of QED-Nano on IMO-ProofBench, ProofBench, and IMO-AnswerBench. All evaluations except those on IMO-AnswerBench are reported as avg@3 unless stated otherwise.
| Model | IMO-ProofBench | ProofBench | IMO-AnswerBench |
|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 20.4 (2.6) | 19.5 (0.9) | 55.8 |
| QED-Nano-SFT | 39.5 (2.9) | 33.3 (0.5) | 57.5 |
| QED-Nano | 40.0 (0.6) | 44.9 (3.4) | 67.5 |
| QED-Nano (Agent) | 54.0 (3.7) | 54.4 (2.4) | - |
| Qwen3-30B-A3B-Thinking-2507 | 27.6 (1.0) | 26.1 (2.4) | 67.0 |
| Qwen3-235B-A22B-Thinking-2507 | 34.1 (0.7) | 33.7 (1.1) | 70.5 |
| Nomos-1 | 40.3 (3.5) | 28.3 (3.9) | 49.0 |
| GPT-OSS-20B | 38.3 (1.2) | 38.4 (3.9) | 61.5 |
| GPT-OSS-120B | 43.1 (3.2) | 47.5 (1.7) | 70.5 |
| DeepSeek-Math-V2 | 57.9 (2.0) | 60.6 (0.1) | 75.8 |
| Gemini 3 Pro | 58.7 (2.9) | 66.7 (3.1) | 83.2 |
Note: This is the SFT-only checkpoint. Performance improves significantly with RL training and reasoning cache. See the full lm-provers/QED-Nano model for complete results.
Training
Model
- Architecture: Transformer decoder
- Precision: bfloat16
- Base Model: Qwen/Qwen3-4B-Thinking-2507
- Parameters: 4 billion
- Training Data: lm-provers/FineProofs-SFT (4,300 unique problems)
- Teacher Model: deepseek-ai/DeepSeek-Math-V2 (685B)
Training Hyperparameters
- Optimization Steps: 620
- Global Batch Size: 32
- Learning Rate Schedule: Cosine with peak at 3×10⁻⁵
- Max Sequence Length: 45,056 tokens
- Training Configuration: Unique problems only (4.3k samples outperformed 7.7k with duplicates)
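These numbers are internally consistent: 620 optimization steps at a global batch size of 32 amount to roughly 4.6 passes over the 4,300 unique problems.

```python
steps = 620
global_batch_size = 32
unique_problems = 4_300

samples_seen = steps * global_batch_size    # total training samples processed
epochs = samples_seen / unique_problems     # approximate passes over the data
print(samples_seen, round(epochs, 2))
```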
Software & hardware
- GPUs: 1 node with 8 H100
- Training time: 5 hours
- Training Framework: TRL (Transformer Reinforcement Learning)
- Evaluation framework: CMU-AIRe/QED-Nano
Training script
You can reproduce this training using the TRL library:
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Thinking-2507",
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")

# Load the SFT dataset
dataset = load_dataset("lm-provers/FineProofs-SFT", split="train")

# Training configuration
# (global batch size = 2 per device x 2 accumulation steps x 8 GPUs = 32)
training_args = SFTConfig(
    output_dir="./qed-nano-sft",
    max_steps=620,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=3.0e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_seq_length=45056,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=1,
    save_steps=62,
    push_to_hub=True,
)

# Train
trainer = SFTTrainer(model=model, tokenizer=tokenizer, args=training_args, train_dataset=dataset)
trainer.train()
```
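The hyperparameters above note that 4.3k unique problems outperformed 7.7k samples with duplicates. A minimal, hypothetical dedup pass over a list of rows — the `problem` field name is an assumption about the dataset schema, not taken from this card:

```python
def dedupe_by_problem(rows: list[dict]) -> list[dict]:
    """Keep only the first occurrence of each problem statement."""
    seen: set[str] = set()
    unique = []
    for row in rows:
        key = row["problem"].strip()  # assumed field name
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"problem": "Prove sqrt(2) is irrational.", "solution": "A"},
    {"problem": "Prove sqrt(2) is irrational.", "solution": "B"},  # duplicate
    {"problem": "Show there are infinitely many primes.", "solution": "C"},
]
unique_rows = dedupe_by_problem(rows)
```

With Hugging Face `datasets`, the same idea can be applied via `dataset.filter` over a running set of seen keys.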
Limitations
QED-Nano SFT is a supervised fine-tuning checkpoint with several important limitations:
Critical Limitations
Length Explosion: This SFT-only model suffers from uncontrolled reasoning length growth. It frequently generates responses exceeding 100k tokens, often meandering or repeating steps without converging to a solution. This is a known limitation of pure supervised learning on long-form reasoning tasks.
Correctness Issues: While the model produces fluent mathematical text, it may generate incorrect proofs. The SFT stage does not directly optimize for correctness—only for imitating the teacher's reasoning style.
No Self-Correction: Unlike the full QED-Nano model with RL training, this checkpoint lacks the ability to verify and refine its solutions.
Other Limitations
- Domain-specific: Designed specifically for theorem proving. Using it as a general assistant will likely produce nonsense outside this domain.
- Language: English only
- Text-Only: Cannot handle problems with diagrams or visual elements
- Teacher Errors: May have inherited some errors from the teacher model (DeepSeek-Math-V2)
Recommended Mitigations
- Use the full lm-provers/QED-Nano model for production use
- Apply length penalties or early stopping to control generation length
- Combine with external verification tools or judges
- Consider RL fine-tuning to address correctness and length issues
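As a cheap guard against the length explosion described above, decoding can be stopped when the tail of the output starts repeating. A hypothetical n-gram repetition check over generated token ids — not part of the released pipeline, and the thresholds are illustrative:

```python
def is_repeating(token_ids: list[int], ngram: int = 32, window: int = 512) -> bool:
    """Return True if the last `ngram` tokens already appeared earlier
    within the last `window` tokens -- a crude signal that generation
    is looping and can be cut off early."""
    if len(token_ids) < ngram * 2:
        return False
    tail = token_ids[-ngram:]
    history = token_ids[-window:-ngram]
    # Scan the recent history for an exact copy of the tail n-gram
    for i in range(len(history) - ngram + 1):
        if history[i:i + ngram] == tail:
            return True
    return False
```

In `transformers`, a check like this could back a custom `StoppingCriteria` so the loop test runs once per decoding step.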
These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.
License
Citation
If you use this model in your research, please cite:
```bibtex
@misc{qed_nano_sft_2026,
  title={QED-Nano SFT: Distilling Olympiad Mathematical Reasoning to 4B Parameters},
  author={Edward Beeching and Jasper Dekoninck and Jia Li and Yuxiao Qu and Amrith Setlur and Ian Wu and Aviral Kumar and Lewis Tunstall},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/lm-provers/QED-Nano-SFT}}
}
```
Please also cite the training dataset:
```bibtex
@dataset{fineproofs_sft_dataset_2026,
  title={FineProofs SFT Dataset: Mathematical Olympiad Problems with Chain-of-Thought Reasoning},
  author={Edward Beeching and Jasper Dekoninck and Jia Li and Yuxiao Qu and Amrith Setlur and Ian Wu and Aviral Kumar and Lewis Tunstall},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/lm-provers/FineProofs-SFT}}
}
```