QED-Nano SFT

Table of Contents

  1. Model Summary
  2. How to use
  3. Evaluation
  4. Training
  5. Limitations
  6. License
  7. Citation

Model Summary

QED-Nano SFT is a 4B parameter mathematical reasoning model fine-tuned on olympiad-level proof problems. This is the supervised fine-tuning (SFT) checkpoint trained via knowledge distillation from deepseek-ai/DeepSeek-Math-V2 (685B parameters). The model generates mathematical proofs with explicit chain-of-thought reasoning enclosed in <think> tags.

This model serves as the foundation for the full QED-Nano pipeline, which includes subsequent reinforcement learning and reasoning cache extensions. For the complete trained model, see lm-provers/QED-Nano.

Key features

  • Reasoning model optimized for theorem proving
  • Supervised fine-tuning checkpoint: Foundation for the full QED-Nano model (SFT + RL + Reasoning Cache)
  • Distilled from DeepSeek-Math-V2 (685B): Knowledge distillation to 4B parameters
  • Fully open: open weights + full training details including public data mixture and training configs

For more details refer to our blog post: https://huggingface.co/spaces/lm-provers/qed-nano-blogpost

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lm-provers/QED-Nano-SFT"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
).to(device)

# prepare the model input
prompt = "Generate a rigorous proof answering the following question: is \\sqrt{2} rational or irrational?"
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the output
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)

# Get and decode the output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))

We recommend setting temperature=0.6 and top_p=0.95 in the sampling parameters.
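Because the proof follows the chain-of-thought enclosed in `<think>` tags, a small helper is handy for separating the reasoning trace from the final answer. The sketch below is illustrative (the tag names match those described in the Model Summary; `split_think` is not part of any library API):

```python
def split_think(text, open_tag="<think>", close_tag="</think>"):
    # Returns (reasoning, answer). If no closing tag is present
    # (e.g. generation was truncated), treat everything as reasoning.
    if close_tag in text:
        reasoning, _, answer = text.partition(close_tag)
        return reasoning.replace(open_tag, "").strip(), answer.strip()
    return text.replace(open_tag, "").strip(), ""
```

For example, `split_think(tokenizer.decode(output_ids, skip_special_tokens=True))` yields the hidden reasoning and the visible proof as separate strings.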

vLLM and SGLang

You can deploy the model with vLLM or SGLang behind an OpenAI-compatible API.

SGLang

python -m sglang.launch_server --model-path lm-provers/QED-Nano-SFT

vLLM

vllm serve lm-provers/QED-Nano-SFT
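Either server exposes an OpenAI-compatible `/v1/chat/completions` endpoint. Below is a minimal stdlib-only client sketch; the base URL assumes vLLM's default port 8000 (SGLang defaults to 30000), so adjust to your setup:

```python
import json
import urllib.request

def build_chat_request(prompt, model="lm-provers/QED-Nano-SFT"):
    # Payload for the OpenAI-compatible /v1/chat/completions endpoint,
    # using the recommended sampling parameters.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
    }

def query(prompt, base_url="http://localhost:8000/v1"):
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```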

Evaluation

In this section, we report the evaluation results of QED-Nano on IMO-ProofBench, ProofBench, and IMO-AnswerBench. All evaluations except those on IMO-AnswerBench are reported as avg@3 unless stated otherwise.

| Model                         | IMO-ProofBench | ProofBench | IMO-AnswerBench |
|-------------------------------|----------------|------------|-----------------|
| Qwen3-4B-Thinking-2507        | 20.4 (2.6)     | 19.5 (0.9) | 55.8            |
| QED-Nano-SFT                  | 39.5 (2.9)     | 33.3 (0.5) | 57.5            |
| QED-Nano                      | 40.0 (0.6)     | 44.9 (3.4) | 67.5            |
| QED-Nano (Agent)              | 54.0 (3.7)     | 54.4 (2.4) | -               |
| Qwen3-30B-A3B-Thinking-2507   | 27.6 (1.0)     | 26.1 (2.4) | 67.0            |
| Qwen3-235B-A22B-Thinking-2507 | 34.1 (0.7)     | 33.7 (1.1) | 70.5            |
| Nomos-1                       | 40.3 (3.5)     | 28.3 (3.9) | 49.0            |
| GPT-OSS-20B                   | 38.3 (1.2)     | 38.4 (3.9) | 61.5            |
| GPT-OSS-120B                  | 43.1 (3.2)     | 47.5 (1.7) | 70.5            |
| DeepSeek-Math-V2              | 57.9 (2.0)     | 60.6 (0.1) | 75.8            |
| Gemini 3 Pro                  | 58.7 (2.9)     | 66.7 (3.1) | 83.2            |

Note: This is the SFT-only checkpoint. Performance improves significantly with RL training and reasoning cache. See the full lm-provers/QED-Nano model for complete results.

Training

Training Hyperparameters

  • Optimization Steps: 620
  • Global Batch Size: 32
  • Learning Rate Schedule: Cosine with peak at 3×10⁻⁵
  • Max Sequence Length: 45,056 tokens
  • Training Configuration: Unique problems only (4.3k samples outperformed 7.7k with duplicates)
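As a sanity check, the learning-rate schedule can be sketched as a linear warmup (using the `warmup_ratio=0.03` from the training script below) into a cosine decay from the 3×10⁻⁵ peak. `cosine_lr` is an illustrative approximation of that schedule, not the exact optimizer code:

```python
import math

def cosine_lr(step, max_steps=620, peak_lr=3e-5, warmup_ratio=0.03):
    # Linear warmup for the first warmup_ratio fraction of steps,
    # then cosine decay from peak_lr down to zero.
    warmup_steps = int(max_steps * warmup_ratio)  # 18 steps here
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```

The peak of 3×10⁻⁵ is reached at step 18, and the rate decays smoothly to zero by step 620.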

Training script

You can reproduce this training using the TRL library:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Thinking-2507",
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")

# Load dataset
dataset = load_dataset("lm-provers/Olympiads-SFT", split="train")

# Training configuration
training_args = SFTConfig(
    output_dir="./qed-nano-sft",
    max_steps=620,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=3.0e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_seq_length=45056,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=1,
    save_steps=62,
    push_to_hub=True,
)

# Train
trainer = SFTTrainer(model=model, tokenizer=tokenizer, args=training_args, train_dataset=dataset)
trainer.train()

Limitations

QED-Nano SFT is a supervised fine-tuning checkpoint with several important limitations:

Critical Limitations

  1. Length Explosion: This SFT-only model suffers from uncontrolled reasoning length growth. It frequently generates responses exceeding 100k tokens, often meandering or repeating steps without converging to a solution. This is a known limitation of pure supervised learning on long-form reasoning tasks.

  2. Correctness Issues: While the model produces fluent mathematical text, it may generate incorrect proofs. The SFT stage does not directly optimize for correctness—only for imitating the teacher's reasoning style.

  3. No Self-Correction: Unlike the full QED-Nano model with RL training, this checkpoint lacks the ability to verify and refine its solutions.

Other Limitations

  • Domain-specific: Designed specifically for theorem proving; used as a general assistant, it will likely produce low-quality output outside this domain.
  • Language: English only
  • Text-Only: Cannot handle problems with diagrams or visual elements
  • Teacher Errors: May have inherited some errors from the teacher model (DeepSeek-Math-V2)

Recommended Mitigations

  • Use the full lm-provers/QED-Nano model for production use
  • Apply length penalties or early stopping to control generation length
  • Combine with external verification tools or judges
  • Consider RL fine-tuning to address correctness and length issues
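For the length-explosion issue specifically, a cheap post-hoc filter can flag runaway generations before passing them to a verifier or judge. The following is a heuristic sketch (the thresholds are arbitrary illustration values, not settings used in training):

```python
def looks_runaway(text, max_chars=200_000, repeat_window=5):
    # Heuristic: flag output that exceeds a character budget, or in
    # which the same non-empty line repeats many times consecutively
    # (a common symptom of non-converging reasoning).
    if len(text) > max_chars:
        return True
    lines = [l for l in text.splitlines() if l.strip()]
    run = 1
    for prev, cur in zip(lines, lines[1:]):
        run = run + 1 if cur == prev else 1
        if run >= repeat_window:
            return True
    return False
```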

These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.

License

Apache 2.0

Citation

If you use this model in your research, please cite:

@misc{qed_nano_sft_2026,
  title={QED-Nano SFT: Distilling Olympiad Mathematical Reasoning to 4B Parameters},
  author={Edward Beeching and Jasper Dekoninck and Jia Li and Yuxiao Qu and Amrith Setlur and Ian Wu and Aviral Kumar and Lewis Tunstall},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/lm-provers/QED-Nano-SFT}}
}

Please also cite the training dataset:

@dataset{fineproofs_sft_dataset_2026,
  title={FineProofs SFT Dataset: Mathematical Olympiad Problems with Chain-of-Thought Reasoning},
  author={Edward Beeching and Jasper Dekoninck and Jia Li and Yuxiao Qu and Amrith Setlur and Ian Wu and Aviral Kumar and Lewis Tunstall},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/lm-provers/FineProofs-SFT}}
}