license: apache-2.0
base_model: mistralai/Mathstral-7B-v0.1
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- peft
- grpo
- dr-grpo
- mathematical-reasoning
- math
datasets:
- hugruby/mismatched-wrong-drafts
language:
- en
Mathstral-7B · Mismatched × Wrong drafts
Headline model — the mismatched-wrong configuration from the paper.
LoRA adapter for mistralai/Mathstral-7B-v0.1, trained with Dr. GRPO as the Mismatched × Wrong drafts condition in "Weak-to-Strong Elicitation via Mismatched Wrong Drafts" (Wei Deng, arXiv:2605.17314).
- Base model: mistralai/Mathstral-7B-v0.1 (Apache-2.0)
- Draft model: Qwen/Qwen2.5-Math-1.5B (writes the training-time draft)
- Adapter: LoRA r=16, α=32 (167 MB), released at global step 2000
- Training data: config mismatched_wrong of hugruby/mismatched-wrong-drafts, 8,888 Level 3–5 MATH problems (MATH-500 held out)
- License: Apache-2.0
How to use
This is a LoRA adapter — load it on top of the base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
BASE = "mistralai/Mathstral-7B-v0.1"
ADAPTER = "hugruby/mathstral-7b-mismatched-wrong-drafts"
tok = AutoTokenizer.from_pretrained(ADAPTER)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto"),
    ADAPTER,
).eval()
problem = "If $x+y=6$ and $xy=5$, find $x^2+y^2$."
gen = dict(max_new_tokens=4096, do_sample=False)
# CANONICAL — the plain draft-free prompt the model was trained and evaluated on (no [INST]):
PROMPT = (
    "Problem: " + problem + "\n\n"
    "Thinking: N/A\n\n"
    "The thinking section may contain errors. Solve the math problem step by step. "
    "Write your own correct solution. Put your final answer within \\boxed{}.\n\n"
    "Correct Solution:"
)
ids = tok(PROMPT, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, **gen)[0][ids.input_ids.shape[1]:], skip_special_tokens=True))
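Because the prompt asks for the final answer inside \boxed{}, scoring a generation usually starts by pulling that expression out of the decoded text. The helper below is a minimal sketch of that step. The name extract_boxed is hypothetical and not part of this repository, and it is not the mathematically_quasi_correct checker used for training.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Hypothetical helper for illustration. Braces are matched by counting,
    so nested groups such as \\frac{1}{2} are returned intact.
    """
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


print(extract_boxed("So $x^2+y^2 = 36 - 10 = \\boxed{26}$."))  # prints 26
```

Taking the last \boxed{} rather than the first is a design choice: long step-by-step solutions sometimes box intermediate values before the final one.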
Optional: the [INST] chat format (out-of-distribution)
The shipped chat_template.jinja is Mathstral's original [INST] chat template. This adapter was not trained in that format, so apply_chat_template(...) is out-of-distribution and generally underperforms the plain prompt above — it is included only so you can A/B both:
ids = tok.apply_chat_template(
    [{"role": "user",
      "content": problem + "\n\nPlease reason step by step, and put your final answer within \\boxed{}."}],
    add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(ids, **gen)[0][ids.shape[1]:], skip_special_tokens=True))
How it was trained
Trained with Dr. GRPO (loss_type=dr_grpo, scale_rewards=False) using TRL GRPOTrainer on top of Unsloth FastLanguageModel, on the mismatched_wrong data config. The reward is purely binary: mathematically_quasi_correct scores each completion as correct or incorrect, and the correction-bonus, copy-penalty, and corrupt-penalty terms are all set to 0.
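To make the scale_rewards=False setting concrete: with a binary reward, the unscaled group-relative advantage is each completion's reward minus the mean reward of its group, with no division by the group's standard deviation. The snippet below is a sketch of that computation in plain Python, not the TRL implementation. The function name group_advantages and the example reward vector are illustrative assumptions.

```python
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered advantages for one prompt's group of completions.

    Sketch of the unscaled variant: the group mean is subtracted,
    but the result is not divided by the group standard deviation.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hypothetical group: 16 completions for one problem, 4 of them correct.
rewards = [1.0] * 4 + [0.0] * 12
adv = group_advantages(rewards)
print(adv[0], adv[-1])  # 0.75 for a correct completion, -0.25 for a wrong one
```

Under this formula, a group in which all 16 completions are correct, or all are wrong, has all-zero advantages, so with β = 0 it contributes no learning signal for that step.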
Training command:
python scripts/train.py \
  --model mistralai/Mathstral-7B-v0.1 \
  --dataset-path data/mismatched_wrong \
  --output-dir outputs/mismatched_wrong \
  --max-steps 2222 \
  --gradient-accumulation-steps 4 \
  --max-completion-length 4096 \
  --max-seq-length 7168 \
  --max-prompt-tokens 3072 \
  --learning-rate 5e-6 --lr-scheduler-type constant \
  --beta 0 \
  --correction-bonus 0.0 --copy-penalty 0.0 --corrupt-penalty 0.0 \
  --adam-beta2 0.99 \
  --save-steps 50 --gpu-mem-util 0.5
| Hyperparameter | Value |
|---|---|
| Base model | mistralai/Mathstral-7B-v0.1 |
| Method | Dr. GRPO (loss_type=dr_grpo, scale_rewards=False) |
| LoRA rank / alpha | r = 16, α = 32 → scaling γ = α/r = 2 |
| LoRA targets / dropout | q,k,v,o,gate,up,down (7 projections) / 0.0 |
| KL coefficient β | 0 |
| Reward bonuses | correction 0, copy-penalty 0, corrupt-penalty 0 |
| Generations per prompt | 16 |
| Per-device batch | 1 |
| Gradient accumulation | 4 → 4 problems × 16 = 64 completions/step |
| Learning rate | 5e-6, constant schedule |
| Adam β₂ | 0.99 |
| Max completion length | 4096 |
| Max sequence length | 7168 |
| Max prompt tokens | 3072 |
| Max steps | 2222 |
| Released checkpoint | global step 2000 (epoch 0.900) |
| Random seed | 42 |
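Several table entries follow from one another by simple arithmetic. The snippet below is a small consistency check that uses only numbers stated in this card (the 8,888-problem training set and 4 problems per optimizer step). The variable names are illustrative.

```python
# Values copied from this card.
lora_r, lora_alpha = 16, 32
problems_per_step = 4          # gradient accumulation 4, per-device batch 1
generations_per_prompt = 16
train_problems = 8888
released_step, max_steps = 2000, 2222
max_prompt_tokens, max_completion_length = 3072, 4096

lora_scaling = lora_alpha / lora_r                                  # 2.0
completions_per_step = problems_per_step * generations_per_prompt   # 64
released_epoch = released_step * problems_per_step / train_problems
full_run_epochs = max_steps * problems_per_step / train_problems    # 1.0
prompt_plus_completion = max_prompt_tokens + max_completion_length  # 7168

print(lora_scaling, completions_per_step, round(released_epoch, 3),
      full_run_epochs, prompt_plus_completion)
```

The 2222-step budget therefore corresponds to exactly one pass over the training set, the released step-2000 checkpoint sits at about 90% of that pass, and the 7168-token sequence limit equals the prompt and completion limits combined.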
Files
- adapter_model.safetensors, adapter_config.json: the LoRA adapter (load with PEFT on the base model)
- tokenizer.json, tokenizer.model, tokenizer_config.json, special_tokens_map.json: tokenizer
- chat_template.jinja: Mathstral's [INST] template (see the out-of-distribution note above)
Citation
@article{deng2026mismatched,
  title   = {Weak-to-Strong Elicitation via Mismatched Wrong Drafts},
  author  = {Deng, Wei},
  journal = {arXiv preprint arXiv:2605.17314},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.17314}
}
License
Apache-2.0. The base model (Mathstral-7B-v0.1) and the draft model (Qwen2.5-Math-1.5B) are both Apache-2.0.