---
license: apache-2.0
base_model: mistralai/Mathstral-7B-v0.1
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- peft
- grpo
- dr-grpo
- mathematical-reasoning
- math
datasets:
- hugruby/mismatched-wrong-drafts
language:
- en
---
# Mathstral-7B · Mismatched × Correct drafts
Ablation — a *correct* draft **mismatched** to a different problem (the 2×2's fourth cell).
LoRA adapter for **`mistralai/Mathstral-7B-v0.1`**, trained with **Dr. GRPO** as the **Mismatched × Correct drafts** condition in *"Weak-to-Strong Elicitation via Mismatched Wrong Drafts"* (Wei Deng, [arXiv:2605.17314](https://arxiv.org/abs/2605.17314)).
- **Base model:** `mistralai/Mathstral-7B-v0.1` (Apache-2.0)
- **Draft model:** `Qwen/Qwen2.5-Math-1.5B` (writes the training-time draft)
- **Adapter:** LoRA r=16, α=32 (167 MB), released at **global step 2000**
- **Training data:** config `mismatched_correct` of [`hugruby/mismatched-wrong-drafts`](https://huggingface.co/datasets/hugruby/mismatched-wrong-drafts) — 8,888 Level 3–5 MATH problems (**MATH-500 held out**)
- **License:** Apache-2.0
## How to use
This is a **LoRA adapter** — load it on top of the base model.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "mistralai/Mathstral-7B-v0.1"
ADAPTER = "hugruby/mathstral-7b-mismatched-correct-drafts"

tok = AutoTokenizer.from_pretrained(ADAPTER)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto"),
    ADAPTER,
).eval()

problem = "If $x+y=6$ and $xy=5$, find $x^2+y^2$."
gen = dict(max_new_tokens=4096, do_sample=False)  # greedy decoding

# CANONICAL: the plain draft-free prompt the model was trained and evaluated on (no [INST]).
PROMPT = (
    "Problem: " + problem + "\n\n"
    "Thinking: N/A\n\n"
    "The thinking section may contain errors. Solve the math problem step by step. "
    "Write your own correct solution. Put your final answer within \\boxed{}.\n\n"
    "Correct Solution:"
)

ids = tok(PROMPT, return_tensors="pt").to(model.device)
# Decode only the newly generated tokens (everything after the prompt).
print(tok.decode(model.generate(**ids, **gen)[0][ids.input_ids.shape[1]:], skip_special_tokens=True))
```
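Because the prompt asks for the final answer inside `\boxed{}`, you will usually want to pull that answer out of the decoded text. The helper below is a minimal sketch for doing so; `extract_boxed` is a hypothetical name, not part of this repository or of any library, and it only matches braces without normalising the answer.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1            # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None          # braces never balanced (e.g. truncated generation)


# Illustrative strings only, not real model output.
print(extract_boxed("So $x^2+y^2 = 36 - 10 = \\boxed{26}$."))
print(extract_boxed("The answer is \\boxed{\\frac{1}{2}}."))
```

Brace counting is used instead of a regular expression so that nested LaTeX such as `\frac{1}{2}` is returned whole.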
### Optional: the `[INST]` chat format (out-of-distribution)
The shipped `chat_template.jinja` is Mathstral's original `[INST]` chat template. This adapter was **not** trained in that format, so `apply_chat_template(...)` is **out-of-distribution** and generally underperforms the plain prompt above — it is included only so you can A/B both:
```python
# Reuses tok, model, problem, and gen from the snippet above.
ids = tok.apply_chat_template(
    [{"role": "user",
      "content": problem + "\n\nPlease reason step by step, and put your final answer within \\boxed{}."}],
    add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(ids, **gen)[0][ids.shape[1]:], skip_special_tokens=True))
```
## How it was trained
Trained with **Dr. GRPO** (`loss_type=dr_grpo`, `scale_rewards=False`) using TRL `GRPOTrainer` on top of Unsloth `FastLanguageModel`, on the `mismatched_correct` data config. The reward is the binary `mathematically_quasi_correct` check. The correction-bonus, copy-penalty, and corrupt-penalty terms are all **0**, so no shaping terms are added to it.
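To make the effect of `scale_rewards=False` with a binary reward concrete, here is a minimal sketch of the group-relative advantage for one prompt. It assumes the usual Dr. GRPO formulation, in which each completion's reward is centred on its group mean and is not divided by the group standard deviation. The function name `dr_grpo_advantages` is illustrative and is not taken from TRL or from the training script, and the loss-normalisation side of Dr. GRPO is not shown.

```python
from statistics import mean


def dr_grpo_advantages(rewards):
    """Centre each reward on the group mean; no division by the group std."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# One prompt, 16 sampled completions, binary reward: say 4 correct and 12 wrong.
rewards = [1.0] * 4 + [0.0] * 12
adv = dr_grpo_advantages(rewards)
print(adv[0], adv[-1])  # 0.75 -0.25

# If every completion in the group gets the same reward, all advantages are 0
# and that prompt contributes no learning signal for the step.
print(dr_grpo_advantages([1.0] * 16)[0])  # 0.0
```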
Training command:
```bash
python scripts/train.py \
--model mistralai/Mathstral-7B-v0.1 \
--dataset-path data/mismatched_correct \
--output-dir outputs/mismatched_correct \
--max-steps 2222 \
--gradient-accumulation-steps 4 \
--max-completion-length 4096 \
--max-seq-length 8192 \
--learning-rate 5e-6 --lr-scheduler-type constant \
--beta 0 \
--correction-bonus 0.0 --copy-penalty 0.0 --corrupt-penalty 0.0 \
--adam-beta2 0.99 \
--save-steps 50 --gpu-mem-util 0.5
```
| Hyperparameter | Value |
|---|---|
| Base model | `mistralai/Mathstral-7B-v0.1` |
| Method | Dr. GRPO (`loss_type=dr_grpo`, `scale_rewards=False`) |
| LoRA rank / alpha | **r = 16, α = 32** → scaling **γ = α/r = 2** |
| LoRA targets / dropout | `q,k,v,o,gate,up,down` (7 projections) / 0.0 |
| KL coefficient β | 0 |
| Reward bonuses | correction 0, copy-penalty 0, corrupt-penalty 0 |
| Generations per prompt | 16 |
| Per-device batch | 1 |
| Gradient accumulation | 4 → 4 problems × 16 = 64 completions/step |
| Learning rate | 5e-6, **constant** schedule |
| Adam β₂ | 0.99 |
| Max completion length | 4096 |
| Max sequence length \* | 8192 |
| Max prompt tokens \* | — (disabled, no truncation; longest prompt 3,317 tok < 8,192 − 4,096, so the 4,096 max completion length is respected) |
| Max steps | 2222 |
| **Released checkpoint** | **global step 2000** (epoch 0.900) |
| Random seed | 42 |
\* **Length budgets across all four variants:**
| Variant | max-seq-length | max-completion | max-prompt-tokens |
|---|---|---|---|
| mismatched-wrong | 7168 | 4096 | 3072 |
| matched-wrong | 7168 | 4096 | 3072 |
| no-draft | 7168 | 4096 | disabled (equivalent to 3,072, as all prompts are short) |
| **mismatched-correct** | 8192 | 4096 | disabled |
For a strict apples-to-apples comparison, `mismatched-correct` should have used `--max-seq-length 7168` and `--max-prompt-tokens 3072` like the other three variants; running with 8,192 and no prompt cap was an oversight. The effect should be negligible: only **6 of 8,888** prompts exceed 3,072 tokens (longest 3,317), so for the other 8,882 the run is identical to a 7,168 / 3,072 setup. For those 6, the prompt is left untruncated, the 4,096 max-completion length is still respected, and the full sequence runs only slightly past 7,168 (at most 3,317 + 4,096 = 7,413, well under 8,192). To train a strictly comparable version yourself, change `--max-seq-length 8192` to `7168` and add `--max-prompt-tokens 3072`.
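The length and step arithmetic quoted above can be re-derived with a few lines of Python. This is only a sanity-check sketch built from the numbers on this card; the helper `budget` is hypothetical, and it assumes the prompt cap simply truncates a prompt to `max_prompt_tokens`.

```python
MAX_COMPLETION = 4096   # --max-completion-length
LONGEST_PROMPT = 3317   # longest prompt in the mismatched_correct config
NUM_PROBLEMS = 8888     # training problems
PROBLEMS_PER_STEP = 4   # from the table: 4 problems x 16 generations per step


def budget(max_seq_length, max_prompt_tokens=None):
    """Return (longest prompt kept, worst-case total length, fits in max_seq_length)."""
    prompt = LONGEST_PROMPT
    if max_prompt_tokens is not None:
        prompt = min(prompt, max_prompt_tokens)  # assumed: the cap truncates the prompt
    total = prompt + MAX_COMPLETION
    return prompt, total, total <= max_seq_length


released = budget(8192)        # this adapter: no prompt cap
strict = budget(7168, 3072)    # settings of the other three variants
print(released)                # (3317, 7413, True)
print(strict)                  # (3072, 7168, True)

# Epoch fraction at the released checkpoint and at the final step.
epoch_at_release = round(2000 * PROBLEMS_PER_STEP / NUM_PROBLEMS, 3)
print(epoch_at_release)                      # 0.9
print(2222 * PROBLEMS_PER_STEP == NUM_PROBLEMS)  # True: 2222 steps is one full epoch
```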
## Files
- `adapter_model.safetensors`, `adapter_config.json` — the LoRA adapter (load with PEFT on the base model)
- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`, `special_tokens_map.json` — tokenizer
- `chat_template.jinja` — Mathstral's `[INST]` template (see the out-of-distribution note above)
## Citation
```bibtex
@article{deng2026mismatched,
title = {Weak-to-Strong Elicitation via Mismatched Wrong Drafts},
author = {Deng, Wei},
journal = {arXiv preprint arXiv:2605.17314},
year = {2026},
url = {https://arxiv.org/abs/2605.17314}
}
```
## License
Apache-2.0. The base model (`Mathstral-7B-v0.1`) and the draft model (`Qwen2.5-Math-1.5B`) are both Apache-2.0.