---
license: apache-2.0
base_model: mistralai/Mathstral-7B-v0.1
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- peft
- grpo
- dr-grpo
- mathematical-reasoning
- math
datasets:
- hugruby/mismatched-wrong-drafts
language:
- en
---

# Mathstral-7B · Mismatched × Wrong drafts

**Headline model** — the mismatched-wrong configuration from the paper.

LoRA adapter for **`mistralai/Mathstral-7B-v0.1`**, trained with **Dr. GRPO** as the **Mismatched × Wrong drafts** condition in *"Weak-to-Strong Elicitation via Mismatched Wrong Drafts"* (Wei Deng, [arXiv:2605.17314](https://arxiv.org/abs/2605.17314)).

- **Base model:** `mistralai/Mathstral-7B-v0.1` (Apache-2.0)
- **Draft model:** `Qwen/Qwen2.5-Math-1.5B` (writes the training-time draft)
- **Adapter:** LoRA r=16, α=32 (167 MB), released at **global step 2000**
- **Training data:** config `mismatched_wrong` of [`hugruby/mismatched-wrong-drafts`](https://huggingface.co/datasets/hugruby/mismatched-wrong-drafts) — 8,888 Level 3–5 MATH problems (**MATH-500 held out**)
- **License:** Apache-2.0

## How to use

This is a **LoRA adapter** — load it on top of the base model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE    = "mistralai/Mathstral-7B-v0.1"
ADAPTER = "hugruby/mathstral-7b-mismatched-wrong-drafts"

tok   = AutoTokenizer.from_pretrained(ADAPTER)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto"),
    ADAPTER,
).eval()

problem = "If $x+y=6$ and $xy=5$, find $x^2+y^2$."
gen = dict(max_new_tokens=4096, do_sample=False)

# CANONICAL — the plain draft-free prompt the model was trained and evaluated on (no [INST]):
PROMPT = (
    "Problem: " + problem + "\n\n"
    "Thinking: N/A\n\n"
    "The thinking section may contain errors. Solve the math problem step by step. "
    "Write your own correct solution. Put your final answer within \\boxed{}.\n\n"
    "Correct Solution:"
)
ids = tok(PROMPT, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, **gen)[0][ids.input_ids.shape[1]:], skip_special_tokens=True))
```
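
For batch evaluation it can help to wrap the canonical prompt in a function and to pull the final answer out of a completion. The sketch below is not part of the released code: `build_prompt` and `extract_boxed` are hypothetical helper names, and the optional `draft` argument assumes that a draft, when one is supplied, takes the place of `N/A` after `Thinking:`, which this card does not state explicitly.

```python
from typing import Optional


def build_prompt(problem: str, draft: str = "N/A") -> str:
    """Format a problem with the plain (no [INST]) prompt shown above.

    Assumption: a draft, when available, replaces "N/A" after "Thinking:".
    """
    return (
        "Problem: " + problem + "\n\n"
        "Thinking: " + draft + "\n\n"
        "The thinking section may contain errors. Solve the math problem step by step. "
        "Write your own correct solution. Put your final answer within \\boxed{}.\n\n"
        "Correct Solution:"
    )


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, out = 1, []
    # Walk forward, tracking brace depth so nested groups such as \frac{1}{2} survive.
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer
```

With the default `draft`, `build_prompt(problem)` reproduces the `PROMPT` string above. `extract_boxed` only returns the raw string inside the final `\boxed{}`; comparing it with a reference answer still needs a math-aware equivalence check, which this sketch does not attempt.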

### Optional: the `[INST]` chat format (out-of-distribution)

The shipped `chat_template.jinja` is Mathstral's original `[INST]` chat template. This adapter was **not** trained in that format, so `apply_chat_template(...)` is **out-of-distribution** and generally underperforms the plain prompt above — it is included only so you can A/B both:

```python
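# Reuses `tok`, `model`, `problem`, and `gen` from the snippet above.
# return_dict=True makes the call return a dict-like BatchEncoding (input_ids plus attention_mask).
chat = tok.apply_chat_template(
    [{"role": "user",
      "content": problem + "\n\nPlease reason step by step, and put your final answer within \\boxed{}."}],
    add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**chat, **gen)[0]
print(tok.decode(out[chat["input_ids"].shape[1]:], skip_special_tokens=True))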
```

## How it was trained

Trained with **Dr. GRPO** (`loss_type=dr_grpo`, `scale_rewards=False`) using TRL `GRPOTrainer` on top of Unsloth `FastLanguageModel`, on the `mismatched_wrong` data config. The reward is the binary `mathematically_quasi_correct` signal alone: the correction-bonus, copy-penalty, and corrupt-penalty terms are all set to **0**.
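
To make the reward setting concrete, the sketch below shows how group-relative advantages come out for a binary reward when `scale_rewards=False`: each completion's advantage is its reward minus the mean reward of its group, with no division by the group's standard deviation. This is a simplified illustration, not the TRL implementation; `group_advantages` is a hypothetical name, and the group size of 16 matches the generations-per-prompt setting of this run.

```python
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered advantages for one prompt's group of completions.

    No std normalization is applied (the scale_rewards=False setting).
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# One prompt, 16 sampled completions, 4 of them judged correct (reward 1).
rewards = [1.0] * 4 + [0.0] * 12
adv = group_advantages(rewards)
print(adv[0], adv[-1])  # 0.75 -0.25

# If every completion gets the same reward, all advantages are zero
# and the prompt contributes no gradient signal.
print(group_advantages([0.0] * 16)[0])  # 0.0
```

Because the advantages are not rescaled, a group with a single correct completion out of 16 and a group with eight correct completions produce advantages of different magnitude, rather than being normalized to the same spread.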

Training command:

```bash
python scripts/train.py \
  --model mistralai/Mathstral-7B-v0.1 \
  --dataset-path data/mismatched_wrong \
  --output-dir outputs/mismatched_wrong \
  --max-steps 2222 \
  --gradient-accumulation-steps 4 \
  --max-completion-length 4096 \
  --max-seq-length 7168 \
  --max-prompt-tokens 3072 \
  --learning-rate 5e-6 --lr-scheduler-type constant \
  --beta 0 \
  --correction-bonus 0.0 --copy-penalty 0.0 --corrupt-penalty 0.0 \
  --adam-beta2 0.99 \
  --save-steps 50 --gpu-mem-util 0.5
```

| Hyperparameter | Value |
|---|---|
| Base model | `mistralai/Mathstral-7B-v0.1` |
| Method | Dr. GRPO (`loss_type=dr_grpo`, `scale_rewards=False`) |
| LoRA rank / alpha | **r = 16, α = 32** → scaling **γ = α/r = 2** |
| LoRA targets / dropout | `q,k,v,o,gate,up,down` (7 projections) / 0.0 |
| KL coefficient β | 0 |
| Reward bonuses | correction 0, copy-penalty 0, corrupt-penalty 0 |
| Generations per prompt | 16 |
| Per-device batch | 1 |
| Gradient accumulation | 4 → 4 problems × 16 = 64 completions/step |
| Learning rate | 5e-6, **constant** schedule |
| Adam β₂ | 0.99 |
| Max completion length | 4096 |
| Max sequence length | 7168 |
| Max prompt tokens | 3072 |
| Max steps | 2222 |
| **Released checkpoint** | **global step 2000** (epoch 0.900) |
| Random seed | 42 |
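
The derived figures in the table can be cross-checked with a few lines of arithmetic. The sketch below assumes a single training process in which the per-device batch is counted in problems (so one optimizer step consumes `per_device_batch * grad_accum` problems) and takes the 8,888-problem dataset size from the top of this card; the variable names are illustrative only.

```python
# Values copied from the hyperparameter table and the dataset description.
lora_r, lora_alpha = 16, 32
per_device_batch = 1
grad_accum = 4
generations_per_prompt = 16
dataset_size = 8_888
max_steps = 2_222
released_step = 2_000

lora_scaling = lora_alpha / lora_r                                  # alpha / r
problems_per_step = per_device_batch * grad_accum                   # problems per optimizer step
completions_per_step = problems_per_step * generations_per_prompt   # sampled completions per step
released_epoch = released_step * problems_per_step / dataset_size   # fraction of one pass at release
full_run_epochs = max_steps * problems_per_step / dataset_size      # passes over the data at max steps

print(lora_scaling, problems_per_step, completions_per_step)  # 2.0 4 64
print(f"{released_epoch:.3f}", f"{full_run_epochs:.3f}")  # 0.900 1.000
```

Under these assumptions the 2,222-step budget corresponds to exactly one pass over the data, and the released step-2000 checkpoint sits at about 90% of it.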

## Files

- `adapter_model.safetensors`, `adapter_config.json` — the LoRA adapter (load with PEFT on the base model)
- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`, `special_tokens_map.json` — tokenizer
- `chat_template.jinja` — Mathstral's `[INST]` template (see the out-of-distribution note above)

## Citation

```bibtex
@article{deng2026mismatched,
  title   = {Weak-to-Strong Elicitation via Mismatched Wrong Drafts},
  author  = {Deng, Wei},
  journal = {arXiv preprint arXiv:2605.17314},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.17314}
}
```

## License

Apache-2.0. The base model (`Mathstral-7B-v0.1`) and the draft model (`Qwen2.5-Math-1.5B`) are both Apache-2.0.