| license: apache-2.0 | |
| base_model: mistralai/Mathstral-7B-v0.1 | |
| library_name: peft | |
| pipeline_tag: text-generation | |
| tags: | |
| - lora | |
| - peft | |
| - grpo | |
| - dr-grpo | |
| - mathematical-reasoning | |
| - math | |
| datasets: | |
| - hugruby/mismatched-wrong-drafts | |
| language: | |
| - en | |
| # Mathstral-7B · Mismatched × Correct drafts | |
| Ablation — a *correct* draft **mismatched** to a different problem (the 2×2's fourth cell). | |
| LoRA adapter for **`mistralai/Mathstral-7B-v0.1`**, trained with **Dr. GRPO** as the **Mismatched × Correct drafts** condition in *"Weak-to-Strong Elicitation via Mismatched Wrong Drafts"* (Wei Deng, [arXiv:2605.17314](https://arxiv.org/abs/2605.17314)). | |
| - **Base model:** `mistralai/Mathstral-7B-v0.1` (Apache-2.0) | |
| - **Draft model:** `Qwen/Qwen2.5-Math-1.5B` (writes the training-time draft) | |
| - **Adapter:** LoRA r=16, α=32 (167 MB), released at **global step 2000** | |
| - **Training data:** config `mismatched_correct` of [`hugruby/mismatched-wrong-drafts`](https://huggingface.co/datasets/hugruby/mismatched-wrong-drafts) — 8,888 Level 3–5 MATH problems (**MATH-500 held out**) | |
| - **License:** Apache-2.0 | |
| ## How to use | |
| This is a **LoRA adapter** — load it on top of the base model. | |
| ```python | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| from peft import PeftModel | |
| BASE = "mistralai/Mathstral-7B-v0.1" | |
| ADAPTER = "hugruby/mathstral-7b-mismatched-correct-drafts" | |
| tok = AutoTokenizer.from_pretrained(ADAPTER) | |
| model = PeftModel.from_pretrained( | |
| AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto"), | |
| ADAPTER, | |
| ).eval() | |
| problem = "If $x+y=6$ and $xy=5$, find $x^2+y^2$." | |
| gen = dict(max_new_tokens=4096, do_sample=False) | |
| # CANONICAL — the plain draft-free prompt the model was trained and evaluated on (no [INST]): | |
| PROMPT = ( | |
| "Problem: " + problem + "\n\n" | |
| "Thinking: N/A\n\n" | |
| "The thinking section may contain errors. Solve the math problem step by step. " | |
| "Write your own correct solution. Put your final answer within \\boxed{}.\n\n" | |
| "Correct Solution:" | |
| ) | |
| ids = tok(PROMPT, return_tensors="pt").to(model.device) | |
| print(tok.decode(model.generate(**ids, **gen)[0][ids.input_ids.shape[1]:], skip_special_tokens=True)) | |
| ``` | |
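Because the prompt asks for the final answer inside `\boxed{}`, a small parser is usually enough to pull it out of the generated text. The sketch below is a hypothetical helper (`extract_last_boxed` is not part of this repository or of the paper's `mathematically_quasi_correct` grader); it only assumes that the last `\boxed{...}` in the completion holds the final answer.

```python
def extract_last_boxed(text):
    # Return the contents of the last \boxed{...} in `text`, or None if there is none.
    # Hypothetical helper for illustration; not the grader used in training.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1            # we are already inside the opening brace of \boxed{
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)   # matching close brace found
        out.append(ch)
    return None          # unbalanced braces, e.g. a truncated generation


# Hand-written sample completion (not real model output) for the example problem.
sample = "x^2 + y^2 = (x+y)^2 - 2xy = 36 - 10 = 26, so the answer is \\boxed{26}."
answer = extract_last_boxed(sample)
print(answer)
```

Brace matching is used instead of a regular expression so that nested answers such as `\boxed{\frac{1}{2}}` come back intact.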
| ### Optional: the `[INST]` chat format (out-of-distribution) | |
| The shipped `chat_template.jinja` is Mathstral's original `[INST]` chat template. This adapter was **not** trained in that format, so `apply_chat_template(...)` is **out-of-distribution** and generally underperforms the plain prompt above — it is included only so you can A/B both: | |
| ```python | |
| ids = tok.apply_chat_template( | |
| [{"role": "user", | |
| "content": problem + "\n\nPlease reason step by step, and put your final answer within \\boxed{}."}], | |
| add_generation_prompt=True, return_tensors="pt").to(model.device) | |
| print(tok.decode(model.generate(ids, **gen)[0][ids.shape[1]:], skip_special_tokens=True)) | |
| ``` | |
| ## How it was trained | |
| Trained with **Dr. GRPO** (`loss_type=dr_grpo`, `scale_rewards=False`) using TRL `GRPOTrainer` on top of Unsloth `FastLanguageModel`, on the `mismatched_correct` data config. The reward is the binary `mathematically_quasi_correct` check: the correction-bonus, copy-penalty, and corrupt-penalty terms are all **0**, so no shaping is added on top of it. | |
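As a rough illustration of what `scale_rewards=False` means for a binary reward, the sketch below computes group-relative advantages by subtracting the group mean and skipping the division by the group standard deviation. It is a simplified stand-in written for this card, not the TRL implementation: the function names are hypothetical, the group of 16 matches the generations-per-prompt setting, and the loss-side length handling of Dr. GRPO is not shown.

```python
def binary_reward(is_correct):
    # Pure 0/1 reward: all bonus and penalty terms are zero in this run.
    return 1.0 if is_correct else 0.0


def dr_grpo_advantages(rewards):
    # Group-relative advantage without std scaling (the scale_rewards=False case):
    # each completion is compared only to the mean reward of its own group.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Illustrative group: 16 completions for one problem, 4 of them judged correct.
group_rewards = [binary_reward(c) for c in [True] * 4 + [False] * 12]
advantages = dr_grpo_advantages(group_rewards)

# A group where every completion gets the same reward carries no learning signal.
flat_group = dr_grpo_advantages([binary_reward(True)] * 16)

print(advantages[0], advantages[-1], sum(advantages))
```

With 4 of 16 correct, correct completions get an advantage of 0.75 and incorrect ones -0.25; a group that is entirely right or entirely wrong yields all-zero advantages.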
| Training command: | |
| ```bash | |
| python scripts/train.py \ | |
| --model mistralai/Mathstral-7B-v0.1 \ | |
| --dataset-path data/mismatched_correct \ | |
| --output-dir outputs/mismatched_correct \ | |
| --max-steps 2222 \ | |
| --gradient-accumulation-steps 4 \ | |
| --max-completion-length 4096 \ | |
| --max-seq-length 8192 \ | |
| --learning-rate 5e-6 --lr-scheduler-type constant \ | |
| --beta 0 \ | |
| --correction-bonus 0.0 --copy-penalty 0.0 --corrupt-penalty 0.0 \ | |
| --adam-beta2 0.99 \ | |
| --save-steps 50 --gpu-mem-util 0.5 | |
| ``` | |
| | Hyperparameter | Value | | |
| |---|---| | |
| | Base model | `mistralai/Mathstral-7B-v0.1` | | |
| | Method | Dr. GRPO (`loss_type=dr_grpo`, `scale_rewards=False`) | | |
| | LoRA rank / alpha | **r = 16, α = 32** → scaling **γ = α/r = 2** | | |
| | LoRA targets / dropout | `q,k,v,o,gate,up,down` (7 projections) / 0.0 | | |
| | KL coefficient β | 0 | | |
| | Reward bonuses | correction 0, copy-penalty 0, corrupt-penalty 0 | | |
| | Generations per prompt | 16 | | |
| | Per-device batch | 1 | | |
| | Gradient accumulation | 4 → 4 problems × 16 = 64 completions/step | | |
| | Learning rate | 5e-6, **constant** schedule | | |
| | Adam β₂ | 0.99 | | |
| | Max completion length | 4096 | | |
| | Max sequence length \* | 8192 | | |
| | Max prompt tokens \* | — (disabled, no truncation; longest prompt 3,317 tok < 8,192 − 4,096, so the 4,096 max completion length is respected) | | |
| | Max steps | 2222 | | |
| | **Released checkpoint** | **global step 2000** (epoch 0.900) | | |
| | Random seed | 42 | | |
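The batch, epoch, and LoRA-scaling figures in the table above can be re-derived from the listed values. The sketch below is only that arithmetic, assuming (as the table states) that one optimizer step consumes 4 problems with 16 completions each; the variable names are illustrative.

```python
NUM_PROBLEMS = 8_888        # training problems in the mismatched_correct config
PER_DEVICE_BATCH = 1        # problems per micro-step, as listed in the table
GRAD_ACCUM = 4              # gradient accumulation steps
GENERATIONS = 16            # completions sampled per problem
RELEASED_STEP = 2_000       # global step of the released checkpoint
LORA_R, LORA_ALPHA = 16, 32

problems_per_step = PER_DEVICE_BATCH * GRAD_ACCUM
completions_per_step = problems_per_step * GENERATIONS
steps_per_epoch = NUM_PROBLEMS // problems_per_step
epoch_at_release = RELEASED_STEP / steps_per_epoch
lora_scaling = LORA_ALPHA / LORA_R

print(problems_per_step, completions_per_step, steps_per_epoch)
print(round(epoch_at_release, 3), lora_scaling)
```

One pass over the data takes 2,222 steps, which is why `--max-steps 2222` corresponds to one epoch and step 2000 to roughly epoch 0.900.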
| \* **Length budgets across all four variants:** | |
| | Variant | max-seq-length | max-completion | max-prompt-tokens | | |
| |---|---|---|---| | |
| | mismatched-wrong | 7168 | 4096 | 3072 | | |
| | matched-wrong | 7168 | 4096 | 3072 | | |
| | no-draft | 7168 | 4096 | disabled (equivalent to 3,072, as all prompts are short) | | |
| | **mismatched-correct** | 8192 | 4096 | disabled | | |
| For a strict apples-to-apples comparison, `mismatched-correct` should have used `--max-seq-length 7168` and `--max-prompt-tokens 3072` like the other three variants; running with 8,192 and no prompt cap was an oversight. The effect should be negligible: only **6 of 8,888** prompts exceed 3,072 tokens (longest 3,317), so for the other 8,882 the run is identical to a 7,168 / 3,072 setup. For those 6, the prompt is left untruncated, the 4,096 max-completion length is still respected, and the sequence runs only slightly past 7,168 (at most 3,317 + 4,096 = 7,413, well under 8,192). To train a strictly comparable version yourself, change `--max-seq-length 8192` to `7168` and add `--max-prompt-tokens 3072`. | |
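The length-budget argument above comes down to a few sums, sketched here using only the numbers quoted in this card (the variable names are illustrative and not taken from the training script):

```python
MAX_COMPLETION = 4_096       # --max-completion-length
SEQ_LEN_RELEASED = 8_192     # --max-seq-length used for this variant
SEQ_LEN_OTHERS = 7_168       # --max-seq-length used for the other three variants
PROMPT_CAP_OTHERS = 3_072    # --max-prompt-tokens used for the other variants
LONGEST_PROMPT = 3_317       # longest prompt in the training data
LONG_PROMPTS, TOTAL_PROMPTS = 6, 8_888

# Worst case for this run: the longest prompt plus a full-length completion.
worst_case_tokens = LONGEST_PROMPT + MAX_COMPLETION
fits_released_budget = worst_case_tokens <= SEQ_LEN_RELEASED

# How far the worst case exceeds what the other variants could ever see.
overshoot_vs_others = worst_case_tokens - SEQ_LEN_OTHERS
tokens_a_cap_would_cut = LONGEST_PROMPT - PROMPT_CAP_OTHERS

# The other variants' budget is exactly cap + completion length.
others_budget_is_tight = PROMPT_CAP_OTHERS + MAX_COMPLETION == SEQ_LEN_OTHERS

affected_share_pct = 100 * LONG_PROMPTS / TOTAL_PROMPTS

print(worst_case_tokens, fits_released_budget, overshoot_vs_others)
print(tokens_a_cap_would_cut, round(affected_share_pct, 4))
```

The worst case is 7,413 tokens, 245 past the 7,168 budget of the other variants, and it affects about 0.07% of the prompts.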
| ## Files | |
| - `adapter_model.safetensors`, `adapter_config.json` — the LoRA adapter (load with PEFT on the base model) | |
| - `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`, `special_tokens_map.json` — tokenizer | |
| - `chat_template.jinja` — Mathstral's `[INST]` template (see the out-of-distribution note above) | |
| ## Citation | |
| ```bibtex | |
| @article{deng2026mismatched, | |
| title = {Weak-to-Strong Elicitation via Mismatched Wrong Drafts}, | |
| author = {Deng, Wei}, | |
| journal = {arXiv preprint arXiv:2605.17314}, | |
| year = {2026}, | |
| url = {https://arxiv.org/abs/2605.17314} | |
| } | |
| ``` | |
| ## License | |
| Apache-2.0. The base model (`Mathstral-7B-v0.1`) and the draft model (`Qwen2.5-Math-1.5B`) are both Apache-2.0. | |