---
license: apache-2.0
base_model: mistralai/Mathstral-7B-v0.1
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- peft
- grpo
- dr-grpo
- mathematical-reasoning
- math
datasets:
- hugruby/mismatched-wrong-drafts
language:
- en
---

# Mathstral-7B · No draft (GRPO baseline)

On-policy Dr. GRPO baseline with no draft injected.

LoRA adapter for **`mistralai/Mathstral-7B-v0.1`**, trained with **Dr. GRPO** as the **No draft (GRPO baseline)** condition in *"Weak-to-Strong Elicitation via Mismatched Wrong Drafts"* (Wei Deng, [arXiv:2605.17314](https://arxiv.org/abs/2605.17314)).

- **Base model:** `mistralai/Mathstral-7B-v0.1` (Apache-2.0)
- **Adapter:** LoRA r=16, α=32 (167 MB), released at **global step 2000**
- **Training data:** config `no_draft` of [`hugruby/mismatched-wrong-drafts`](https://huggingface.co/datasets/hugruby/mismatched-wrong-drafts): 8,888 Level 3–5 MATH problems (**MATH-500 held out**)
- **License:** Apache-2.0

## How to use

This is a **LoRA adapter**; load it on top of the base model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "mistralai/Mathstral-7B-v0.1"
ADAPTER = "hugruby/mathstral-7b-grpo-no-draft"

# The tokenizer ships with the adapter repo; the weights come from the base model.
tok = AutoTokenizer.from_pretrained(ADAPTER)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto"),
    ADAPTER,
).eval()

problem = "If $x+y=6$ and $xy=5$, find $x^2+y^2$."
gen = dict(max_new_tokens=4096, do_sample=False)  # greedy decoding

# CANONICAL: the plain draft-free prompt the model was trained and evaluated on (no [INST]):
PROMPT = (
    "Problem: " + problem + "\n\n"
    "Thinking: N/A\n\n"
    "The thinking section may contain errors. Solve the math problem step by step. "
    "Write your own correct solution. Put your final answer within \\boxed{}.\n\n"
    "Correct Solution:"
)
ids = tok(PROMPT, return_tensors="pt").to(model.device)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(model.generate(**ids, **gen)[0][ids.input_ids.shape[1]:], skip_special_tokens=True))
```

### Optional: the `[INST]` chat format (out-of-distribution)

The shipped `chat_template.jinja` is Mathstral's original `[INST]` chat template. This adapter was **not** trained in that format, so `apply_chat_template(...)` is **out-of-distribution** and generally underperforms the plain prompt above. It is included only so you can A/B both:

```python
ids = tok.apply_chat_template(
    [{"role": "user", "content": problem + "\n\nPlease reason step by step, and put your final answer within \\boxed{}."}],
    add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(ids, **gen)[0][ids.shape[1]:], skip_special_tokens=True))
```

## How it was trained

Trained with **Dr. GRPO** (`loss_type=dr_grpo`, `scale_rewards=False`) using TRL `GRPOTrainer` on top of Unsloth `FastLanguageModel`, on the `no_draft` data config. The reward is the binary `mathematically_quasi_correct` check. The correction-bonus, copy-penalty, and corrupt-penalty terms are all **0**, so the reward is purely binary.

Training command:

```bash
python scripts/train.py \
  --model mistralai/Mathstral-7B-v0.1 \
  --dataset-path data/no_draft \
  --output-dir outputs/no_draft \
  --max-steps 2222 \
  --gradient-accumulation-steps 4 \
  --max-completion-length 4096 \
  --max-seq-length 7168 \
  --learning-rate 5e-6 --lr-scheduler-type constant \
  --beta 0 \
  --correction-bonus 0.0 --copy-penalty 0.0 --corrupt-penalty 0.0 \
  --adam-beta2 0.99 \
  --save-steps 50 --gpu-mem-util 0.5
```

| Hyperparameter | Value |
|---|---|
| Base model | `mistralai/Mathstral-7B-v0.1` |
| Method | Dr. GRPO (`loss_type=dr_grpo`, `scale_rewards=False`) |
| LoRA rank / alpha | **r = 16, α = 32** → scaling **γ = α/r = 2** |
| LoRA targets / dropout | `q,k,v,o,gate,up,down` (7 projections) / 0.0 |
| KL coefficient β | 0 |
| Reward bonuses | correction 0, copy-penalty 0, corrupt-penalty 0 |
| Generations per prompt | 16 |
| Per-device batch | 1 |
| Gradient accumulation | 4 → 4 problems × 16 = 64 completions/step |
| Learning rate | 5e-6, **constant** schedule |
| Adam β₂ | 0.99 |
| Max completion length | 4096 |
| Max sequence length | 7168 |
| Max prompt tokens | Disabled (no truncation). With no drafts, prompts are short: the longest is 1,899 tokens, well under the implicit budget of 3,072 = 7,168 − 4,096, so disabling the limit is equivalent to setting it to 3,072. |
| Max steps | 2222 |
| **Released checkpoint** | **global step 2000** (epoch 0.900) |
| Random seed | 42 |

## Files

- `adapter_model.safetensors`, `adapter_config.json`: the LoRA adapter (load with PEFT on the base model)
- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`, `special_tokens_map.json`: tokenizer
- `chat_template.jinja`: Mathstral's `[INST]` template (see the out-of-distribution note above)

## Citation

```bibtex
@article{deng2026mismatched,
  title   = {Weak-to-Strong Elicitation via Mismatched Wrong Drafts},
  author  = {Deng, Wei},
  journal = {arXiv preprint arXiv:2605.17314},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.17314}
}
```

## License

Apache-2.0. The base model (`Mathstral-7B-v0.1`) and the draft model (`Qwen2.5-Math-1.5B`) are both Apache-2.0.
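
## Note: group-centered advantages with `scale_rewards=False`

The training section lists a binary reward, 16 generations per prompt, and `scale_rewards=False`. As a rough illustration of what that combination implies, the sketch below centers each prompt group's rewards on the group mean and does not divide by the group standard deviation. The function name `group_centered_advantages` and the example reward vector are hypothetical, chosen only for illustration; this is not the TRL or training-script code, and it leaves out the loss-side changes of Dr. GRPO.

```python
from typing import List


def group_centered_advantages(rewards: List[float], group_size: int) -> List[float]:
    """Subtract each prompt group's mean reward; no std scaling is applied."""
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("rewards must split evenly into groups of group_size")
    advantages: List[float] = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = sum(group) / group_size
        advantages.extend(r - mean for r in group)
    return advantages


# Hypothetical example: one prompt, 16 sampled completions, 4 judged correct.
rewards = [1.0] * 4 + [0.0] * 12
adv = group_centered_advantages(rewards, group_size=16)
print(adv[0], adv[-1])  # 0.75 -0.25
```

Under this scheme a group whose completions are all correct or all wrong gets zero advantage everywhere, so only prompts with mixed outcomes contribute a learning signal.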