	---
	license: apache-2.0
	base_model: mistralai/Mathstral-7B-v0.1
	library_name: peft
	pipeline_tag: text-generation
	tags:
	- lora
	- peft
	- grpo
	- dr-grpo
	- mathematical-reasoning
	- math
	datasets:
	- hugruby/mismatched-wrong-drafts
	language:
	- en
	---

	# Mathstral-7B · No draft (GRPO baseline)

	On-policy Dr. GRPO baseline — no draft injected.

	LoRA adapter for `mistralai/Mathstral-7B-v0.1`, trained with Dr. GRPO as the No draft (GRPO baseline) condition in "Weak-to-Strong Elicitation via Mismatched Wrong Drafts" (Wei Deng, [arXiv:2605.17314](https://arxiv.org/abs/2605.17314)).

	- Base model: `mistralai/Mathstral-7B-v0.1` (Apache-2.0)
	- Adapter: LoRA r=16, α=32 (167 MB), released at global step 2000
	- Training data: config `no_draft` of [`hugruby/mismatched-wrong-drafts`](https://huggingface.co/datasets/hugruby/mismatched-wrong-drafts) — 8,888 Level 3–5 MATH problems (MATH-500 held out)
	- License: Apache-2.0

	## How to use

	This is a LoRA adapter — load it on top of the base model.

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	BASE = "mistralai/Mathstral-7B-v0.1"
	ADAPTER = "hugruby/mathstral-7b-grpo-no-draft"

	tok = AutoTokenizer.from_pretrained(ADAPTER)
	model = PeftModel.from_pretrained(
	    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto"),
	    ADAPTER,
	).eval()

	problem = "If $x+y=6$ and $xy=5$, find $x^2+y^2$."
	gen = dict(max_new_tokens=4096, do_sample=False)

	# CANONICAL — the plain draft-free prompt the model was trained and evaluated on (no [INST]):
	PROMPT = (
	    "Problem: " + problem + "\n\n"
	    "Thinking: N/A\n\n"
	    "The thinking section may contain errors. Solve the math problem step by step. "
	    "Write your own correct solution. Put your final answer within \\boxed{}.\n\n"
	    "Correct Solution:"
	)
	ids = tok(PROMPT, return_tensors="pt").to(model.device)
	# Unpack both the tokenized inputs and the generation kwargs, then decode only the new tokens.
	out = model.generate(**ids, **gen)
	print(tok.decode(out[0][ids.input_ids.shape[1]:], skip_special_tokens=True))
	```
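The prompt asks for the final answer inside `\boxed{}`, so a small parser is usually the next step after decoding. The sketch below is one way to pull out the last boxed expression while respecting nested braces. `extract_last_boxed` is a hypothetical helper written for this card, not part of the released code or of the paper's grader.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Hypothetical helper for illustration; it only balances braces and
    does not normalize or check the mathematical content.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None

    depth = 1            # we are inside the opening brace of \boxed{
    collected = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(collected)
        collected.append(ch)
    return None          # braces never closed


answer = extract_last_boxed("so $x^2+y^2 = 36 - 10 = \\boxed{26}$.")
nested = extract_last_boxed("Hence the ratio is \\boxed{\\frac{1}{2}}.")
missing = extract_last_boxed("no final answer here")
```

Taking the last occurrence rather than the first is a design choice: long step-by-step solutions sometimes box intermediate values before the final one.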

	### Optional: the `[INST]` chat format (out-of-distribution)

	The shipped `chat_template.jinja` is Mathstral's original `[INST]` chat template. This adapter was not trained in that format, so `apply_chat_template(...)` is out-of-distribution and generally underperforms the plain prompt above — it is included only so you can A/B both:

	```python
	ids = tok.apply_chat_template(
	    [{"role": "user",
	      "content": problem + "\n\nPlease reason step by step, and put your final answer within \\boxed{}."}],
	    add_generation_prompt=True, return_tensors="pt").to(model.device)
	print(tok.decode(model.generate(ids, **gen)[0][ids.shape[1]:], skip_special_tokens=True))
	```

	## How it was trained

	Trained with Dr. GRPO (`loss_type=dr_grpo`, `scale_rewards=False`) using TRL `GRPOTrainer` on top of Unsloth `FastLanguageModel`, on the `no_draft` data config. The reward is the binary `mathematically_quasi_correct` check alone: the correction-bonus, copy-penalty, and corrupt-penalty terms are all set to 0.
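To make the effect of `scale_rewards=False` with a binary reward concrete, here is a minimal sketch of the group-relative advantage: each completion's reward minus the mean reward of its group, with no division by the group standard deviation. This is an illustration under that assumption, not the TRL implementation; `group_advantages` is a hypothetical name.

```python
def group_advantages(rewards):
    """Mean-centred advantages for one prompt's group of completions.

    Illustrative only: subtracts the group mean and deliberately skips
    the standard-deviation scaling, mirroring scale_rewards=False.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# One prompt, 16 sampled completions, 4 of them judged correct (reward 1.0).
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_advantages(rewards)

# A group where every completion gets the same reward carries no signal.
flat = group_advantages([0.0] * 16)
```

With 4 of 16 completions correct, the correct ones receive an advantage of 0.75 and the incorrect ones −0.25. A group that is all correct or all wrong yields zero advantages, so that prompt contributes no gradient on that step.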

	Training command:

	```bash
	python scripts/train.py \
	--model mistralai/Mathstral-7B-v0.1 \
	--dataset-path data/no_draft \
	--output-dir outputs/no_draft \
	--max-steps 2222 \
	--gradient-accumulation-steps 4 \
	--max-completion-length 4096 \
	--max-seq-length 7168 \
	--learning-rate 5e-6 --lr-scheduler-type constant \
	--beta 0 \
	--correction-bonus 0.0 --copy-penalty 0.0 --corrupt-penalty 0.0 \
	--adam-beta2 0.99 \
	--save-steps 50 --gpu-mem-util 0.5
	```

	| Hyperparameter | Value |
	|---|---|
	| Base model | `mistralai/Mathstral-7B-v0.1` |
	| Method | Dr. GRPO (`loss_type=dr_grpo`, `scale_rewards=False`) |
	| LoRA rank / alpha | r = 16, α = 32 → scaling γ = α/r = 2 |
	| LoRA targets / dropout | `q,k,v,o,gate,up,down` (7 projections) / 0.0 |
	| KL coefficient β | 0 |
	| Reward bonuses | correction 0, copy-penalty 0, corrupt-penalty 0 |
	| Generations per prompt | 16 |
	| Per-device batch | 1 |
	| Gradient accumulation | 4 → 4 problems × 16 = 64 completions/step |
	| Learning rate | 5e-6, constant schedule |
	| Adam β₂ | 0.99 |
	| Max completion length | 4096 |
	| Max sequence length | 7168 |
	| Max prompt tokens | Disabled (no truncation). With no drafts the prompts are short: the longest is 1,899 tokens, well under the implicit budget of 7,168 − 4,096 = 3,072, so disabling the limit is equivalent to setting it to 3,072. |
	| Max steps | 2222 |
	| Released checkpoint | global step 2000 (epoch 0.900) |
	| Random seed | 42 |
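The derived figures in the table (completions per step, the epoch of the released checkpoint, and the prompt-token budget) follow from the listed values. The short check below recomputes them; it assumes, as the table's "4 problems × 16" implies, that one optimizer step consumes 4 problems. The variable names are illustrative and do not come from the training script.

```python
# Values copied from the table and the training-data bullet above.
NUM_PROBLEMS = 8888
GENERATIONS_PER_PROMPT = 16
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 4
MAX_SEQ_LEN = 7168
MAX_COMPLETION_LEN = 4096
LONGEST_PROMPT = 1899
RELEASED_STEP = 2000

# Assumption: problems per optimizer step = per-device batch x accumulation.
problems_per_step = PER_DEVICE_BATCH * GRAD_ACCUM
completions_per_step = problems_per_step * GENERATIONS_PER_PROMPT

# One pass over the data takes this many steps, which matches --max-steps.
steps_per_epoch = NUM_PROBLEMS // problems_per_step
released_epoch = RELEASED_STEP / steps_per_epoch

# Tokens left for the prompt once the completion budget is reserved.
prompt_budget = MAX_SEQ_LEN - MAX_COMPLETION_LEN

print(completions_per_step, steps_per_epoch, round(released_epoch, 3), prompt_budget)
```

So `--max-steps 2222` corresponds to exactly one pass over the 8,888 problems, and the released step-2000 checkpoint sits about 90% of the way through that pass.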

	## Files

	- `adapter_model.safetensors`, `adapter_config.json` — the LoRA adapter (load with PEFT on the base model)
	- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`, `special_tokens_map.json` — tokenizer
	- `chat_template.jinja` — Mathstral's `[INST]` template (see the out-of-distribution note above)

	## Citation

	```bibtex
	@article{deng2026mismatched,
	title = {Weak-to-Strong Elicitation via Mismatched Wrong Drafts},
	author = {Deng, Wei},
	journal = {arXiv preprint arXiv:2605.17314},
	year = {2026},
	url = {https://arxiv.org/abs/2605.17314}
	}
	```

	## License

	Apache-2.0. The base model (`Mathstral-7B-v0.1`) and the draft model (`Qwen2.5-Math-1.5B`) are both Apache-2.0.