	---
	library_name: transformers
	tags: []
	---

	# Model Card for _Qwen2.5-0.5B-Instruct (Fine-Tuned on OpenR1-Math-220k; 2% run complete, 50% run underway as of Feb 13th)_

	## Model Details

	- Model Name: Qwen2.5-0.5B-Instruct (GRPO Fine-Tuned)
	- Model ID: `_Qwen2.5-0.5B-R1subset_`
	- License: [Apache 2.0 / or whichever applies]
	- Finetuned From: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
	- Language(s): English (mathematical text)
	- Developed By: Christian H. Cooper
	- Funding: Self-sponsored
	- Shared By: Christian H. Cooper

	### Model Description

	This model is Qwen2.5-0.5B-Instruct fine-tuned on a 2% subset of the [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) dataset with Group Relative Policy Optimization (GRPO) from the `trl` library. Training guides the model toward well-formatted chain-of-thought answers of the form:

	```
	<reasoning>
	...
	</reasoning>
	<answer>
	...
	</answer>
	```

	The model targets math reasoning tasks: it learns to generate a step-by-step solution (`<reasoning>`) followed by a numeric or textual final answer (`<answer>`). The reward functions encourage correct chain-of-thought structure, numeric answers, and answer correctness.
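
As an illustration of the output format, here is a minimal sketch of how a completion could be split into its two blocks. The helper name `parse_completion` and the regular expression are assumptions made for this example; they are not part of the released model or its training code.

```python
import re

# Hypothetical helper for illustration; not taken from the training code.
_BLOCKS = re.compile(
    r"<reasoning>\s*(?P<reasoning>.*?)\s*</reasoning>\s*"
    r"<answer>\s*(?P<answer>.*?)\s*</answer>",
    re.DOTALL,  # let "." span newlines inside each block
)


def parse_completion(text):
    """Return (reasoning, answer) if both blocks are present, else None."""
    match = _BLOCKS.search(text)
    if match is None:
        return None
    return match.group("reasoning"), match.group("answer")


sample = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>"
print(parse_completion(sample))
```

Returning `None` for malformed output keeps the caller in charge of deciding whether a missing block is an error or simply a low-quality sample.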

	### Model Sources
	- GitHub or Repo: [Pending]
	- Paper/Demo: [Pending]

	## Uses

	### Direct Use
	- Math Problem Solving: The model tries to reason through math word problems, providing step-by-step reasoning and a final answer.

	### Downstream Use
	- Educational Tools: Potentially used in tutoring or step-by-step solution generation.
	- Math Chatbots: A math helper that can respond in a structured `<reasoning>/<answer>` format.

	### Out-of-Scope Use
	- High-Stakes Decisions: Model is not guaranteed to be correct for advanced or critical math scenarios (finance, medical, engineering safety).
	- Non-English: Primary training data is English math text, so reliability in other languages is minimal.

	## Bias, Risks, and Limitations

	- Bias: Although this is a math-focused dataset, any language model can exhibit unintended biases.
	- Risks: The model may produce mathematically incorrect or incomplete solutions. The partial coverage (2% of the dataset) further limits accuracy.
	- Limitations:
	  - Only partially fine-tuned on 2% of the data, so correctness is not guaranteed.
	  - The chain-of-thought is for interpretability but may still contain flawed reasoning or leaps.

	## How to Get Started

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	model_name = "HarleyCooper/Qwen.5B-OpenR1Math"  # The name stays the same through all % iterations.
	device = "cuda" if torch.cuda.is_available() else "cpu"  # Fall back to CPU when no GPU is present.
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

	prompt = """<reasoning>
	Question: It is known that in a convex $n$-gon ($n>3$) no three diagonals pass through the same point.

	Find the number of points (distinct from the vertices) of intersection of pairs of diagonals.
	</reasoning>
	<answer>
	"""

	inputs = tokenizer(prompt, return_tensors="pt").to(device)
	outputs = model.generate(**inputs, max_new_tokens=2000)
	# The decoded text contains the prompt followed by the model's continuation.
	answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(answer)
	```
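
Because the decoded string includes the prompt and may be cut off by the token limit, it can help to pull out only the final answer. The sketch below is one tolerant way to do that; `extract_final_answer` is a hypothetical helper, and the `decoded` string is a made-up example rather than real model output.

```python
def extract_final_answer(decoded):
    """Return the text after the last <answer> tag, up to </answer> if present.

    If the text contains no <answer> tag at all, the whole string is
    returned (stripped), so callers may want to check for the tag first.
    """
    tail = decoded.split("<answer>")[-1]
    return tail.split("</answer>")[0].strip()


# Made-up decoded output for illustration only.
decoded = "<reasoning>\nQuestion: What is 6 * 7?\n</reasoning>\n<answer>\n42\n</answer>"
print(extract_final_answer(decoded))

# A truncated generation without a closing tag still yields the partial answer.
print(extract_final_answer("<answer>\n7"))
```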

	## Training Details

	### Training Data

	- Dataset: A 2% subsample (~4.4k problems) of [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k).
	- Data Format: Each sample has `problem`, `solution`, `answer`. We transform them into:
	  - `"prompt"`: A single string containing system instructions + the problem text.
	  - `"answer"`: A string with `<reasoning>` + `<answer>` blocks.
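
The transformation described above might look roughly like the following sketch. The system prompt text and the function name `to_training_example` are assumptions; the card does not state the exact instructions used during training.

```python
# Hypothetical system prompt; the actual wording used in training is not given.
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)


def to_training_example(sample):
    """Map a raw sample (problem, solution, answer) to prompt/answer strings."""
    prompt = f"{SYSTEM_PROMPT}\n\n{sample['problem'].strip()}"
    answer = (
        f"<reasoning>\n{sample['solution'].strip()}\n</reasoning>\n"
        f"<answer>\n{sample['answer'].strip()}\n</answer>"
    )
    return {"prompt": prompt, "answer": answer}


raw = {"problem": " What is 2 + 2? ", "solution": "2 + 2 = 4", "answer": "4"}
example = to_training_example(raw)
print(example["prompt"])
print(example["answer"])
```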

	### Training Procedure

	- Framework: [TRL](https://github.com/lvwerra/trl) (v0.14 or later, which provides `GRPOTrainer`) with Group Relative Policy Optimization (GRPO).
	- Objective: Reinforcement learning on chain-of-thought format, numeric correctness, and final-answer consistency.
	- Reward Functions:
	  1. `xmlcount_reward_func`: Encourages the `<reasoning>`/`<answer>` structure.
	  2. `soft_format_reward_func`: Checks for a `<reasoning>...</reasoning>` block followed by an `<answer>...</answer>` block, in any multiline arrangement.
	  3. `strict_format_reward_func`: Applies a strict multiline regex that requires the exact formatting.
	  4. `int_reward_func`: Gives a partial reward if the final `<answer>` is purely numeric.
	  5. `correctness_reward_func`: Gives a binary reward if the extracted final answer exactly matches the known correct answer.
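
To make the reward descriptions concrete, here is a simplified sketch of three of them operating on plain completion strings. The function names, the regular expression, and the reward magnitudes (0.5 and 1.0) are assumptions chosen for illustration; the card does not specify the exact patterns or weights used in training.

```python
import re

# Assumed "soft" pattern: both blocks present, in order, any whitespace between.
SOFT_PATTERN = re.compile(
    r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL
)


def extract_answer(text):
    """Text after the last <answer> tag, up to </answer> if present."""
    return text.split("<answer>")[-1].split("</answer>")[0].strip()


def soft_format_reward(completions):
    """Partial reward when both blocks appear in the expected order."""
    return [0.5 if SOFT_PATTERN.search(c) else 0.0 for c in completions]


def int_reward(completions):
    """Partial reward when the extracted answer consists only of digits."""
    return [0.5 if extract_answer(c).isdigit() else 0.0 for c in completions]


def correctness_reward(completions, answers):
    """Binary reward for an exact match with the reference answer."""
    return [
        1.0 if extract_answer(c) == a else 0.0
        for c, a in zip(completions, answers)
    ]


good = "<reasoning>\n1 + 1 = 2\n</reasoning>\n<answer>\n2\n</answer>"
bad = "The answer is 2"
print(soft_format_reward([good, bad]))
print(int_reward([good, bad]))
print(correctness_reward([good, bad], ["2", "2"]))
```

Keeping each reward in its own function means format and correctness can be weighted or logged separately, which matches how the metrics are reported later in this card.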

	#### Training Hyperparameters

	- Base Model: Qwen2.5-0.5B-Instruct
	- Learning Rate: ~5e-6
	- Batch Size: 1–2 (due to GPU constraints)
	- Optimizer: AdamW (β1=0.9, β2=0.99)
	- Scheduler: Cosine with warmup_ratio=0.1
	- Num Generations: 16 (GRPO config)
	- Number of Training Epochs: 1 epoch on 2% data
	- Hardware: Single A100 40GB on Colab
	- Max Prompt Length: 256 tokens
	- Max Completion Length: 200 tokens
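
For reference, the values above can be collected into a plain dictionary, together with a small function that reproduces the shape of a cosine schedule with linear warmup. The key names follow the argument names I believe TRL's `GRPOConfig` uses, but treat them as an assumption and check them against your installed version. The 250-step total is a hypothetical value inside the 200–300 range reported below, and the batch size of 1 is one of the two values the card mentions.

```python
import math

# Hyperparameters from this card; key names assumed to match GRPOConfig fields.
grpo_kwargs = {
    "learning_rate": 5e-6,
    "adam_beta1": 0.9,
    "adam_beta2": 0.99,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "num_generations": 16,
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1,  # the card reports 1-2
    "max_prompt_length": 256,
    "max_completion_length": 200,
}


def lr_at(step, total_steps=250, warmup_steps=25, base_lr=5e-6):
    """Learning rate at a step: linear warmup, then cosine decay to zero.

    total_steps=250 and warmup_steps=25 (10% of 250) are hypothetical.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 125, 250):
    print(step, lr_at(step))
```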

	### Speeds, Sizes, Times
	- Approx. Steps: 200–300 for the 2% subset
	- Run Time: Varies from ~1 to 2 hours on Colab A100

	## Evaluation

	### Testing Data
	- The model is currently trained and tested on the same 2% subset, so the reported numbers do not measure generalization. The next step is to evaluate on a held-out portion or the full set to measure true correctness.
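
One simple way to set up such a held-out evaluation is a deterministic index split, sketched below. The 10% test fraction, the seed, and the function name `split_indices` are assumptions for illustration; 4,400 is the approximate subset size stated in the Training Data section.

```python
import random


def split_indices(n_items, test_fraction=0.1, seed=0):
    """Shuffle indices with a fixed seed and split into (train, test)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, so global state is untouched
    n_test = int(n_items * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = split_indices(4400)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, so later training runs on larger fractions can keep the same problems out of the training set.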

	### Metrics
	- Format Rewards: `xmlcount`, `soft_format`, `strict_format`
	- Correctness: Exact match final numeric/string answer
	- Partial Numeric: `int_reward_func`
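
The sketch below shows how format compliance and exact-match correctness could be aggregated over a set of completions. The regular expressions, the function name `evaluate`, and the example data are assumptions for illustration, not the evaluation code behind the results reported here.

```python
import re

FORMAT = re.compile(
    r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL
)
ANSWER_BLOCK = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def evaluate(completions, references):
    """Return format compliance and exact-match rates as fractions."""
    n = len(completions)
    if n == 0:
        raise ValueError("need at least one completion")
    formatted = sum(1 for c in completions if FORMAT.search(c))
    correct = 0
    for completion, reference in zip(completions, references):
        match = ANSWER_BLOCK.search(completion)
        # Only a complete <answer> block counts toward exact match.
        if match and match.group(1) == reference.strip():
            correct += 1
    return {"format_compliance": formatted / n, "exact_match": correct / n}


completions = [
    "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>",  # formatted, correct
    "<reasoning>\n3 * 3 = 6\n</reasoning>\n<answer>\n6\n</answer>",  # formatted, wrong
    "The answer is 10.",                                              # no tags
    "<answer>5</answer>",                                             # answer block only
]
references = ["4", "9", "10", "5"]
scores = evaluate(completions, references)
print(scores)
```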

	### Results
	- The model shows a strong improvement in output format (70–80% format compliance) but relatively low exact numeric correctness. Additional epochs or a larger training fraction are needed for better correctness.

	## Environmental Impact

	- Hardware: Single A100 40GB GPU in a Colab environment
	- Train Time: ~1–2 hours on 2% data
	- Carbon Footprint: Not measured exactly, but minimal compared to large-scale runs

	## Model Architecture & Objective

	- Architecture: Transformer-based causal language model (Qwen2.5-0.5B)
	- Objective: RL-based chain-of-thought generation for math reasoning
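
The core idea behind GRPO is that each sampled completion is scored relative to the other completions generated for the same prompt. A minimal sketch of that group-relative normalization follows. The use of the population standard deviation and the `eps` value are assumptions; library implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    completions that beat their siblings get positive advantages.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, with made-up total rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every completion earns the same reward, there is no learning signal.
print(group_relative_advantages([0.5, 0.5, 0.5, 0.5]))
```

With 16 generations per prompt, as configured above, each group would contain 16 rewards instead of the 4 shown here.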

	## Citation

	```
	@misc{cooperQwen2.5-0.5B,
	title={Qwen2.5-0.5B Fine-Tuned on OpenR1 (2% subset)},
	author={Christian H. Cooper},
	howpublished={\url{https://huggingface.co/Christian-cooper-us/Qwen2.5-0.5B-R1subset}},
	year={2025},
	}
	```

	## Contact
	- Maintainers: Christian Cooper (GitHub: [@christian-cooper-us](https://huggingface.co/HarleyCooper)), others.

	---

	Disclaimer: This model is experimental, trained on only 2% of the dataset. It may produce inaccurate math solutions and is not suitable for high-stakes or time-sensitive deployments.