---
library_name: transformers
tags: []
---
# Model Card for Qwen2.5-0.5B-Instruct (Fine-Tuned on OpenR1-Math-220k; 2% complete, 50% run underway as of Feb 13)
## Model Details
**Model Name**: Qwen2.5-0.5B-Instruct (GRPO Fine-Tuned)
**Model ID**: `Qwen2.5-0.5B-R1subset`
**License**: [Apache 2.0 / or whichever applies]
**Finetuned From**: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
**Language(s)**: English (mathematical text)
**Developed By**: Christian H. Cooper
**Funding**: Self-sponsored
**Shared By**: Christian H. Cooper
### Model Description
This model is **Qwen2.5-0.5B-Instruct** fine-tuned on a **2% subset** of the [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) dataset. I used **Group Relative Policy Optimization (GRPO)** from the `trl` library, guiding the model toward producing well-formatted chain-of-thought answers in:
```
<reasoning>
...
</reasoning>
<answer>
...
</answer>
```
The model focuses on math reasoning tasks, learning to generate a step-by-step solution (`<reasoning>`) and a final numeric or textual answer (`<answer>`). The reward functions encourage correct chain-of-thought structure, numeric answers, and answer correctness.
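Downstream code can parse this tagged output with a small helper. A minimal sketch (the helper name is illustrative, not part of the model's API):

```python
import re

def extract_answer(completion):
    """Pull the text inside the <answer>...</answer> block, if present."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    return match.group(1) if match else None

completion = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>"
print(extract_answer(completion))  # -> 4
```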
### Model Sources
- **GitHub or Repo**: *[Pending]*
- **Paper/Demo**: *[Pending]*
## Uses
### Direct Use
- **Math Problem Solving**: The model tries to reason through math word problems, providing step-by-step reasoning and a final answer.
### Downstream Use
- **Educational Tools**: Potentially used in tutoring or step-by-step solution generation.
- **Math Chatbots**: A math helper that can respond in a structured `<reasoning>/<answer>` format.
### Out-of-Scope Use
- **High-Stakes Decisions**: Model is not guaranteed to be correct for advanced or critical math scenarios (finance, medical, engineering safety).
- **Non-English**: Primary training data is English math text, so reliability in other languages is minimal.
## Bias, Risks, and Limitations
- **Bias**: Although this is a math-focused dataset, any language model can exhibit unintended biases.
- **Risks**: The model may produce mathematically incorrect or incomplete solutions. The partial coverage (2% of the dataset) further limits accuracy.
- **Limitations**:
- Only partially fine-tuned on 2% of the data, so correctness is not guaranteed.
- The chain-of-thought is for interpretability but may still contain flawed reasoning or leaps.
## How to Get Started
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "HarleyCooper/Qwen.5B-OpenR1Math"  # Same name is kept through all % iterations.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Pose the question as a plain prompt; the model is trained to respond
# with <reasoning>...</reasoning><answer>...</answer> blocks.
prompt = """Question: It is known that in a convex $n$-gon ($n>3$) no three diagonals pass through the same point.
Find the number of points (distinct from the vertices) of intersection of pairs of diagonals."""

inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=2000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
### Training Data
- **Dataset**: A 2% subsample (~4.4k problems) of [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k).
- **Data Format**: Each sample has `problem`, `solution`, `answer`. We transform them into:
- `"prompt"`: A single string containing system instructions + the problem text.
- `"answer"`: A string with `<reasoning>` + `<answer>` blocks.
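The transformation above can be sketched as a simple mapping function (the system prompt text and sample values here are illustrative, not the exact strings used in training):

```python
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)

def to_training_record(sample):
    """Map a raw OpenR1-Math sample (problem/solution/answer) to prompt/answer."""
    return {
        "prompt": f"{SYSTEM_PROMPT}\n\n{sample['problem']}",
        "answer": (
            f"<reasoning>\n{sample['solution']}\n</reasoning>\n"
            f"<answer>\n{sample['answer']}\n</answer>"
        ),
    }

sample = {"problem": "What is 6*7?", "solution": "6*7 = 42.", "answer": "42"}
record = to_training_record(sample)
print(record["prompt"].endswith("What is 6*7?"))  # True
```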
### Training Procedure
- **Framework**: [TRL](https://github.com/huggingface/trl) with Group Relative Policy Optimization (GRPO).
- **Objective**: Reinforcement learning on chain-of-thought format, numeric correctness, and final-answer consistency.
- **Reward Functions**:
1. **`xmlcount_reward_func`**: Encourages `<reasoning>`/`<answer>` structure.
2. **`soft_format_reward_func`**: Checks for `<reasoning>.*</reasoning><answer>.*</answer>` in any multiline arrangement.
3. **`strict_format_reward_func`**: Strict multiline regex for exact formatting.
4. **`int_reward_func`**: Partial reward if the final `<answer>` is purely numeric.
5. **`correctness_reward_func`**: Binary reward if the final extracted answer exactly matches the known correct answer.
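As an illustration of the reward shaping, the numeric and correctness rewards might look roughly like this. The signatures are simplified for clarity (the actual GRPO reward functions take batched completions), and the reward magnitudes shown are assumptions:

```python
import re

def extract_answer(text):
    """Return the contents of the <answer> block, or an empty string."""
    m = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return m.group(1).strip() if m else ""

def int_reward(completion):
    """Partial reward if the extracted answer is purely numeric."""
    return 0.5 if extract_answer(completion).isdigit() else 0.0

def correctness_reward(completion, gold):
    """Binary reward if the extracted answer exactly matches the gold answer."""
    return 2.0 if extract_answer(completion) == gold else 0.0

out = "<reasoning>\nEach intersection picks 4 vertices.\n</reasoning>\n<answer>\n70\n</answer>"
print(int_reward(out), correctness_reward(out, "70"))  # 0.5 2.0
```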
#### Training Hyperparameters
- **Base Model**: Qwen2.5-0.5B
- **Learning Rate**: ~5e-6
- **Batch Size**: 1–2 (due to GPU constraints)
- **Optimizer**: AdamW (β1=0.9, β2=0.99)
- **Scheduler**: Cosine with warmup_ratio=0.1
- **Num Generations**: 16 (GRPO config)
- **Number of Training Epochs**: 1 epoch on 2% data
- **Hardware**: Single A100 40GB on Colab
- **Max Prompt Length**: 256 tokens
- **Max Completion Length**: 200 tokens
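The hyperparameters above map onto a TRL config roughly as follows. This is a sketch assuming a recent TRL release that ships `GRPOConfig`/`GRPOTrainer`; the `output_dir` value is illustrative:

```python
from trl import GRPOConfig

# Values mirror the hyperparameters listed above.
config = GRPOConfig(
    output_dir="qwen2.5-0.5b-openr1-grpo",  # hypothetical path
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=1,
    num_generations=16,
    max_prompt_length=256,
    max_completion_length=200,
    num_train_epochs=1,
)
```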
### Speeds, Sizes, Times
- **Approx. Steps**: ~200–300 steps for 2% subset
- **Run Time**: Varies from ~1 to 2 hours on Colab A100
## Evaluation
### Testing Data
- The model is currently trained and tested on the same 2% subset. The next step is to evaluate on a withheld portion or the full set to measure true correctness.
### Metrics
- **Format Rewards**: `xmlcount`, `soft_format`, `strict_format`
- **Correctness**: Exact match final numeric/string answer
- **Partial Numeric**: `int_reward_func`
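Format compliance can be measured with a simple regex pass over generated completions. A sketch of the soft-format check, not the exact evaluation script:

```python
import re

# Soft format: a <reasoning> block followed by an <answer> block, any spacing.
SOFT_FORMAT = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)

def format_compliance(completions):
    """Fraction of completions matching the soft <reasoning>/<answer> pattern."""
    hits = sum(1 for c in completions if SOFT_FORMAT.search(c))
    return hits / len(completions) if completions else 0.0

outs = [
    "<reasoning>\nstep\n</reasoning>\n<answer>\n5\n</answer>",
    "just a bare answer: 5",
]
print(format_compliance(outs))  # 0.5
```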
### Results
- The model shows a strong improvement in output format (70–80% format compliance) but relatively low exact numeric correctness. Additional epochs or a larger training fraction are needed for better correctness.
## Environmental Impact
- **Hardware**: Single A100 40GB GPU in a Colab environment
- **Train Time**: ~1–2 hours on 2% data
- **Carbon Footprint**: Not measured exactly, but minimal compared to large-scale runs
## Model Architecture & Objective
- **Architecture**: Transformer-based causal language model (Qwen2.5-0.5B)
- **Objective**: RL-based chain-of-thought generation for math reasoning
## Citation
```
@misc{cooper2025qwen,
  title={Qwen2.5-0.5B Fine-Tuned on OpenR1 (2% subset)},
  author={Christian H. Cooper},
  howpublished={\url{https://huggingface.co/Christian-cooper-us/Qwen2.5-0.5B-R1subset}},
  year={2025},
}
```
## Contact
- Maintainer: Christian Cooper (Hugging Face: [@HarleyCooper](https://huggingface.co/HarleyCooper))
---
**Disclaimer**: This model is experimental, trained on only 2% of the dataset. It may produce inaccurate math solutions and is not suitable for high-stakes or time-sensitive deployments. |