---
library_name: transformers
tags: []
---

# Model Card for Qwen2.5-0.5B-Instruct (Fine-Tuned on a 2% Subset of OpenR1-Math-220k; 50% Run Underway as of Feb 13)

## Model Details

**Model Name**: Qwen2.5-0.5B-Instruct (GRPO Fine-Tuned)  
**Model ID**: `Qwen2.5-0.5B-R1subset`  
**License**: [Apache 2.0 / or whichever applies]  
**Finetuned From**: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)  
**Language(s)**: English (mathematical text)

**Developed By**: Christian H. Cooper
**Funding**: Self-sponsored  
**Shared By**: Christian H. Cooper  

### Model Description

This model is **Qwen2.5-0.5B-Instruct** fine-tuned on a **2% subset** of the [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) dataset. I used **Group Relative Policy Optimization (GRPO)** from the `trl` library to guide the model toward producing well-formatted chain-of-thought answers in:

```
<reasoning>
  ...
</reasoning>
<answer>
  ...
</answer>
```

It focuses on math reasoning tasks, learning to generate a step-by-step solution (`<reasoning>`) and a numeric or short textual answer (`<answer>`). Training incorporates reward functions that encourage correct chain-of-thought structure, numeric answers, and correctness.

### Model Sources
- **GitHub or Repo**: *[Pending]*  
- **Paper/Demo**: *[Pending]*

## Uses

### Direct Use
- **Math Problem Solving**: The model tries to reason through math word problems, providing step-by-step reasoning and a final answer.

### Downstream Use
- **Educational Tools**: Potentially used in tutoring or step-by-step solution generation.  
- **Math Chatbots**: A math helper that can respond in a structured `<reasoning>/<answer>` format.

### Out-of-Scope Use
- **High-Stakes Decisions**: Model is not guaranteed to be correct for advanced or critical math scenarios (finance, medical, engineering safety).  
- **Non-English**: Primary training data is English math text, so reliability in other languages is minimal.

## Bias, Risks, and Limitations

- **Bias**: Although this is a math-focused dataset, any language model can exhibit unintended biases.  
- **Risks**: The model may produce mathematically incorrect or incomplete solutions. The partial coverage (2% of the dataset) further limits accuracy.  
- **Limitations**: 
  - Only partially fine-tuned on 2% of the data, so correctness is not guaranteed.  
  - The chain-of-thought is for interpretability but may still contain flawed reasoning or leaps.

## How to Get Started

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "HarleyCooper/Qwen.5B-OpenR1Math"  # Will keep the same name through all % iterations.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

prompt = """<reasoning>
Question: It is known that in a convex $n$-gon ($n>3$) no three diagonals pass through the same point.

Find the number of points (distinct from the vertices) of intersection of pairs of diagonals.
</reasoning>
<answer>
"""

inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=2000)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)
```

## Training Details

### Training Data

- **Dataset**: A 2% subsample (~4.4k problems) of [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k).  
- **Data Format**: Each sample has `problem`, `solution`, `answer`. We transform them into:
  - `"prompt"`: A single string containing system instructions + the problem text.  
  - `"answer"`: A string with `<reasoning>` + `<answer>` blocks.
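
A sketch of that transformation, with a hypothetical `SYSTEM_PROMPT` (the exact instruction text used in training is not shown here) and the field names listed above:

```python
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)  # illustrative instruction text, not the exact one used in training

def to_training_example(sample: dict) -> dict:
    """Map an OpenR1-Math-220k record (problem/solution/answer) to prompt/answer."""
    return {
        "prompt": f"{SYSTEM_PROMPT}\n\n{sample['problem']}",
        "answer": (
            f"<reasoning>\n{sample['solution']}\n</reasoning>\n"
            f"<answer>\n{sample['answer']}\n</answer>"
        ),
    }

example = to_training_example(
    {"problem": "What is 2 + 2?", "solution": "2 + 2 = 4.", "answer": "4"}
)
```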

### Training Procedure

- **Framework**: [TRL](https://github.com/lvwerra/trl) with its Group Relative Policy Optimization (GRPO) trainer.  
- **Objective**: Reinforcement learning on chain-of-thought format, numeric correctness, and final-answer consistency.  
- **Reward Functions**:
  1. **`xmlcount_reward_func`**: Encourages `<reasoning>`/`<answer>` structure.  
  2. **`soft_format_reward_func`**: Checks for `<reasoning>.*</reasoning><answer>.*</answer>` in any multiline arrangement.  
  3. **`strict_format_reward_func`**: Strict multiline regex for exact formatting.  
  4. **`int_reward_func`**: Partial reward if the final `<answer>` is purely numeric.  
  5. **`correctness_reward_func`**: Binary reward if the final extracted answer exactly matches the known correct answer.
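
Two of these rewards might look roughly like the following. These are illustrative reimplementations; the reward magnitudes (0.5) are assumptions, not the values used in training:

```python
import re

def soft_format_reward_func(completion: str) -> float:
    """Small reward if the completion contains <reasoning>...</reasoning><answer>...</answer>."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0

def int_reward_func(completion: str) -> float:
    """Partial reward if the extracted <answer> is purely numeric."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 0.5 if match.group(1).lstrip("-").isdigit() else 0.0

demo = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>"
print(soft_format_reward_func(demo), int_reward_func(demo))  # → 0.5 0.5
```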

#### Training Hyperparameters

- **Base Model**: Qwen2.5-0.5B  
- **Learning Rate**: ~5e-6  
- **Batch Size**: 1–2 (due to GPU constraints)  
- **Optimizer**: AdamW (β1=0.9, β2=0.99)  
- **Scheduler**: Cosine with warmup_ratio=0.1  
- **Num Generations**: 16 (GRPO config)  
- **Number of Training Epochs**: 1 epoch on 2% data  
- **Hardware**: Single A100 40GB on Colab  
- **Max Prompt Length**: 256 tokens  
- **Max Completion Length**: 200 tokens  
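
In recent TRL versions, the hyperparameters above would translate into a `GRPOConfig` roughly as follows. Treat this as a sketch rather than the actual training script: the `output_dir` is hypothetical, and the interplay between batch size and `num_generations` depends on the TRL version in use.

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="qwen2.5-0.5b-openr1-grpo",  # hypothetical path
    learning_rate=5e-6,
    per_device_train_batch_size=2,   # 1-2 in the actual run, per GPU constraints
    adam_beta1=0.9,
    adam_beta2=0.99,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    num_generations=16,              # GRPO group size
    max_prompt_length=256,
    max_completion_length=200,
)
```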

### Speeds, Sizes, Times
- **Approx. Steps**: ~200–300 steps for 2% subset  
- **Run Time**: Varies from ~1 to 2 hours on Colab A100

## Evaluation

### Testing Data
- Currently evaluated on the same 2% subset used for training. The next step is to evaluate on a held-out portion or the full set to measure true correctness.

### Metrics
- **Format Rewards**: `xmlcount`, `soft_format`, `strict_format`  
- **Correctness**: Exact match final numeric/string answer  
- **Partial Numeric**: `int_reward_func`

### Results
- The model shows a strong improvement in output format (70–80% format compliance) but relatively low exact numeric correctness. Additional epochs or a larger training fraction are needed for better correctness.

## Environmental Impact

- **Hardware**: Single A100 40GB GPU in a Colab environment  
- **Train Time**: ~1–2 hours on 2% data  
- **Carbon Footprint**: Not measured exactly, but minimal compared to large-scale runs

## Model Architecture & Objective

- **Architecture**: Transformer-based causal language model (Qwen2.5-0.5B)  
- **Objective**: RL-based chain-of-thought generation for math reasoning

## Citation

```
@misc{cooperQwen2.5-0.5B,
  title={Qwen2.5-0.5B Fine-Tuned on OpenR1 (2% subset)},
  author={Christian H. Cooper},
  howpublished={\url{https://huggingface.co/Christian-cooper-us/Qwen2.5-0.5B-R1subset}},
  year={2025},
}
```

## Contact
- Maintainer: Christian H. Cooper (Hugging Face: [@HarleyCooper](https://huggingface.co/HarleyCooper))

---

**Disclaimer**: This model is experimental, trained on only 2% of the dataset. It may produce inaccurate math solutions and is not suitable for high-stakes or time-sensitive deployments.