GiuLeo01 committed on
Commit eea4487 · verified · 1 Parent(s): a7252cc

Update README.md

  1. README.md +22 -0
README.md CHANGED
@@ -152,6 +152,28 @@ A second phase followed, resetting the learning rate to `1e-6` with a linear dec
 ![Correct Reward](./imgs/grpo_2_correct_reward.png)
 ![Tot Reward](./imgs/grpo_2_tot_reward.png)
 
+## Citation
+
+If you use this model or parts of this work, please consider citing the references below.
+
+## References
+
+* Qwen/Qwen2-5-Coder-3B-Instruct
+[https://huggingface.co/Qwen/Qwen2-5-Coder-3B-Instruct](https://huggingface.co/Qwen/Qwen2-5-Coder-3B-Instruct)
+
+* Group Relative Policy Optimization (GRPO):
+[https://arxiv.org/abs/2205.13636](https://arxiv.org/abs/2205.13636)
+
+* Unsloth – Fast and memory-efficient fine-tuning via QLoRA
+[https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)
+
+* Hugging Face Transformers
+[https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
+
+
+
+
+