GiuLeo01 committed
Commit a7252cc · verified · 1 Parent(s): 1a4e6bf

Update README.md

Files changed (1): README.md (+9 -0)
README.md CHANGED
```diff
@@ -134,6 +134,11 @@ The reward function used throughout this phase was very simple:
 
 The initial training phase was run for 3 epochs with:
 
+
+![Compile Reward](./imgs/grpo_1_compile_reward.png)
+![Correct Reward](./imgs/grpo_1_correct_reward.png)
+![Tot Reward](./imgs/grpo_1_tot_reward.png)
+
 * batch size = 16
 * number of generations = 4
 * learning rate = 1e-5
@@ -143,6 +148,10 @@ The initial training phase was run for 3 epochs with:
 
 A second phase followed, resetting the learning rate to `1e-6` with a linear decay schedule.
 
+![Compile Reward](./imgs/grpo_2_compile_reward.png)
+![Correct Reward](./imgs/grpo_2_correct_reward.png)
+![Tot Reward](./imgs/grpo_2_tot_reward.png)
+
 
 
 
```
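The diff references a two-phase learning-rate setup: a constant `1e-5` during the initial 3-epoch run, then a reset to `1e-6` with a linear decay schedule. The diff does not show the trainer code, so the function below is only an illustrative sketch of what a step-based linear decay could look like; the name `phase2_lr` and the step/total-step parameterization are assumptions, not the author's implementation:

```python
# Sketch of the two-phase learning-rate setup described in the README diff.
# Phase 1 used a constant lr of 1e-5 (batch size 16, 4 generations per
# prompt, 3 epochs); phase 2 resets to 1e-6 and decays linearly.
# All names here are illustrative assumptions -- the trainer code is not
# part of this commit.

def phase2_lr(step: int, total_steps: int, base_lr: float = 1e-6) -> float:
    """Linearly decay the learning rate from base_lr to 0 over total_steps."""
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / total_steps

# Example: halfway through phase 2 the rate is half of 1e-6.
lr_mid = phase2_lr(step=50, total_steps=100)
```

A step-based decay like this is equivalent to the `linear` scheduler option found in common training frameworks; the key point from the diff is only that phase 2 restarts from `1e-6` rather than continuing from phase 1's rate.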