Commit 7f4c57d by havinashpatil · 1 Parent(s): 5dffd52

Add detailed results charts to README and BLOG

Files changed (2)
  1. BLOG.md +13 -4
  2. README.md +14 -4
BLOG.md CHANGED
@@ -183,11 +183,20 @@ The key insight is that **the reward is not static** - it comes from actually

  We trained `Qwen/Qwen2.5-Coder-1.5B` on the `m-a-p/Code-Feedback` dataset with CodeArena as the reward environment.

- ![Reward Curve](results/reward_curve.png)
- *Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.*
+ ![Fig 1: Reward Curve](results/reward_curve.png)
+ *Fig 1: Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.*

- ![Reward by Task](results/reward_by_task.png)
- *Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging - exactly the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.*
+ ![Fig 2: Reward by Task](results/reward_by_task.png)
+ *Fig 2: Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging - exactly the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.*
+
+ ![Fig 3: Task Performance Matrix](results/task_performance_matrix.png)
+ *Fig 3: Task Difficulty Performance Matrix showing the mean, max, and standard deviation of rewards across difficulty levels.*
+
+ ![Fig 4: Complexity Distribution](results/complexity_distribution.png)
+ *Fig 4: Complexity Distribution highlighting the frequency of O(1) vs O(n) solutions generated by the agent.*
+
+ ![Fig 5: Fixer Method Boxplot](results/method_boxplot.png)
+ *Fig 5: Reward Distribution by Fixer Method, comparing the performance of the Ollama LLM to the built-in pattern-based fixer.*

  ### Reproducing the Training

README.md CHANGED
@@ -104,17 +104,27 @@ The agent must learn to **read error messages**, **avoid repeating failed fixes*

  We trained `Qwen/Qwen2.5-Coder-1.5B` using **TRL GRPO** (Group Relative Policy Optimization) with CodeArena as the live reward environment.

- ![Reward Curve](results/reward_curve.png)
- *Episode reward over training steps. The rolling 10-step average shows clear learning progression from near-zero to consistent 0.65+ rewards.*
+ ![Fig 1: Reward Curve](results/reward_curve.png)
+ *Fig 1: Episode reward over training steps. The rolling 10-step average shows clear learning progression from near-zero to consistent 0.65+ rewards.*

- ![Reward by Task](results/reward_by_task.png)
- *Average reward by task category. Easy/type-error tasks are mastered first; algorithmic optimization remains challenging - exactly the curriculum behavior we designed for.*
+ ![Fig 2: Reward by Task](results/reward_by_task.png)
+ *Fig 2: Average reward by task category. Easy/type-error tasks are mastered first; algorithmic optimization remains challenging - exactly the curriculum behavior we designed for.*
+
+ ![Fig 3: Task Performance Matrix](results/task_performance_matrix.png)
+ *Fig 3: Task Difficulty Performance Matrix showing the mean, max, and standard deviation of rewards across difficulty levels.*
+
+ ![Fig 4: Complexity Distribution](results/complexity_distribution.png)
+ *Fig 4: Complexity Distribution highlighting the frequency of O(1) vs O(n) solutions generated by the agent.*
+
+ ![Fig 5: Fixer Method Boxplot](results/method_boxplot.png)
+ *Fig 5: Reward Distribution by Fixer Method, comparing the performance of the Ollama LLM to the built-in pattern-based fixer.*

  ### Key Observations:
  - **Initial performance**: Agent produces syntactically broken fixes → reward ≈ 0.01
  - **After 20 steps**: Agent learns to fix syntax → reward ≈ 0.35
  - **After 40 steps**: Agent learns to pass tests → reward ≈ 0.65
  - **Steady improvement**: Rolling average trends upward, with hard tasks remaining the frontier challenge
+ - **Method Effectiveness (Fig 5)**: The LLM-based fixer significantly outperforms the static pattern-based approach.

  ---

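Note on the captions: the Fig 1 caption refers to a rolling 10-step average and the Fig 3 caption to the mean, max, and standard deviation of rewards per difficulty level. The commit does not include the plotting code, so the following is only a minimal sketch of how such statistics could be computed. The function names `rolling_mean` and `summarize_by_difficulty`, the `(difficulty, reward)` input shape, and the choice of a trailing window and population standard deviation are all assumptions, not taken from the repository.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def rolling_mean(rewards: Iterable[float], window: int = 10) -> List[float]:
    """Trailing mean of the last `window` rewards (fewer while warming up).

    Hypothetical helper: the source only says "rolling 10-step average".
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # the oldest reward drops out automatically
    smoothed = []
    for reward in rewards:
        recent.append(reward)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


def summarize_by_difficulty(
    episodes: Iterable[Tuple[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Group rewards by difficulty label and report mean, max, and std.

    Assumption: population standard deviation (pstdev); the caption does
    not say whether a population or sample estimate was plotted.
    """
    grouped = defaultdict(list)
    for difficulty, reward in episodes:
        grouped[difficulty].append(reward)
    return {
        level: {"mean": mean(values), "max": max(values), "std": pstdev(values)}
        for level, values in grouped.items()
    }


if __name__ == "__main__":
    # Made-up rewards for illustration only, not results from the training run.
    demo = [("easy", 1.0), ("easy", 0.0), ("hard", 0.5)]
    print(rolling_mean([reward for _, reward in demo], window=2))
    print(summarize_by_difficulty(demo))
```

A trailing window is used here so that each smoothed point depends only on past steps, which is the usual choice for a training curve; a centered window would be an equally valid reading of the caption.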