havinashpatil committed on
Commit ·
7f4c57d
Parent(s): 5dffd52
Add detailed results charts to README and BLOG
BLOG.md
CHANGED
|
@@ -183,11 +183,20 @@ The key insight is that **the reward is not static**: it comes from actually
|
|
| 183 |
|
| 184 |
We trained `Qwen/Qwen2.5-Coder-1.5B` on the `m-a-p/Code-Feedback` dataset with CodeArena as the reward environment.
|
| 185 |
|
| 186 |
-

|
| 187 |
-
*Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.*
|
| 188 |
|
| 189 |
-

|
| 190 |
-
*Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging, exactly the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.*
|
| 191 |
|
| 192 |
### Reproducing the Training
|
| 193 |
|
|
|
|
| 183 |
|
| 184 |
We trained `Qwen/Qwen2.5-Coder-1.5B` on the `m-a-p/Code-Feedback` dataset with CodeArena as the reward environment.
|
| 185 |
|
| 186 |
+

|
| 187 |
+
*Fig 1: Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.*
|
| 188 |
|
| 189 |
+

|
| 190 |
+
*Fig 2: Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging, exactly the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.*
|
| 191 |
+
|
| 192 |
+

|
| 193 |
+
*Fig 3: Task Difficulty Performance Matrix showing the mean, max, and standard deviation of rewards across difficulty levels.*
|
| 194 |
+
|
| 195 |
+

|
| 196 |
+
*Fig 4: Complexity Distribution highlighting the frequency of O(1) vs O(n) solutions generated by the agent.*
|
| 197 |
+
|
| 198 |
+

|
| 199 |
+
*Fig 5: Reward Distribution by Fixer Method, comparing the performance of the Ollama LLM to the built-in pattern-based fixer.*
|
| 200 |
|
| 201 |
### Reproducing the Training
|
| 202 |
|
README.md
CHANGED
|
@@ -104,17 +104,27 @@ The agent must learn to **read error messages**, **avoid repeating failed fixes*
|
|
| 104 |
|
| 105 |
We trained `Qwen/Qwen2.5-Coder-1.5B` using **TRL GRPO** (Group Relative Policy Optimization) with CodeArena as the live reward environment.
|
| 106 |
|
| 107 |
-

|
| 108 |
-
*Episode reward over training steps. The rolling 10-step average shows clear learning progression from near-zero to consistent 0.65+ rewards.*
|
| 109 |
|
| 110 |
-

|
| 111 |
-
*Average reward by task category. Easy/type-error tasks are mastered first; algorithmic optimization remains challenging, exactly the curriculum behavior we designed for.*
|
| 112 |
|
| 113 |
### Key Observations:
|
| 114 |
- **Initial performance**: Agent produces syntactically broken fixes → reward ≈ 0.01
|
| 115 |
- **After 20 steps**: Agent learns to fix syntax → reward ≈ 0.35
|
| 116 |
- **After 40 steps**: Agent learns to pass tests → reward ≈ 0.65
|
| 117 |
- **Steady improvement**: Rolling average trends upward, with hard tasks remaining the frontier challenge
|
|
|
|
| 118 |
|
| 119 |
---
|
| 120 |
|
|
|
|
| 104 |
|
| 105 |
We trained `Qwen/Qwen2.5-Coder-1.5B` using **TRL GRPO** (Group Relative Policy Optimization) with CodeArena as the live reward environment.
|
| 106 |
|
| 107 |
+

|
| 108 |
+
*Fig 1: Episode reward over training steps. The rolling 10-step average shows clear learning progression from near-zero to consistent 0.65+ rewards.*
|
| 109 |
|
| 110 |
+

|
| 111 |
+
*Fig 2: Average reward by task category. Easy/type-error tasks are mastered first; algorithmic optimization remains challenging, exactly the curriculum behavior we designed for.*
|
| 112 |
+
|
| 113 |
+

|
| 114 |
+
*Fig 3: Task Difficulty Performance Matrix showing the mean, max, and standard deviation of rewards across difficulty levels.*
|
| 115 |
+
|
| 116 |
+

|
| 117 |
+
*Fig 4: Complexity Distribution highlighting the frequency of O(1) vs O(n) solutions generated by the agent.*
|
| 118 |
+
|
| 119 |
+

|
| 120 |
+
*Fig 5: Reward Distribution by Fixer Method, comparing the performance of the Ollama LLM to the built-in pattern-based fixer.*
|
| 121 |
|
| 122 |
### Key Observations:
|
| 123 |
- **Initial performance**: Agent produces syntactically broken fixes → reward ≈ 0.01
|
| 124 |
- **After 20 steps**: Agent learns to fix syntax → reward ≈ 0.35
|
| 125 |
- **After 40 steps**: Agent learns to pass tests → reward ≈ 0.65
|
| 126 |
- **Steady improvement**: Rolling average trends upward, with hard tasks remaining the frontier challenge
|
| 127 |
+
- **Method Effectiveness (Fig 5)**: The LLM-based fixer significantly outperforms the static pattern-based approach.
|
| 128 |
|
| 129 |
---
|
| 130 |
|
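
The Fig 1 captions in both files refer to a rolling 10-step average of episode reward. The commit does not show how that curve was computed, so the following is only a minimal sketch of one common way to do it: a trailing mean over the last `window` rewards. The function name `rolling_mean` and the sample rewards are hypothetical and are not taken from the CodeArena code or its training logs.

```python
from collections import deque
from statistics import mean


def rolling_mean(rewards, window=10):
    """Trailing mean of the last `window` rewards at each step.

    Early steps average over however many rewards exist so far,
    so the output has the same length as the input.
    """
    recent = deque(maxlen=window)  # oldest reward is dropped automatically
    smoothed = []
    for reward in rewards:
        recent.append(reward)
        smoothed.append(mean(recent))
    return smoothed


if __name__ == "__main__":
    # Made-up rewards for illustration only (not actual training data).
    sample = [0.0, 0.5, 1.0, 0.5]
    print(rolling_mean(sample, window=2))
```

A trailing window keeps the curve causal (each point depends only on past steps), which is usually what a training plot wants; a centered window would look smoother but would mix in rewards from later steps.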