Spaces: ceoavinash/codearena-rl


havinashpatil committed 29 days ago

Commit 7f4c57d, 1 Parent(s): 5dffd52

Add detailed results charts to README and BLOG

Files changed (2)

BLOG.md +13 -4
README.md +14 -4

BLOG.md CHANGED

@@ -183,11 +183,20 @@ The key insight is that **the reward is not static** — it comes from actually
 We trained `Qwen/Qwen2.5-Coder-1.5B` on the `m-a-p/Code-Feedback` dataset with CodeArena as the reward environment.
-![Reward Curve](results/reward_curve.png)
-*Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.*
-![Reward by Task](results/reward_by_task.png)
-*Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging — exactly the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.*
 ### Reproducing the Training

 We trained `Qwen/Qwen2.5-Coder-1.5B` on the `m-a-p/Code-Feedback` dataset with CodeArena as the reward environment.
+![Fig 1: Reward Curve](results/reward_curve.png)
+*Fig 1: Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.*
+![Fig 2: Reward by Task](results/reward_by_task.png)
+*Fig 2: Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging — exactly the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.*
+![Fig 3: Task Performance Matrix](results/task_performance_matrix.png)
+*Fig 3: Task Difficulty Performance Matrix showing the mean, max, and standard deviation of rewards across difficulty levels.*
+![Fig 4: Complexity Distribution](results/complexity_distribution.png)
+*Fig 4: Complexity Distribution highlighting the frequency of O(1) vs O(n) solutions generated by the agent.*
+![Fig 5: Fixer Method Boxplot](results/method_boxplot.png)
+*Fig 5: Reward Distribution by Fixer Method, comparing the performance of the Ollama LLM to the built-in pattern-based fixer.*
 ### Reproducing the Training
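
Reviewer note on the Fig 1 caption: the "rolling 10-step average" it mentions could be computed along the lines of the sketch below. This is only an illustration under assumptions; the function name `rolling_mean` and the trailing-window convention are hypothetical and are not taken from the repository's plotting code.

```python
from collections import deque


def rolling_mean(rewards, window=10):
    """Trailing mean over the last `window` rewards at each step.

    Hypothetical helper: early steps average over however many
    rewards have been seen so far (fewer than `window`).
    """
    buf = deque(maxlen=window)
    total = 0.0
    smoothed = []
    for r in rewards:
        if len(buf) == window:
            # The oldest value is about to be evicted by append().
            total -= buf[0]
        buf.append(r)
        total += r
        smoothed.append(total / len(buf))
    return smoothed


if __name__ == "__main__":
    # Toy values for illustration only, not training results.
    print(rolling_mean([1, 2, 3, 4], window=2))
```

A trailing window keeps the smoothed curve aligned with the step at which each reward was observed, at the cost of a noisier start.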

README.md CHANGED

@@ -104,17 +104,27 @@ The agent must learn to **read error messages**, **avoid repeating failed fixes*
 We trained `Qwen/Qwen2.5-Coder-1.5B` using **TRL GRPO** (Group Relative Policy Optimization) with CodeArena as the live reward environment.
-![Reward Curve](results/reward_curve.png)
-*Episode reward over training steps. The rolling 10-step average shows clear learning progression from near-zero to consistent 0.65+ rewards.*
-![Reward by Task](results/reward_by_task.png)
-*Average reward by task category. Easy/type-error tasks are mastered first; algorithmic optimization remains challenging — exactly the curriculum behavior we designed for.*
 ### Key Observations:
 - **Initial performance**: Agent produces syntactically broken fixes → reward ≈ 0.01
 - **After 20 steps**: Agent learns to fix syntax → reward ≈ 0.35
 - **After 40 steps**: Agent learns to pass tests → reward ≈ 0.65
 - **Steady improvement**: Rolling average trends upward, with hard tasks remaining the frontier challenge
 ---

 We trained `Qwen/Qwen2.5-Coder-1.5B` using **TRL GRPO** (Group Relative Policy Optimization) with CodeArena as the live reward environment.
+![Fig 1: Reward Curve](results/reward_curve.png)
+*Fig 1: Episode reward over training steps. The rolling 10-step average shows clear learning progression from near-zero to consistent 0.65+ rewards.*
+![Fig 2: Reward by Task](results/reward_by_task.png)
+*Fig 2: Average reward by task category. Easy/type-error tasks are mastered first; algorithmic optimization remains challenging — exactly the curriculum behavior we designed for.*
+![Fig 3: Task Performance Matrix](results/task_performance_matrix.png)
+*Fig 3: Task Difficulty Performance Matrix showing the mean, max, and standard deviation of rewards across difficulty levels.*
+![Fig 4: Complexity Distribution](results/complexity_distribution.png)
+*Fig 4: Complexity Distribution highlighting the frequency of O(1) vs O(n) solutions generated by the agent.*
+![Fig 5: Fixer Method Boxplot](results/method_boxplot.png)
+*Fig 5: Reward Distribution by Fixer Method, comparing the performance of the Ollama LLM to the built-in pattern-based fixer.*
 ### Key Observations:
 - **Initial performance**: Agent produces syntactically broken fixes → reward ≈ 0.01
 - **After 20 steps**: Agent learns to fix syntax → reward ≈ 0.35
 - **After 40 steps**: Agent learns to pass tests → reward ≈ 0.65
 - **Steady improvement**: Rolling average trends upward, with hard tasks remaining the frontier challenge
+- **Method Effectiveness (Fig 5)**: The LLM-based fixer significantly outperforms the static pattern-based approach.
 ---
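
Reviewer note on the Fig 3 caption: the per-difficulty mean, max, and standard deviation it describes could be derived from logged episodes roughly as follows. This is a sketch under assumptions; the `(difficulty, reward)` tuple layout and the name `summarize_by_difficulty` are hypothetical, and the population standard deviation is one possible choice, not necessarily what the chart uses.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_by_difficulty(episodes):
    """Group (difficulty, reward) pairs and summarize each group.

    Hypothetical helper: returns {difficulty: {"mean", "max", "std"}}.
    """
    groups = defaultdict(list)
    for difficulty, reward in episodes:
        groups[difficulty].append(reward)
    return {
        difficulty: {
            "mean": mean(rewards),
            "max": max(rewards),
            "std": pstdev(rewards),  # population std; assumption
        }
        for difficulty, rewards in groups.items()
    }


if __name__ == "__main__":
    # Toy values for illustration only, not training results.
    toy = [("easy", 1.0), ("easy", 0.0), ("hard", 0.5)]
    print(summarize_by_difficulty(toy))
```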