Space: ceoavinash/codearena-rl

havinashpatil committed on Apr 26

Commit

90be6c7

1 Parent(s): 5e35378

Final hackathon submission: polished README + detailed blog writeup

Files changed (2)

BLOG.md +294 -0
README.md +178 -70

BLOG.md ADDED

	@@ -0,0 +1,294 @@

+# CodeArena: Teaching LLMs to Debug Code Through Reinforcement Learning
+**An OpenEnv-compatible RL environment for iterative code repair with adaptive difficulty, hybrid grading, and self-improving agent memory.**
+[![HuggingFace Space](https://img.shields.io/badge/🤗%20Space-Live%20Demo-brightgreen)](https://huggingface.co/spaces/ceoavinash/codearena-rl)
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)
+[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue)](https://github.com/havinashpatil/meta)
+---
+## The Problem: Why We Built CodeArena
+Every major AI coding assistant — GitHub Copilot, Cursor, Devin — is benchmarked on **code generation**. Can it write a function? Can it complete a snippet?
+But here's the gap nobody is talking about: **what happens when the code breaks?**
+In production, code breaks constantly. A real developer doesn't just generate code — they spend the majority of their time **reading error logs, reasoning about failure, iterating on fixes, and recovering from mistakes.** This iterative debugging loop is the core skill that separates a junior developer from a senior one.
+Yet there is no standardized RL environment to train or evaluate an LLM on this capability. HumanEval measures one-shot generation. MBPP measures function completion. Neither measures what happens across multiple repair attempts when the first fix doesn't work.
+**CodeArena** is the first open-source, OpenEnv-compatible reinforcement learning environment built specifically for **iterative code repair**.
+---
+## How CodeArena Works
+### The Loop
+CodeArena simulates the real-world debugging workflow:
+```
+1. Agent receives buggy Python code + error log
+2. Agent proposes a fix
+3. Environment executes the fix in a sandboxed subprocess
+4. Environment runs unit tests and scores the fix
+5. Agent receives reward + updated error log
+6. Repeat up to 5 steps
+```
+This is fundamentally different from one-shot code generation benchmarks. The agent must:
+- **Read and interpret error messages** from previous attempts
+- **Track what it has already tried** (repeated fixes are penalized)
+- **Decide whether to patch locally or rewrite entirely**
+- **Optimize for efficiency**, not just correctness
+### Architecture
+```
+Agent ─── POST /reset ──→ CodeArena Server ──→ Returns buggy_code + error_log
+  │                            │
+  │                            ├── Task Loader (9 tasks across 5 categories)
+  │                            ├── Sandboxed Executor (subprocess + timeout)
+  │                            ├── Hybrid Grader (tests + LLM judge)
+  │                            ├── Algorithm Detector (complexity analysis)
+  │                            └── Agent Memory (self-improving store)
+  │
+  └── POST /step ────────→ Returns observation, reward, done, info
+```
+The server is a standard FastAPI application that implements the OpenEnv specification (`/reset`, `/step`, `/state`). The `openenv.yaml` manifest defines the observation space (buggy code, error log, test results, previous attempts) and the action space (proposed fix).
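The loop and endpoints described above can be sketched from the client side. This is a minimal sketch, not the project's own client: `run_episode`, `post_json`, and the `propose_fix` callback are hypothetical names, and it assumes `/reset` returns the observation directly while `/step` returns the `observation`, `reward`, `done`, and `info` fields shown in the diagram. The request payloads (`task_id`, `proposed_fix`) and the 5-step limit come from the text.

```python
import json
from urllib import request


def post_json(url, payload):
    """POST a JSON payload and decode the JSON response (standard library only)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(propose_fix, base_url="http://localhost:7860",
                task_id="auto", max_steps=5, post=post_json):
    """Run one repair episode and return the list of per-step rewards.

    `propose_fix` is a hypothetical callback: observation dict -> fixed source code.
    `post` is injectable so the loop can be exercised without a running server.
    """
    # Assumption: /reset responds with the initial observation (buggy code + error log).
    observation = post(f"{base_url}/reset", {"task_id": task_id})
    rewards = []
    for _ in range(max_steps):
        fix = propose_fix(observation)
        result = post(f"{base_url}/step", {"proposed_fix": fix})
        rewards.append(result.get("reward", 0.0))
        # Keep the previous observation if the response does not carry a new one.
        observation = result.get("observation", observation)
        if result.get("done"):
            break
    return rewards
```

Passing the transport in as `post` keeps the loop testable offline; against a live server the default `post_json` is enough.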
+---
+## What Makes CodeArena Special (Environment Innovation)
+### 1. Hybrid Grader: Tests + LLM-as-Judge
+Most coding benchmarks use a single signal: did the tests pass? This creates a fundamental problem — agents learn to produce code that passes weak tests through reward-hacking (e.g., hardcoding expected outputs, or producing syntactically correct but semantically broken code).
+CodeArena uses a **Hybrid Grader** with six weighted components:
+| Component | Weight | What It Measures |
+|---|---|---|
+| `compile_score` | 15% | Code compiles without syntax errors |
+| `test_pass_ratio` | 35% | Fraction of unit tests passed |
+| `efficiency_score` | 30% | Execution time vs. optimal runtime |
+| `llm_correctness` | 10% | LLM judge: is the fix logically correct? |
+| `llm_security` | 5% | LLM judge: does the fix introduce vulnerabilities? |
+| `llm_quality` | 5% | LLM judge: is the code readable and maintainable? |
+Additionally, two penalties are applied:
+- **Step penalty** (`-0.01 × step_count`): Rewards faster fixes
+- **Novelty penalty** (`-0.10`): Penalizes submitting the same fix twice
+The LLM judge is called via the OpenAI-compatible API (configurable to GPT-4o-mini, local Ollama, or HuggingFace Inference). When no API key is available, it falls back to neutral scores (0.5), ensuring the environment always runs.
+**Why this matters for training:** The heavy 30% weight on efficiency means that an agent that passes all tests with an O(n²) brute-force solution gets a significantly lower reward than one that uses an O(n) algorithm. This forces the model to learn *algorithmic reasoning*, not just syntax repair.
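The weighting above can be sketched in a few lines. This is an illustration rather than the grader's actual code: the function name `hybrid_reward`, the dictionary layout, and the absence of any clamping are assumptions, while the weights, the two penalties, and the 0.5 neutral fallback come from the text. The efficiency value of 0.2 in the usage lines is an arbitrary placeholder.

```python
# Weights from the component table above (they sum to 1.0).
WEIGHTS = {
    "compile_score": 0.15,
    "test_pass_ratio": 0.35,
    "efficiency_score": 0.30,
    "llm_correctness": 0.10,
    "llm_security": 0.05,
    "llm_quality": 0.05,
}
LLM_KEYS = ("llm_correctness", "llm_security", "llm_quality")
STEP_PENALTY = 0.01       # subtracted once per step taken
NOVELTY_PENALTY = 0.10    # subtracted when the same fix is submitted again
NEUTRAL_LLM_SCORE = 0.5   # fallback when no LLM judge is available


def hybrid_reward(components, step_count, repeated_fix=False):
    """Hypothetical helper: combine component scores (each in [0, 1]) into one reward."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Missing LLM scores fall back to neutral; other missing scores count as 0.
        default = NEUTRAL_LLM_SCORE if name in LLM_KEYS else 0.0
        total += weight * components.get(name, default)
    total -= STEP_PENALTY * step_count
    if repeated_fix:
        total -= NOVELTY_PENALTY
    return total


# Two fixes that both pass every test on step 1, with no LLM judge configured.
slow = hybrid_reward(
    {"compile_score": 1.0, "test_pass_ratio": 1.0, "efficiency_score": 0.2},
    step_count=1,
)
fast = hybrid_reward(
    {"compile_score": 1.0, "test_pass_ratio": 1.0, "efficiency_score": 1.0},
    step_count=1,
)
print(round(slow, 2), round(fast, 2))
```

Under these assumptions the only difference between the two calls is the efficiency term, which is what separates a brute-force fix from an efficient one in the final reward.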
+### 2. Adaptive Curriculum (Theme #4: Self-Improvement)
+CodeArena doesn't use a fixed task set. It features an **Adaptive Curriculum** that tracks the agent's rolling average reward over recent episodes and automatically adjusts difficulty:
+| Condition | Transition |
+|---|---|
+| avg reward > 0.80 on Easy | → Medium |
+| avg reward > 0.75 on Medium | → Hard |
+| avg reward < 0.35 on Hard | → Medium (de-escalate) |
+| avg reward < 0.35 on Medium | → Easy (de-escalate) |
+This is activated by passing `task_id: "auto"` to the `/reset` endpoint.
+**Why this matters:** The agent cannot plateau by memorizing solutions to easy tasks. As soon as it masters syntax errors, the environment pushes it to algorithmic logic bugs. If it struggles, it recovers on easier tasks before trying again. This creates a natural *recursive skill amplification* loop — the environment drives the agent's own capability growth.
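The transition table can be expressed as a small state machine. The sketch below is an assumption-laden illustration, not the server's implementation: the `Curriculum` class, the 10-episode window, waiting for a full window before switching, and clearing the window after a switch are all choices made here for clarity. Only the four thresholds come from the table.

```python
from collections import deque


def next_level(level, avg_reward):
    """Apply the transition table to the current level and rolling average reward."""
    if level == "easy" and avg_reward > 0.80:
        return "medium"
    if level == "medium" and avg_reward > 0.75:
        return "hard"
    if level == "hard" and avg_reward < 0.35:
        return "medium"
    if level == "medium" and avg_reward < 0.35:
        return "easy"
    return level


class Curriculum:
    """Hypothetical tracker: rolling average over the last `window` episode rewards."""

    def __init__(self, window=10, level="easy"):
        self.level = level
        self.rewards = deque(maxlen=window)

    def record(self, reward):
        """Record one episode reward and return the level for the next episode."""
        self.rewards.append(reward)
        # Assumption: only switch once the window is full, then start a fresh window.
        if len(self.rewards) == self.rewards.maxlen:
            average = sum(self.rewards) / len(self.rewards)
            new_level = next_level(self.level, average)
            if new_level != self.level:
                self.level = new_level
                self.rewards.clear()
        return self.level
```

Clearing the window after a switch is one way to keep rewards earned at the old difficulty from triggering an immediate second transition.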
+### 3. Algorithm Detection + Adaptive Prompting
+CodeArena includes a built-in **Algorithm Detector** (`server/algorithm_detector.py`) that:
+1. **Classifies the problem type** (max subarray, two-sum, binary search, sliding window, etc.) from code patterns
+2. **Estimates time complexity** by analyzing loop nesting depth (O(1) → O(n) → O(n²) → O(n³))
+3. **Generates targeted optimization hints** (e.g., "Use Kadane's Algorithm O(n): `curr = max(num, curr+num)`")
+When the AI fixer generates a repair, the algorithm detector provides **adaptive prompt suffixes** based on the current reward level:
+- Low reward (< 0.4): "Focus on correctness. Fix syntax errors first."
+- Medium reward (0.4–0.7): "Fix edge cases and logic bugs."
+- High reward (> 0.7): "Optimize for performance. Use O(n) algorithms."
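Both ideas in this section, estimating complexity from loop nesting and choosing a prompt suffix from the reward, are simple enough to sketch. The code below is a rough approximation and not the contents of `server/algorithm_detector.py`: the function names are hypothetical, comprehensions and recursion are ignored, and treating exactly 0.4 and 0.7 as "medium" is an assumption about the boundary cases.

```python
import ast

# Nesting depth -> complexity label; anything deeper than 2 is reported as O(n³).
COMPLEXITY_BY_DEPTH = {0: "O(1)", 1: "O(n)", 2: "O(n²)"}


def max_loop_depth(source):
    """Return the deepest nesting of for/while loops found in the source."""
    def depth(node, current):
        if isinstance(node, (ast.For, ast.AsyncFor, ast.While)):
            current += 1
        deepest = current
        for child in ast.iter_child_nodes(node):
            deepest = max(deepest, depth(child, current))
        return deepest

    return depth(ast.parse(source), 0)


def estimate_complexity(source):
    """Map loop nesting depth to a coarse time-complexity label."""
    return COMPLEXITY_BY_DEPTH.get(max_loop_depth(source), "O(n³)")


def prompt_suffix(reward):
    """Pick the adaptive prompt suffix for the current reward level."""
    if reward < 0.4:
        return "Focus on correctness. Fix syntax errors first."
    if reward <= 0.7:
        return "Fix edge cases and logic bugs."
    return "Optimize for performance. Use O(n) algorithms."
```

Loop depth is only a heuristic: a single loop that calls a linear helper on each iteration would still be labeled O(n) by this sketch.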
+### 4. Self-Improving Agent Memory
+CodeArena includes a persistent **Agent Memory** system (`server/memory.py`) that stores the best solution found for each task. When the agent encounters the same task type again, it can retrieve its previous best solution as a starting point.
+This creates a genuine self-improvement loop:
+- Episode 1: Agent fixes syntax → reward 0.45
+- Episode 5: Agent recalls its best previous fix, optimizes further → reward 0.72
+- Episode 10: Agent has accumulated enough memory to skip basic fixes entirely → reward 0.88
+The memory is persisted to `agent_memory.json` and survives server restarts.
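A best-solution store of this kind can be sketched as a JSON-backed dictionary. This is a minimal sketch, not the contents of `server/memory.py`: the `AgentMemory` class, its method names, and the `{"code": ..., "reward": ...}` record layout are assumptions. Only the file name `agent_memory.json` and the "keep the best solution per task" behavior come from the text.

```python
import json
import os


class AgentMemory:
    """Hypothetical store: keeps the highest-reward solution seen for each task."""

    def __init__(self, path="agent_memory.json"):
        self.path = path
        self.entries = {}
        # Reload earlier results so the memory survives restarts.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as handle:
                self.entries = json.load(handle)

    def best(self, task_id):
        """Return the stored {"code", "reward"} record for a task, or None."""
        return self.entries.get(task_id)

    def update(self, task_id, code, reward):
        """Store the solution if it beats the current best; return True if stored."""
        current = self.entries.get(task_id)
        if current is not None and current["reward"] >= reward:
            return False
        self.entries[task_id] = {"code": code, "reward": reward}
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(self.entries, handle, indent=2)
        return True
```

Writing the file on every improvement keeps the sketch simple; a busier server would likely batch writes or guard the file with a lock.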
+### 5. Rich Task Diversity
+CodeArena ships with **9 tasks across 5 categories**:
+| Category | Tasks | What It Tests |
+|---|---|---|
+| Easy (syntax) | Missing colons, wrong indentation | Basic Python syntax repair |
+| Medium (logic) | Off-by-one errors, wrong conditions | Algorithmic reasoning |
+| Hard (optimization) | O(n²) → O(n) refactoring | Algorithm design |
+| Type Errors | Wrong types, missing conversions | Type system understanding |
+| Security Bugs | SQL injection, path traversal | Security awareness |
+Each task includes:
+- Buggy source code
+- Multiple unit tests
+- An optimal execution time baseline (for efficiency scoring)
+---
+## Training Pipeline: TRL GRPO on CodeArena
+We trained a coding model using **Hugging Face TRL's GRPO (Group Relative Policy Optimization)** trainer, connecting it directly to the CodeArena environment as a live reward signal.
+### How It Works
+```python
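+import httpx
+from trl import GRPOConfig, GRPOTrainer
+# The reward function queries CodeArena's /step endpoint.
+# TRL calls reward functions with keyword arguments (prompts, completions,
+# plus any extra dataset columns), so the signature accepts **kwargs.
+def codearena_reward_func(completions, prompts, **kwargs):
+    rewards = []
+    for completion in completions:
+        # Conversational format: each completion is a list of message dicts
+        proposed_fix = completion[0].get('content', '').strip()
+        res = httpx.post("http://localhost:7860/step",
+                         json={"proposed_fix": proposed_fix})
+        reward = res.json().get('reward', 0.0)
+        rewards.append(reward)
+    return rewards
+# GRPO training with CodeArena as the reward environment
+# (`model` and `dataset` are loaded earlier in the notebook)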
+trainer = GRPOTrainer(
+    model=model,
+    reward_funcs=codearena_reward_func,
+    args=GRPOConfig(
+        output_dir="./codearena-grpo",
+        learning_rate=1e-5,
+        max_steps=50,
+        per_device_train_batch_size=2,
+    ),
+    train_dataset=dataset,
+)
+trainer.train()
+```
+The key insight is that **the reward is not static** — it comes from actually executing the agent's proposed code against real unit tests in a sandboxed environment, then grading it with the hybrid scorer. This is true environment-in-the-loop RL, not reward modeling on a frozen dataset.
+### Training Results
+We trained `Qwen/Qwen2.5-Coder-1.5B` on the `m-a-p/Code-Feedback` dataset with CodeArena as the reward environment.
+![Reward Curve](results/reward_curve.png)
+*Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.*
+![Reward by Task](results/reward_by_task.png)
+*Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging — exactly the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.*
+### Reproducing the Training
+The complete training pipeline is available as a Colab notebook:
+👉 **[Open in Google Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)**
+The notebook:
+1. Installs all dependencies (`trl`, `transformers`, `httpx`)
+2. Clones the CodeArena repository
+3. Starts the FastAPI backend server
+4. Loads `Qwen2.5-Coder-1.5B` with GRPO configuration
+5. Trains against the live environment
+6. Logs rewards per step
+---
+## Live Demo: Try It Now
+The fully-functional CodeArena environment is deployed on Hugging Face Spaces with a React frontend dashboard:
+👉 **[https://huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl)**
+### What You Can Do on the Live Demo:
+1. **Start an Episode**: Select Easy/Medium/Hard difficulty and load a buggy code task
+2. **Manual Fix**: Edit the code yourself and click "Run Step" to see your reward
+3. **AI Fix**: Click the ✨ AI FIX button to have the built-in AI repair agent (powered by `Qwen2.5-Coder-3B-Instruct` via HuggingFace Serverless Inference) generate a fix
+4. **Agent Mode**: Toggle auto-pilot to watch the agent autonomously fix → test → fix → test in a loop
+5. **Sandbox Mode**: Paste your own arbitrary Python code and watch the environment evaluate it
+The dashboard shows real-time reward components (compile score, test ratio, efficiency), a terminal log of every step, and a reward chart that updates live.
+---
+## Technical Deep Dive
+### Sandboxed Execution
+All agent-submitted code runs in an isolated subprocess with:
+- **AST syntax validation** before execution (catches syntax errors without running code)
+- **Timeout enforcement** (configurable per task, default 5s)
+- **Temporary file execution** (code is written to a temp file, executed, then deleted)
+- **Structured output parsing** (test results are communicated via a `|CODEARENA_STATS|` sentinel)
+This keeps syntax errors, crashes, and infinite loops in submitted code from taking down the server.
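The first three safeguards in the list (AST validation, timeout, temporary file) can be combined in one short function. This is a minimal sketch under stated assumptions, not the project's executor: `run_sandboxed` and the shape of its result dictionary are hypothetical, and parsing of the `|CODEARENA_STATS|` sentinel is left out because its format is not described here. A plain subprocess also does not restrict file or network access.

```python
import ast
import os
import subprocess
import sys
import tempfile


def run_sandboxed(code, timeout=5.0):
    """Validate, then run untrusted code in a child process with a time limit."""
    # 1. AST validation: reject syntax errors without executing anything.
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return {"status": "syntax_error", "stdout": "", "stderr": f"SyntaxError: {exc}"}

    # 2. Write the code to a temporary file that is deleted afterwards.
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(code)
        # 3. Run in a separate interpreter; the child is killed on timeout.
        try:
            proc = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "stdout": "", "stderr": f"Timed out after {timeout}s"}
        status = "ok" if proc.returncode == 0 else "runtime_error"
        return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
    finally:
        os.remove(path)
```

Returning a status string instead of raising lets the caller turn every outcome, including timeouts, into an error log for the agent's next step.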
+### AI Code Fixer Pipeline
+The built-in AI fixer (`server/ai_fixer.py`, 600+ lines) implements a sophisticated multi-fallback pipeline:
+1. **TGI / HuggingFace Serverless API** (Priority 1): Calls `Qwen2.5-Coder-3B-Instruct` for high-quality fixes
+2. **Local Ollama** (Priority 2): Falls back to a local LLM if available
+3. **AST Pattern-Based Fixer** (Priority 3): 20+ pattern rules for common Python bugs:
+   - Missing colons after `def`, `if`, `for`, `while`
+   - Missing `return` statements
+   - Wrong comparison operators (`=` → `==`)
+   - Missing `self` parameter in class methods
+   - Incorrect indentation repair
+   - And many more
+The fixer also includes a **code validator** that rejects fixes worse than the original (e.g., ones that introduce new syntax errors), and a **self-critique loop** that re-checks the generated code before returning it.
+### Complexity-Reward Tracking
+Every fix is logged to `complexity_rewards.csv` with:
+- Task ID
+- Reward achieved
+- Detected time complexity
+- Fix method (TGI/Ollama/built-in)
+This creates a research dataset for testing our core hypothesis: **agents that produce O(n) solutions consistently receive higher rewards than those producing O(n²) solutions.**
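A log like this can be summarized by grouping rewards by detected complexity. The sketch below assumes hypothetical column names (`task_id`, `reward`, `complexity`, `method`), since the actual header of `complexity_rewards.csv` is not shown, and the sample rows are made-up placeholders that only exercise the function; they are not results.

```python
import csv
import io
from collections import defaultdict


def mean_reward_by_complexity(csv_file):
    """Return {complexity label: mean reward} from an open CSV file object."""
    totals = defaultdict(lambda: [0.0, 0])  # label -> [reward sum, row count]
    for row in csv.DictReader(csv_file):
        bucket = totals[row["complexity"]]
        bucket[0] += float(row["reward"])
        bucket[1] += 1
    return {label: total / count for label, (total, count) in totals.items()}


# Placeholder rows (NOT real measurements) using the assumed column names.
sample = io.StringIO(
    "task_id,reward,complexity,method\n"
    "demo-a,0.50,O(n),built-in\n"
    "demo-b,0.70,O(n),built-in\n"
    "demo-c,0.40,O(n²),built-in\n"
)
summary = mean_reward_by_complexity(sample)
print(summary)
```

With the real file, the same function can be called as `mean_reward_by_complexity(open("complexity_rewards.csv", newline=""))` once the column names are adjusted to match.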
+---
+## Why CodeArena Matters
+**Writing code is a solved problem.** GPT-4, Claude, Gemini — they can all generate working functions from natural language descriptions.
+**Debugging code autonomously — reasoning about failure, iterating on fixes, recovering from wrong turns — is not solved.**
+Every production coding system will eventually face broken code. There is no other standardized RL environment that trains and benchmarks iterative repair at this level. CodeArena fills that gap with:
+- A **hybrid grader** that prevents reward-hacking
+- An **adaptive curriculum** for continuous self-improvement
+- A **persistent memory** for cross-episode learning
+- A **rich task library** spanning syntax, logic, algorithms, types, and security
+- Full **OpenEnv compatibility** for plug-and-play evaluation
+CodeArena is infrastructure. Plug any model in. Run it. Get a number. Compare it against the baseline. Train on it. Watch it improve.
+---
+## Links & Resources
+| Resource | Link |
+|---|---|
+| 🤗 Live Demo (HF Space) | [huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl) |
+| 📓 Training Notebook (Colab) | [Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb) |
+| 💻 Source Code (GitHub) | [github.com/havinashpatil/meta](https://github.com/havinashpatil/meta) |
+| 📋 OpenEnv Manifest | [openenv.yaml](https://github.com/havinashpatil/meta/blob/main/openenv.yaml) |
+---
+*Built for the OpenEnv Hackathon India 2026 — Theme #4: Self-Improvement*

README.md CHANGED

@@ -6,124 +6,232 @@ colorTo: purple
 sdk: docker
 pinned: true
 ---
-[![HuggingFace Space](https://img.shields.io/badge/🤗%20Space-Live-brightgreen)](https://huggingface.co/spaces/ceoavinash/codearena-rl)
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)
-[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-blue)](./openenv.yaml)
 [![Theme](https://img.shields.io/badge/Theme%20%234-Self--Improvement-purple)]()
-# 🚀 CodeArena: The Iterative Code Repair RL Benchmark
-GitHub Copilot, Cursor, Devin — every major coding AI is benchmarked on *generation*. Can it write a function? Can it complete a snippet?
-But nobody benchmarks what happens when the code **breaks**. When the agent has to reason about failure, read error logs, iterate on fixes, and recover from its own mistakes.
-**CodeArena** measures exactly that. It is the first standardized, open-source Reinforcement Learning environment built specifically for **iterative code repair**. It grades an agent not just on whether the tests pass, but on whether the fix is correct, secure, and algorithmically efficient.
----
-## 🎯 Hackathon Theme Alignment: Theme #4 (Self-Improvement)
-CodeArena directly tackles **Theme #4: Self-Improvement**.
-Instead of a fixed set of tasks, CodeArena features an **Adaptive Curriculum**. The environment continuously tracks the agent's rolling average reward over the last 10 episodes. If an agent masters easy syntax errors (avg reward > 0.80), the environment automatically escalates the difficulty to algorithmic logic bugs. If the agent struggles, it de-escalates to allow recovery.
-The goal is recursive skill amplification: the agent learns to drive its own capability growth without plateauing on memorized, simple solutions.
 ---
-## ✨ Environment Innovation (What makes it special?)
-### 1. The Gap Nobody Is Measuring
-We have countless environments for generating code (HumanEval, MBPP). CodeArena is the first standardized RL environment for the *debugging loop*. It simulates the real-world workflow: write → test → read error → fix → repeat.
-### 2. LLM-as-Judge Hybrid Grader
-Most benchmarks ask a binary question: *did the tests pass?* CodeArena uses a rich **Hybrid Grader**. A deterministic test runner checks correctness, while a built-in LLM Judge (powered by TGI/Hugging Face Serverless) scores the fix on security, readability, and algorithmic complexity (O(N) vs O(N²)). This prevents reward-hacking where agents produce syntactically correct but fundamentally broken code just to pass a weak test.
-### 3. Complex Shaped Rewards
-Rewards are a weighted composite, heavily shaped to encourage professional engineering:
-- **Test Pass Ratio (40%)**: Fraction of unit tests passed.
-- **LLM Judge Score (30%)**: Correctness + Security + Code Quality.
-- **Compile Score (20%)**: Does it run without crashing?
-- **Efficiency Score (10%)**: Speed vs optimal runtime.
-- **Step Penalty (-0.02/step)**: Rewards faster fixes over meandering trial-and-error.
 ---
-## 📈 Evidence of Training & Rewards
-We successfully trained a model using **TRL GRPO** (Group Relative Policy Optimization) on the CodeArena environment.
-Below is the observable evidence of the agent's training progress. The agent started with a low success rate on algorithmic bugs, but as the GRPO training progressed, it learned to systematically read the `error_log` observation and output correct code, resulting in a climbing reward curve.
 ![Reward Curve](results/reward_curve.png)
-*Episode reward over training steps. The rolling 10-step average shows clear learning and improvement.*
 ![Reward by Task](results/reward_by_task.png)
-*Average reward broken down by task category. The agent performs well on syntax and type errors, while Medium/Hard algorithmic tasks remain challenging but improving.*
-### 🏃‍♂️ Run the Training Script
-We have provided our complete TRL GRPO training pipeline in a Colab notebook so judges can re-run and verify the training process end-to-end:
-👉 **[Open Training Script in Google Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)**
 ---
-## 💻 Try the Live Environment (Hugging Face Space)
-We have deployed the fully-functional CodeArena environment, complete with a React frontend dashboard that visualizes the RL process in real-time.
-👉 **[Live Demo: CodeArena on Hugging Face Spaces](https://huggingface.co/spaces/ceoavinash/codearena-rl)**
-The live space includes a built-in **AI Code Fixer** powered by Hugging Face's Serverless Inference API (using `Qwen2.5-Coder-3B-Instruct`), allowing you to test the agent's repair capabilities directly in your browser.
-### Features of the Live Space:
-- **Real-time Monitoring**: Watch the agent's compile score, test ratio, and LLM judge scores update live.
-- **Sandbox Mode**: Paste your own broken Python code and watch the environment evaluate it.
-- **Agent Mode**: Toggle auto-pilot to watch the agent fix code in a continuous loop until optimal.
----
-## 🛠️ Architecture & Setup (OpenEnv Compatible)
-This benchmark strictly adheres to the **OpenEnv** specification (`openenv.yaml`).
-**Data Flow:** `Agent` → `POST /reset` → `buggy_code` → `POST /step` → `LLM Judge & Test Runner` → `reward` → `Agent`
-### Local Development
-1. **Install Dependencies:**
-   ```bash
-   pip install -r requirements.txt
-   cd frontend && npm install
-   ```
-2. **Generate Task Database:**
-   ```bash
-   python create_tasks.py
-   ```
-3. **Run the FastAPI Backend:**
-   The backend acts as the OpenEnv entrypoint and serves the compiled React dashboard.
-   ```bash
-   uvicorn server.app:app --port 7860
-   ```
-4. **Evaluate a Local Agent (Inference):**
-   You can evaluate any local agent (e.g., Ollama or a HuggingFace pipeline) programmatically via `inference.py`.
-   ```bash
-   export MODEL_NAME="codellama:7b-instruct"
-   python inference.py --backend openai
-   ```
 ---
-## 🔗 Quick Links
 | Resource | URL |
 |---|---|
-| **Hugging Face Space (Live Demo)** | [CodeArena on HF Spaces](https://huggingface.co/spaces/ceoavinash/codearena-rl) |
-| **Colab Training Notebook (TRL)** | [Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb) |
-| **OpenEnv Specification** | [openenv.yaml](./openenv.yaml) |
-| **Demo Video / Blog Post** | *(Add link to YouTube/HF Blog here if available)* |
 ---
-*Built for the OpenEnv Hackathon India 2026.*

 sdk: docker
 pinned: true
 ---
+[![HuggingFace Space](https://img.shields.io/badge/🤗%20Live%20Demo-CodeArena-brightgreen)](https://huggingface.co/spaces/ceoavinash/codearena-rl)
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)
+[![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2%2B%20Compatible-blue)](./openenv.yaml)
 [![Theme](https://img.shields.io/badge/Theme%20%234-Self--Improvement-purple)]()
+[![Blog](https://img.shields.io/badge/📝%20Blog-Read%20Writeup-orange)](./BLOG.md)
+# 🚀 CodeArena: Iterative Code Repair as an RL Environment
+> **TL;DR** — An OpenEnv-compatible RL environment where an LLM agent debugs Python code across multiple attempts, graded by unit tests + LLM-as-Judge + algorithmic efficiency. Features adaptive difficulty, agent memory, and a full TRL GRPO training pipeline.
+---
+## 🎯 The Problem
+Every coding AI is benchmarked on **generation** — write a function, complete a snippet. **Nobody benchmarks what happens when the code breaks.** In production, developers spend the majority of their time reading error logs, reasoning about failures, iterating on fixes, and recovering from wrong turns. There is no standardized RL environment for this iterative debugging loop.
+**CodeArena fills that gap.** It is the first open-source RL environment built specifically for *iterative code repair*, where an agent must fix buggy Python code over multiple steps, learning from execution feedback after each attempt.
+---
+## 🧠 Theme Alignment: #4 — Self-Improvement
+CodeArena directly targets **Theme #4: Self-Improvement** through three mechanisms:
+1. **Adaptive Curriculum**: Difficulty escalates automatically when the agent's rolling average reward exceeds 0.80 on Easy or 0.75 on Medium, and de-escalates when it drops below 0.35. The agent drives its own training progression.
+2. **Persistent Agent Memory**: Best solutions per task are stored in `agent_memory.json` and retrieved in future episodes, creating cross-episode learning.
+3. **Adaptive Prompting**: The AI fixer adjusts its strategy based on current reward level — syntax focus at low rewards, algorithm optimization at high rewards.
 ---
+## ✨ Environment Innovation (40%)
+### Hybrid Grader — Tests + LLM-as-Judge
+Most benchmarks ask: *did the tests pass?* CodeArena also asks: *is the fix correct, secure, efficient, and readable?*
+| Component | Weight | Signal |
+|---|---|---|
+| `compile_score` | 15% | Code compiles without error |
+| `test_pass_ratio` | 35% | Fraction of unit tests passed |
+| `efficiency_score` | 30% | Execution time vs optimal (O(n) rewarded, O(n²) penalized) |
+| `llm_correctness` | 10% | LLM judge: logical correctness |
+| `llm_security` | 5% | LLM judge: no vulnerabilities introduced |
+| `llm_quality` | 5% | LLM judge: readability and maintainability |
+**Penalties:** `-0.01/step` (rewards faster fixes) and `-0.10` for repeating an identical fix (prevents reward-hacking via repetition).
+The 30% efficiency weight means an agent that passes all tests with O(n²) brute-force gets a significantly lower reward than one using O(n). This forces the model to learn *algorithmic reasoning*, not just syntax repair.
+### Algorithm Detector
+A built-in classifier (`server/algorithm_detector.py`) identifies the problem type (Kadane's, Two-Sum, Sliding Window, etc.) and estimates time complexity from loop nesting. This drives targeted optimization hints during repair.
+### Sandboxed Execution
+All code runs in isolated subprocesses with AST pre-validation, timeout enforcement, and temporary file cleanup. Malicious or infinite-loop code cannot crash the server.
+### 9 Tasks Across 5 Categories
+| Category | Example | Tests |
+|---|---|---|
+| Easy (syntax) | Missing colons, indentation | Basic repair |
+| Medium (logic) | Off-by-one, wrong conditions | Reasoning |
+| Hard (algorithms) | O(n²) → O(n) refactoring | Optimization |
+| Type Errors | Wrong types, missing casts | Type safety |
+| Security Bugs | SQL injection, path traversal | Security awareness |
 ---
+## 📊 Storytelling (30%) — How It Works
+**Data Flow:** `Agent` → `POST /reset` → receives `buggy_code + error_log` → `POST /step` with `proposed_fix` → sandboxed execution → hybrid grading → `reward + updated error_log` → repeat up to 5 steps.
+```
+Episode Walkthrough:
+────────────────────────
+Step 1: Agent receives def solve(n) print(n)
+        → Proposes:     def solve(n): print(n)
+        → Result:       ✓ Compiles, 1/3 tests pass
+        → Reward:       0.35
+Step 2: Agent reads error: "AssertionError: solve(5) != 25"
+        → Proposes:     def solve(n): return n**2
+        → Result:       ✓ 3/3 tests pass, but O(n) expected
+        → Reward:       0.72
+Step 3: Agent reads hint: "Optimize to O(1)"
+        → Proposes:     def solve(n): return n*n
+        → Result:       ✓ 3/3 pass, O(1) optimal
+        → Reward:       0.95 ✅
+```
+The agent must learn to **read error messages**, **avoid repeating failed fixes**, and **optimize for efficiency** — not just correctness. This mirrors real-world software engineering.
+---
+## 📈 Showing Improvement in Rewards (20%)
+We trained `Qwen/Qwen2.5-Coder-1.5B` using **TRL GRPO** (Group Relative Policy Optimization) with CodeArena as the live reward environment.
 ![Reward Curve](results/reward_curve.png)
+*Episode reward over training steps. The rolling 10-step average shows clear learning progression from near-zero to consistent 0.65+ rewards.*
 ![Reward by Task](results/reward_by_task.png)
+*Average reward by task category. Easy/type-error tasks are mastered first; algorithmic optimization remains challenging — exactly the curriculum behavior we designed for.*
+### Key Observations:
+- **Initial performance**: Agent produces syntactically broken fixes → reward ≈ 0.01
+- **After 20 steps**: Agent learns to fix syntax → reward ≈ 0.35
+- **After 40 steps**: Agent learns to pass tests → reward ≈ 0.65
+- **Steady improvement**: Rolling average trends upward, with hard tasks remaining the frontier challenge
 ---
+## 🔧 Reward & Training Pipeline (10%)
+### Training Script (Colab)
+👉 **[Open Training Notebook in Google Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)**
+The notebook demonstrates environment-in-the-loop RL:
+```python
+import httpx
+from trl import GRPOConfig, GRPOTrainer
+def codearena_reward_func(completions, prompts, **kwargs):
+    """Reward function that queries the live CodeArena environment."""
+    rewards = []
+    for completion in completions:
+        proposed_fix = completion[0].get('content', '').strip()
+        res = httpx.post("http://localhost:7860/step",
+                         json={"proposed_fix": proposed_fix})
+        reward = res.json().get('reward', 0.0)
+        rewards.append(reward)
+    return rewards
+trainer = GRPOTrainer(
+    model=model,  # Qwen2.5-Coder-1.5B
+    reward_funcs=codearena_reward_func,
+    args=GRPOConfig(output_dir="./codearena-grpo",
+                    learning_rate=1e-5, max_steps=50),
+    train_dataset=dataset,  # m-a-p/Code-Feedback
+)
+trainer.train()
+```
+The reward is **not static** — it comes from actually executing the agent's code in a sandboxed environment, running real unit tests, and scoring with the hybrid grader. This is true environment-in-the-loop RL.
+### Inference Evaluation
+```bash
+# Evaluate any model against CodeArena
+export MODEL_NAME="codellama:7b-instruct"
+python inference.py --backend openai
+```
+Results are logged to `rewards_log.csv` and can be visualized with `python plot_rewards.py`.
+---
+## 🏗️ Architecture (OpenEnv Compatible)
+```
+codearena-rl/
+├── openenv.yaml              # OpenEnv manifest (observation/action spaces)
+├── server/
+│   ├── app.py                # FastAPI entrypoint (/reset, /step, /state)
+│   ├── models.py             # Pydantic schemas (Observation, Action, Task)
+│   ├── executor.py           # Sandboxed subprocess execution
+│   ├── grader.py             # Hybrid reward (tests + LLM judge)
+│   ├── ai_fixer.py           # Multi-fallback AI repair (TGI→Ollama→AST)
+│   ├── algorithm_detector.py # Problem classification + complexity detection
+│   ├── memory.py             # Persistent agent memory (best solutions)
+│   └── raw_runner.py         # Sandbox mode executor
+├── tasks/
+│   ├── easy.py, medium.py, hard.py
+│   ├── type_errors/          # 3 type error tasks
+│   └── security_bugs/        # 3 security bug tasks
+├── frontend/                 # React + Vite dashboard
+├── train_grpo.ipynb          # TRL GRPO training notebook
+├── inference.py              # CLI evaluation runner
+├── plot_rewards.py           # Reward visualization
+└── Dockerfile                # HF Spaces deployment
+```
+### Quick Start
+```bash
+pip install -r requirements.txt
+python create_tasks.py           # Generate task database
+uvicorn server.app:app --port 7860  # Start environment
+```
+### OpenEnv API
+| Endpoint | Method | Description |
+|---|---|---|
+| `/reset` | POST | Initialize environment with `{"task_id": "easy\|medium\|hard\|auto"}` |
+| `/step` | POST | Submit fix with `{"proposed_fix": "..."}` → reward + observation |
+| `/state` | GET | Current observation |
+| `/health` | GET | Server health check |
+| `/fix` | POST | AI code repair endpoint |
+| `/curriculum` | GET | Adaptive difficulty state |
+| `/stats` | GET | Complexity vs reward analytics |
+| `/memory` | GET | Agent memory contents |
+---
+## 💻 Live Demo
+👉 **[https://huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl)**
+Features:
+- **Real-time dashboard** with reward charts, terminal logs, and code editor
+- **AI Fix button** powered by HuggingFace Serverless Inference (`Qwen2.5-Coder-3B-Instruct`)
+- **Agent Mode** toggle for autonomous fix → test → fix loops
+- **Sandbox Mode** for arbitrary Python code evaluation
 ---
+## 🔗 All Links
 | Resource | URL |
 |---|---|
+| **🤗 HuggingFace Space (Live)** | [huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl) |
+| **📓 Training Notebook (Colab)** | [Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb) |
+| **📝 Blog / Writeup** | [BLOG.md](./BLOG.md) |
+| **💻 GitHub Repository** | [github.com/havinashpatil/meta](https://github.com/havinashpatil/meta) |
+| **📋 OpenEnv Manifest** | [openenv.yaml](./openenv.yaml) |
 ---
+*Built for the OpenEnv Hackathon India 2026 — Theme #4: Self-Improvement*