---
title: CodeArena RL Benchmark
emoji: 🏟️
colorFrom: green
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# CodeArena RL Benchmark

GitHub Copilot, Cursor, Devin: every major coding AI is
benchmarked on generation. Can it write a function? Can it
complete a snippet? Nobody benchmarks what happens when the
code breaks and the agent has to reason about failure, iterate
on fixes, and recover from mistakes.

CodeArena measures exactly that. It is the first standardized,
open-source reinforcement learning environment built specifically
for iterative code repair, graded not just on test pass rates
but on whether the fix is correct, secure, and written to a
professional standard.

## Features

- **Adaptive Curriculum**: The environment supports an `auto` difficulty mode that dynamically scales task complexity based on the agent's rolling average reward.
- **Complex Shaped Rewards**: Rewards are a weighted composite:

  | Component | Weight | What it measures |
  |---|---|---|
  | compile_score | 20% | Code compiles without error |
  | test_pass_ratio | 40% | Fraction of unit tests passed |
  | efficiency_score | 10% | Speed vs optimal runtime |
  | llm_judge_score | 30% | Correctness + Security + Code Quality |
  | step_penalty | -0.02/step | Rewards faster fixes |
  | novelty_penalty | -0.10 | Penalises repeating identical fixes |

  All rewards are clamped to [0.001, 0.999].
- **Extensive Task Categories**: Includes standard algorithmic tasks, `type_errors`, and `security_bugs`.
- **Real-time Reward Visualization**: Watch the compile score, test pass ratio, and LLM judge score update live in the React frontend as the agent works.

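
The reward table can be read as a small formula. The following is a minimal sketch, not the project's grader: the function name `shaped_reward` is hypothetical, and it assumes that each component score lies in [0, 1], that the step penalty scales with the number of steps taken, and that the novelty penalty is applied once when a fix repeats an earlier submission.

```python
def shaped_reward(compile_score, test_pass_ratio, efficiency_score,
                  llm_judge_score, steps_taken, repeated_fix):
    """Hypothetical composite reward built from the documented weights."""
    # Weighted components: 20% + 40% + 10% + 30% = 100%.
    weighted = (0.20 * compile_score
                + 0.40 * test_pass_ratio
                + 0.10 * efficiency_score
                + 0.30 * llm_judge_score)
    # Assumption: -0.02 for every step taken so far.
    penalty = 0.02 * steps_taken
    # Assumption: a flat -0.10 when the fix is identical to an earlier one.
    if repeated_fix:
        penalty += 0.10
    # Clamp to the documented range [0.001, 0.999].
    return min(0.999, max(0.001, weighted - penalty))
```

Under these assumptions a perfect first-try fix is capped at 0.999, and a submission that fails everything still receives 0.001 rather than zero.
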
## Adaptive Curriculum

CodeArena tracks the agent's rolling average reward and
escalates or de-escalates difficulty automatically.
An agent cannot plateau by memorising easy tasks.

| Condition | Transition |
|---|---|
| avg reward > 0.80 on easy | → medium |
| avg reward > 0.75 on medium | → hard |
| avg reward < 0.35 on hard | → medium |
| avg reward < 0.35 on medium | → easy |

At least 3 episodes must run at each level before any transition.

Enable it by sending `POST /reset` with the body `{"task_id": "auto"}`.
Monitor it live with `GET /curriculum`.

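
The transition rules can be sketched as a small state machine. This is an illustration under stated assumptions, not the server's implementation: the class name `Curriculum`, the 10-episode rolling window, and clearing the window after each transition (so that the 3-episode minimum restarts at the new level) are all assumptions. Only the thresholds and the 3-episode minimum come from the table above.

```python
from collections import deque

LEVELS = ["easy", "medium", "hard"]
PROMOTE_ABOVE = {"easy": 0.80, "medium": 0.75}   # from the transition table
DEMOTE_BELOW = {"medium": 0.35, "hard": 0.35}    # from the transition table
MIN_EPISODES = 3


class Curriculum:
    """Hypothetical difficulty tracker driven by a rolling average reward."""

    def __init__(self, window=10):  # window size is an assumption
        self.level = "easy"
        self.rewards = deque(maxlen=window)

    def record(self, reward):
        """Record one episode reward and return the (possibly new) level."""
        self.rewards.append(reward)
        if len(self.rewards) < MIN_EPISODES:
            return self.level
        avg = sum(self.rewards) / len(self.rewards)
        idx = LEVELS.index(self.level)
        if avg > PROMOTE_ABOVE.get(self.level, float("inf")):
            self.level = LEVELS[idx + 1]
            self.rewards.clear()  # assumed: history restarts at the new level
        elif avg < DEMOTE_BELOW.get(self.level, float("-inf")):
            self.level = LEVELS[idx - 1]
            self.rewards.clear()
        return self.level
```

Because `hard` has no promotion threshold and `easy` has no demotion threshold, the lookups fall back to infinities and those transitions never fire.
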
## Architecture

**Data Flow:** Agent → `/reset` → buggy_code → `/step` → subprocess → LLM judge → reward → Agent

- `server/`: FastAPI backend acting as the OpenEnv entrypoint.
- `frontend/`: React + Vite frontend for live monitoring and manual intervention.
- `tasks/`: Task definitions stored in OpenEnv-compatible JSON schema.
- `inference.py`: CLI runner for evaluating RL agents, supporting both OpenAI-compatible APIs and native HuggingFace `transformers` pipelines.

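
The data flow above could be driven by a loop like the one below. Treat it as a sketch only: the `/reset` and `/step` paths and the `task_id` key come from this README, but the response fields `buggy_code`, `reward`, and `done`, the `/step` payload key `code`, and the helper names `http_post` and `run_episode` are assumptions. Check `openenv.yaml` and the server code for the actual schema. The transport is passed in as a function so the loop can be exercised without a running server.

```python
import json
import urllib.request


def http_post(base_url):
    """Return a function that POSTs JSON to base_url + path and decodes the reply."""
    def post(path, payload):
        req = urllib.request.Request(
            base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))
    return post


def run_episode(post, propose_fix, task_id="auto", max_steps=5):
    """Reset the environment, then submit fixes until done or out of steps.

    Field names ("buggy_code", "code", "reward", "done") are hypothetical.
    """
    obs = post("/reset", {"task_id": task_id})
    code, rewards = obs["buggy_code"], []
    for _ in range(max_steps):
        code = propose_fix(code)               # the agent's repair attempt
        result = post("/step", {"code": code})
        rewards.append(result["reward"])
        if result.get("done"):
            break
    return rewards
```

To talk to a locally running server, pass `http_post("http://localhost:7860")` as `post` and your agent's repair function as `propose_fix`.
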
## Results

![Reward Curve](results/reward_curve.png)

*Episode reward over training steps. Rolling 10-step average shown.*

![Reward by Task](results/reward_by_task.png)

*Average reward per task category.*

| Model | Easy | Medium | Hard | Avg |
|---|---|---|---|---|
| GPT-4o | - | - | - | - |
| Qwen-72B | - | - | - | - |
| Llama-3-8B | - | - | - | - |

## Why It Matters

Every production coding AI needs to debug, not just write.
There is no other standardized RL environment that trains
and benchmarks iterative repair. The hybrid grader, which
combines deterministic test execution with LLM quality judgment,
means agents cannot game the reward by memorising solutions
or producing syntactically correct but semantically wrong fixes.

## Setup

1. **Install Dependencies:**

   ```bash
   pip install -r requirements.txt
   cd frontend && npm install
   ```

2. **Generate New Tasks:**
   To populate the extended task categories (`type_errors` and `security_bugs`), run the task generator. Run this first, or the new task categories will not exist.

   ```bash
   python create_tasks.py
   ```

## Usage

### 1. Run the Backend Server

The server is required for both the frontend dashboard and RL training.

```bash
uvicorn server.app:app --port 7860
```

### 2. Run the Frontend Dashboard

```bash
cd frontend
npm run dev
```

Navigate to `http://localhost:3000` to access the live RL monitoring dashboard.

### 3. Run Inference Evaluation

You can evaluate a local agent or pipeline from the command line with `inference.py`. The model and endpoint are set through environment variables.

**Using OpenAI-Compatible Endpoints (e.g., Ollama or vLLM):**

```bash
export API_BASE_URL="http://localhost:11434/v1"
export MODEL_NAME="codellama"
python inference.py --backend openai
```

**Using HuggingFace Transformers (Local pipeline):**

```bash
export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B"
python inference.py --backend hf
```

## Reward Analysis

As your agent interacts with the environment, inference logs are automatically written to `rewards_log.csv`.
To visualize the reward curves over training steps and average rewards by task category, run:

```bash
python plot_rewards.py
```

This generates `reward_curve.png` and `reward_by_task.png` in the `results/` directory.

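
For a quick look at the log without generating plots, a rolling average can be computed directly. This sketch is independent of `plot_rewards.py` and makes one loud assumption: that `rewards_log.csv` has a header row with a column named `reward`. The real column names are not documented here, so adjust `column` to match the file. The sample data below is made up for illustration.

```python
import csv
import io


def rolling_mean(values, window=10):
    """Trailing mean over up to `window` most recent values at each position."""
    out, total = [], 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]   # drop the value that left the window
        out.append(total / min(i + 1, window))
    return out


def load_rewards(csv_file, column="reward"):
    """Read one float per row from an open CSV file; `column` is a guess."""
    return [float(row[column]) for row in csv.DictReader(csv_file)]


# Illustrative in-memory data standing in for rewards_log.csv.
sample = io.StringIO("step,reward\n1,0.25\n2,0.5\n3,1.0\n")
print(rolling_mean(load_rewards(sample), window=2))
```

With the real file, replace `sample` with `open("rewards_log.csv", newline="")` and keep the default 10-step window to match the caption on the reward curve.
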
## OpenEnv Compatibility

This benchmark strictly adheres to the OpenEnv specification. See `openenv.yaml` for full configuration details.

## Links

- HuggingFace Space: https://huggingface.co/spaces/adityanaikhpt/codearena
- Colab Training Notebook: [URL]
- HuggingFace Blog Post: [URL]
- Demo Video: [URL]