metadata
title: CodeArena RL Benchmark
emoji: 🏟️
colorFrom: green
colorTo: purple
sdk: docker
app_port: 7860
pinned: false

CodeArena RL Benchmark

GitHub Copilot, Cursor, Devin: every major coding AI is benchmarked on generation. Can it write a function? Can it complete a snippet? Nobody benchmarks what happens when the code breaks and the agent has to reason about failure, iterate on fixes, and recover from mistakes.

CodeArena measures exactly that. It is the first standardized, open-source reinforcement learning environment built specifically for iterative code repair, graded not just on test pass rates but on whether the fix is correct, secure, and written to a professional standard.

Features

  • Adaptive Curriculum: The environment supports an auto difficulty mode that dynamically scales task complexity based on the agent's recent rolling average rewards.
  • Complex Shaped Rewards: Rewards are a weighted composite:
| Component | Weight | What it measures |
|---|---|---|
| compile_score | 20% | Code compiles without error |
| test_pass_ratio | 40% | Fraction of unit tests passed |
| efficiency_score | 10% | Speed vs optimal runtime |
| llm_judge_score | 30% | Correctness + Security + Code Quality |
| step_penalty | -0.02/step | Rewards faster fixes |
| novelty_penalty | -0.10 | Penalises repeating identical fixes |

All rewards clamped to [0.001, 0.999]

  • Extensive Task Categories: Includes standard algorithmic tasks, type_errors, and security_bugs.
  • Real-time Reward Visualization: Use the React frontend to watch the compile score, test ratio, and LLM judge score update live as the agent works.
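
The reward components listed above can be combined roughly as follows. This is a minimal sketch, not the server's actual grader: the weights, penalty sizes, and clamp range come from this README, but the function name composite_reward, the assumption that each component score lies in [0, 1], and the assumption that both penalties are subtracted from the weighted sum before clamping are illustrative guesses.

```python
# Weights, penalties, and clamp range as listed in this README.
WEIGHTS = {
    "compile_score": 0.20,
    "test_pass_ratio": 0.40,
    "efficiency_score": 0.10,
    "llm_judge_score": 0.30,
}
STEP_PENALTY = 0.02      # subtracted once per step taken (assumed)
NOVELTY_PENALTY = 0.10   # subtracted when a fix repeats an earlier one (assumed)
REWARD_MIN, REWARD_MAX = 0.001, 0.999


def composite_reward(scores: dict[str, float], steps_taken: int, repeated_fix: bool) -> float:
    """Hypothetical helper: weighted sum of component scores minus penalties, then clamped."""
    base = sum(weight * scores[name] for name, weight in WEIGHTS.items())
    penalty = STEP_PENALTY * steps_taken
    if repeated_fix:
        penalty += NOVELTY_PENALTY
    return min(REWARD_MAX, max(REWARD_MIN, base - penalty))


if __name__ == "__main__":
    example = {
        "compile_score": 1.0,
        "test_pass_ratio": 0.5,
        "efficiency_score": 0.5,
        "llm_judge_score": 0.5,
    }
    print(composite_reward(example, steps_taken=2, repeated_fix=True))
```

Under these assumptions, the clamp means a perfect fix never reports exactly 1.0 and a complete failure never reports exactly 0.0.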

Adaptive Curriculum

CodeArena tracks the agent's rolling average reward and escalates or de-escalates difficulty automatically. An agent cannot plateau by memorising easy tasks.

| Condition | Transition |
|---|---|
| avg reward > 0.80 on easy | → medium |
| avg reward > 0.75 on medium | → hard |
| avg reward < 0.35 on hard | → medium |
| avg reward < 0.35 on medium | → easy |

A minimum of 3 episodes is required at each level before any transition.

Enable with: POST /reset with {"task_id": "auto"}
Monitor live with: GET /curriculum
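
The transition rules can be expressed as a small state machine. The sketch below is illustrative only: the thresholds and the 3-episode minimum come from this README, while the class name Curriculum, the rolling window size of 5, and the choice to clear the reward history after each transition are assumptions rather than the server's documented behaviour.

```python
from collections import deque

LEVELS = ["easy", "medium", "hard"]
PROMOTE_ABOVE = {"easy": 0.80, "medium": 0.75}   # avg reward above this moves up
DEMOTE_BELOW = {"medium": 0.35, "hard": 0.35}    # avg reward below this moves down
MIN_EPISODES = 3                                  # episodes required at a level first


class Curriculum:
    """Hypothetical tracker for the auto difficulty mode."""

    def __init__(self, window: int = 5):  # window size is an assumption
        self.level = "easy"
        self.rewards = deque(maxlen=window)

    def record(self, reward: float) -> str:
        """Record one episode reward and return the level for the next episode."""
        self.rewards.append(reward)
        if len(self.rewards) < MIN_EPISODES:
            return self.level
        avg = sum(self.rewards) / len(self.rewards)
        idx = LEVELS.index(self.level)
        if self.level in PROMOTE_ABOVE and avg > PROMOTE_ABOVE[self.level]:
            self._move(idx + 1)
        elif self.level in DEMOTE_BELOW and avg < DEMOTE_BELOW[self.level]:
            self._move(idx - 1)
        return self.level

    def _move(self, idx: int) -> None:
        # Start a fresh history so the 3-episode minimum applies at the new level.
        self.level = LEVELS[idx]
        self.rewards.clear()


if __name__ == "__main__":
    curriculum = Curriculum()
    for r in [0.9, 0.9, 0.9, 0.2, 0.2, 0.2]:
        print(r, curriculum.record(r))
```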

Architecture

Data Flow: Agent → /reset → buggy_code → /step → subprocess → LLM judge → reward → Agent

  • server/: FastAPI backend acting as the OpenEnv entrypoint.
  • frontend/: React + Vite frontend for live monitoring and manual intervention.
  • tasks/: Task definitions stored in OpenEnv-compatible JSON schema.
  • inference.py: CLI runner for evaluating RL agents, supporting both OpenAI-compatible APIs and native HuggingFace transformers pipelines.
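
From the agent's side, the data flow above amounts to a short client loop. The sketch below is written under stated assumptions: only the /reset and /step endpoints, the {"task_id": "auto"} payload, and the name buggy_code appear in this README. The /step payload key "code", the response fields "reward" and "done", and the helper names run_episode and http_post are hypothetical, so check server/ and openenv.yaml for the real schema.

```python
import json
from typing import Callable
from urllib import request

# A callable that sends a JSON payload to a path and returns the decoded JSON reply.
Post = Callable[[str, dict], dict]


def http_post(base_url: str) -> Post:
    """Build a Post callable backed by urllib (standard library only)."""
    def post(path: str, payload: dict) -> dict:
        req = request.Request(
            base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with request.urlopen(req) as resp:
            return json.load(resp)
    return post


def run_episode(post: Post, propose_fix: Callable[[str], str],
                task_id: str = "auto", max_steps: int = 5) -> list[float]:
    """Reset the environment, then submit fixes until done or max_steps is reached."""
    observation = post("/reset", {"task_id": task_id})
    code = observation["buggy_code"]              # field name assumed from the data flow
    rewards = []
    for _ in range(max_steps):
        candidate = propose_fix(code)
        result = post("/step", {"code": candidate})   # payload key is hypothetical
        rewards.append(result["reward"])              # response field is hypothetical
        if result.get("done"):                        # response field is hypothetical
            break
        code = candidate
    return rewards
```

With the backend running locally, a call might look like run_episode(http_post("http://localhost:7860"), my_agent), where my_agent is any function that maps source code to a proposed fix. Passing the transport in as a callable keeps the loop testable without a live server.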

Results

Reward Curve: episode reward over training steps, with a rolling 10-step average.

Reward by Task: average reward per task category.

| Model | Easy | Medium | Hard | Avg |
|---|---|---|---|---|
| GPT-4o | - | - | - | - |
| Qwen-72B | - | - | - | - |
| Llama-3-8B | - | - | - | - |

Why It Matters

Every production coding AI needs to debug, not just write. There is no other standardized RL environment that trains and benchmarks iterative repair. The hybrid grader (deterministic test execution plus LLM quality judgment) means agents cannot game the reward by memorising solutions or producing syntactically correct but semantically wrong fixes.

Setup

  1. Install Dependencies:

    pip install -r requirements.txt
    cd frontend && npm install
    
  2. Generate New Tasks: To populate the extended task categories (type_errors and security_bugs), run the task generator. This must be run first or the new task categories won't exist.

    python create_tasks.py
    

Usage

1. Run the Backend Server

The server is required for both the frontend dashboard and RL training.

uvicorn server.app:app --port 7860

2. Run the Frontend Dashboard

cd frontend
npm run dev

Navigate to http://localhost:3000 to access the live RL monitoring dashboard.

3. Run Inference Evaluation

You can evaluate a local agent or pipeline programmatically via inference.py.

Using OpenAI-Compatible Endpoints (e.g., Ollama or vLLM):

export API_BASE_URL="http://localhost:11434/v1"
export MODEL_NAME="codellama"
python inference.py --backend openai

Using HuggingFace Transformers (Local pipeline):

export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B"
python inference.py --backend hf

Reward Analysis

As your agent interacts with the environment, inference logs are automatically written to rewards_log.csv. To visualize the reward curves over training steps and average rewards by task category, run:

python plot_rewards.py

This generates reward_curve.png and reward_by_task.png in the results/ directory.
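
For a quick look at the log without plotting, the two summaries (a rolling average over steps and a mean reward per task category) can be computed with the standard library. This is a sketch under assumptions, not a description of plot_rewards.py: the column names task_category and reward are hypothetical, so adjust them to match the header of your rewards_log.csv.

```python
import csv
from collections import defaultdict


def rolling_mean(values: list[float], window: int = 10) -> list[float]:
    """Mean of the last `window` values at each position (shorter at the start)."""
    out = []
    total = 0.0
    for i, value in enumerate(values):
        total += value
        if i >= window:
            total -= values[i - window]   # drop the value that left the window
        out.append(total / min(i + 1, window))
    return out


def mean_by_task(rows: list[dict]) -> dict[str, float]:
    """Average reward per task category; column names are assumptions."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["task_category"]].append(float(row["reward"]))
    return {task: sum(vals) / len(vals) for task, vals in groups.items()}


def load_rows(fileobj) -> list[dict]:
    """Read CSV rows as dictionaries keyed by the header line."""
    return list(csv.DictReader(fileobj))


if __name__ == "__main__":
    with open("rewards_log.csv", newline="") as f:
        rows = load_rows(f)
    print(rolling_mean([float(r["reward"]) for r in rows]))
    print(mean_by_task(rows))
```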

OpenEnv Compatibility

This benchmark strictly adheres to the OpenEnv specification. See openenv.yaml for full configuration details.

Links