havinashpatil committed on
Commit
a448db8
·
1 Parent(s): 03defc2

Complete all tasks: Adaptive curriculum, GRPO, React frontend, LLM-as-a-judge

README.md CHANGED
@@ -1,86 +1,80 @@
1
- ---
2
- title: CodeArena RL Agent
3
- emoji: 🤖
4
- colorFrom: blue
5
- colorTo: purple
6
- sdk: docker
7
- pinned: false
8
- ---
9
 
10
- # CodeArena: RL Benchmark for Autonomous Code Repair
11
 
12
- CodeArena is an OpenEnv-compatible reinforcement learning benchmark for testing the capability of autonomous agents to debug, fix, and optimize broken code.
13
 
14
- ## Environment Description
 
 
 
 
 
 
 
 
15
 
16
- The environment tests the agent on 3 difficulties of tasks:
17
- 1. **Easy**: Correcting syntax errors.
18
- 2. **Medium**: Fixing logical bugs.
19
- 3. **Hard**: Algorithm and efficiency optimization.
20
 
21
- The agent interacts with the environment by receiving observations of the buggy code and submitting proposed fixes. Execution runs in a sandboxed subprocess.
 
 
 
22
 
23
- ### Observation Format
24
- `buggy_code` (string): The current state of the source code.
25
- `error_log` (string): Standard error output or runtime exceptions from previous attempts.
26
- `test_results` (string): Count of passed vs total unit tests.
27
- `previous_attempts` (list of strings): Complete history of fixes proposed during the episode.
28
 
29
- ### Action Format
30
- `proposed_fix` (string): The complete raw Python code to overwrite the buggy file.
 
 
 
31
 
32
- ### Reward Function
33
- The reward dynamically evaluates partial success bounded universally between 0.0 and 1.0:
34
- - `0.3 * compile_score`: Full points if code compiles successfully.
35
- - `0.4 * test_pass_ratio`: Proportional points based on the number of passed unit tests.
36
- - `0.3 * efficiency_score`: Proportional points based on the execution speed relative to an established optimal algorithmic runtime. (Efficiency is only considered if all tests pass).
37
 
38
- ## API Endpoints
39
 
40
- | Method | Path | Description |
41
- |--------|----------|--------------------------------------|
42
- | POST | `/reset` | Reset env. Body: `{"task_id":"easy"}`|
43
- | POST | `/step` | Submit fix. Body: `{"proposed_fix":"..."}` |
44
- | GET | `/state` | Get current observation |
45
- | GET | `/` | Health check |
46
-
47
- ## Setup Instructions
48
 
49
- ### Local Setup
50
  ```bash
51
- python -m venv venv
52
- source venv/bin/activate
53
- pip install -r requirements.txt
54
- uvicorn server.app:app --reload --port 7860
55
  ```
 
 
 
 
56
 
57
- ### Docker Build & Run
58
  ```bash
59
- docker build -t codearena .
60
- docker run -p 7860:7860 codearena
 
61
  ```
62
 
63
- ### Test the /reset endpoint
64
  ```bash
65
- curl -X POST http://localhost:7860/reset \
66
- -H "Content-Type: application/json" \
67
- -d '{"task_id": "easy"}'
68
  ```
69
 
70
- ## Example Inference Run
71
 
72
- To test the environment with OpenAI's API:
 
73
  ```bash
74
- export OPENAI_API_KEY="sk-..."
75
- python inference.py
76
  ```
 
77
 
78
- The script will produce structured logging:
79
- ```
80
- [START] Initializing CodeArena inference logging
81
- [STEP] Beginning Step 1
82
- [STEP] Action taken. Reward received: 0.700. Task ID: easy-1
83
- [STEP] Beginning Step 2
84
- [STEP] Action taken. Reward received: 1.000. Task ID: easy-1
85
- [END] Inference Complete. Executed 2 step(s).
86
- ```
 
1
+ # CodeArena RL Benchmark
 
 
 
 
 
 
 
2
 
3
+ CodeArena is an OpenEnv-compatible reinforcement learning benchmark for autonomous code repair. In this environment, an agent receives buggy Python code, proposes fixes, and is iteratively evaluated based on test execution feedback and LLM-based quality metrics.
4
 
5
+ ## Features
6
 
7
+ - **Adaptive Curriculum**: The environment supports an `auto` difficulty mode that dynamically scales task complexity (`easy`, `medium`, `hard`) based on the agent's recent rolling average rewards.
8
+ - **Complex Shaped Rewards**: Rewards are a weighted composite of:
9
+ - `compile_score` (0.2)
10
+ - `test_pass_ratio` (0.4)
11
+ - `efficiency_score` (0.1)
12
+ - `llm_judge_score` (0.3): Correctness, Security, and Code Quality evaluated via LLM-as-a-judge.
13
+ - **Novelty & Step Penalties**: The agent receives penalties for repeating identical failed fixes or taking too many steps.
14
+ - **Extensive Task Categories**: Includes standard algorithmic tasks, `type_errors`, and `security_bugs`.
15
+ - **Live React Frontend**: Connect a local LLM (like Ollama) or HuggingFace models to interactively visualize step-by-step progress, execution outputs, and live reward components.
16
 
17
+ ## Architecture
 
 
 
18
 
19
+ - `server/`: FastAPI backend acting as the OpenEnv entrypoint. Handles state, execution sandbox (`executor.py`), and reward grading (`grader.py`).
20
+ - `frontend/`: React + Vite frontend for live monitoring and manual intervention.
21
+ - `tasks/`: Task definitions stored in OpenEnv-compatible JSON schema.
22
+ - `inference.py`: CLI runner for evaluating RL agents, supporting both OpenAI-compatible APIs and native HuggingFace `transformers` pipelines.
23
 
24
+ ## Setup
 
 
 
 
25
 
26
+ 1. **Install Dependencies:**
27
+ ```bash
28
+ pip install -r requirements.txt
29
+ cd frontend && npm install
30
+ ```
31
 
32
+ 2. **Generate New Tasks:**
33
+ To populate the extended task categories (`type_errors` and `security_bugs`), run:
34
+ ```bash
35
+ python create_tasks.py
36
+ ```
37
 
38
+ ## Usage
39
 
40
+ ### 1. Run the Backend Server
41
+ The server is required for both the frontend dashboard and RL training.
42
+ ```bash
43
+ uvicorn server.app:app --port 7860
44
+ ```
 
 
 
45
 
46
+ ### 2. Run the Frontend Dashboard
47
  ```bash
48
+ cd frontend
49
+ npm run dev
 
 
50
  ```
51
+ Navigate to `http://localhost:3000` to access the live RL monitoring dashboard.
52
+
53
+ ### 3. Run Inference Evaluation
54
+ You can evaluate a local agent or pipeline programmatically via `inference.py`.
55
 
56
+ **Using OpenAI-Compatible Endpoints (e.g., Ollama or vLLM):**
57
  ```bash
58
+ export API_BASE_URL="http://localhost:11434/v1"
59
+ export MODEL_NAME="codellama"
60
+ python inference.py --backend openai
61
  ```
62
 
63
+ **Using HuggingFace Transformers (Local pipeline):**
64
  ```bash
65
+ export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B"
66
+ python inference.py --backend hf
 
67
  ```
68
 
69
+ ## Reward Analysis
70
 
71
+ As your agent interacts with the environment, inference logs are automatically written to `rewards_log.csv`.
72
+ To visualize the reward curves over training steps and average rewards by task category, run:
73
  ```bash
74
+ python plot_rewards.py
 
75
  ```
76
+ This generates `reward_curve.png` and `reward_by_task.png` in the `results/` directory.
77
 
78
+ ## OpenEnv Compatibility
79
+
80
+ This benchmark strictly adheres to the OpenEnv specification. See `openenv.yaml` for full configuration details.
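Reviewer note on the reward weights: the Features list in the new README gives four component weights (0.2, 0.4, 0.1, 0.3), which sum to 1.0. The sketch below shows one way such a composite could be computed. It is not taken from `grader.py`: the function name `composite_reward`, the `penalty` argument, and the clamp to [0, 1] are assumptions made for illustration, because the README does not say how the novelty and step penalties are applied.

```python
from typing import Mapping

# Weights as listed in the new README's Features section.
WEIGHTS = {
    "compile_score": 0.2,
    "test_pass_ratio": 0.4,
    "efficiency_score": 0.1,
    "llm_judge_score": 0.3,
}


def composite_reward(components: Mapping[str, float], penalty: float = 0.0) -> float:
    """Weighted sum of component scores, minus a penalty, clamped to [0, 1].

    Missing components count as 0.0. The penalty and the clamp are
    assumptions; the README does not specify either.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = components.get(name, 0.0)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total += weight * value
    return max(0.0, min(1.0, total - penalty))
```

Under these assumptions, a fix that compiles but passes no tests and gets no judge credit scores 0.2, and a repeated failed fix could be pushed toward 0.0 by a non-zero `penalty`.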
 
 
 
 
 
 
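Reviewer note on the `auto` difficulty mode: the README says difficulty scales across `easy`, `medium`, and `hard` based on a rolling average of recent rewards, but gives no window size or thresholds. The sketch below is one plausible reading, not the server's implementation. The class name `AdaptiveCurriculum`, the window of 5, and the 0.8 / 0.3 thresholds are all hypothetical.

```python
from collections import deque

LEVELS = ["easy", "medium", "hard"]


class AdaptiveCurriculum:
    """Moves between difficulty levels based on a rolling mean of rewards."""

    def __init__(self, window: int = 5, promote_at: float = 0.8, demote_at: float = 0.3):
        # Hypothetical defaults; the README does not state real values.
        self.rewards = deque(maxlen=window)
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.level = 0

    def record(self, reward: float) -> str:
        """Record one episode reward and return the difficulty for the next episode."""
        self.rewards.append(reward)
        # Only act once the window is full, so one lucky episode cannot promote.
        if len(self.rewards) == self.rewards.maxlen:
            average = sum(self.rewards) / len(self.rewards)
            if average >= self.promote_at and self.level < len(LEVELS) - 1:
                self.level += 1
                self.rewards.clear()
            elif average <= self.demote_at and self.level > 0:
                self.level -= 1
                self.rewards.clear()
        return LEVELS[self.level]
```

Clearing the window after each level change is a design choice in this sketch: it prevents rewards earned at the previous difficulty from immediately triggering another change.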
create_tasks.py ADDED
@@ -0,0 +1,92 @@
+ import os
+ import json
+
+ base_dir = "e:/meta/tasks"
+ os.makedirs(os.path.join(base_dir, "type_errors"), exist_ok=True)
+ os.makedirs(os.path.join(base_dir, "security_bugs"), exist_ok=True)
+
+ def write_task(folder, name, task_id, difficulty, desc, buggy, test):
+     py_path = os.path.join(base_dir, folder, f"{name}.py")
+     json_path = os.path.join(base_dir, folder, f"{name}.json")
+
+     py_content = f'''from server.models import TaskInfo
+
+ TASK = TaskInfo(
+     task_id="{task_id}",
+     difficulty="{difficulty}",
+     description="{desc}",
+     buggy_code="""{buggy}""",
+     test_code="""{test}""",
+     optimal_time_seconds=0.05
+ )
+ '''
+     with open(py_path, "w", encoding="utf-8") as f:
+         f.write(py_content)
+
+     json_content = {
+         "task_id": task_id,
+         "difficulty": difficulty,
+         "description": desc,
+         "buggy_code": buggy,
+         "test_code": test,
+         "optimal_time_seconds": 0.05
+     }
+     with open(json_path, "w", encoding="utf-8") as f:
+         json.dump(json_content, f, indent=2)
+
+
+ # Type Error 1
+ write_task("type_errors", "type_error_1", "type_errors-1", "type_errors",
+            "Fix the function to sum a list of numbers that might be passed as strings. It currently tries to add int and str.",
+            "def sum_all(items):\n    total = 0\n    for item in items:\n        total = total + item\n    return total",
+            "\nimport unittest\nclass TestTypeError1(unittest.TestCase):\n    def test_normal(self):\n        self.assertEqual(sum_all([1, 2, 3]), 6)\n    def test_strings(self):\n        self.assertEqual(sum_all(['1', '2', '3']), 6)\n    def test_mixed(self):\n        self.assertEqual(sum_all([1, '2', 3]), 6)\n")
+
+ # Type Error 2
+ write_task("type_errors", "type_error_2", "type_errors-2", "type_errors",
+            "Fix the function to count frequencies. It incorrectly calls .append() on a dict.",
+            "def count_frequencies(words):\n    counts = {}\n    for word in words:\n        if word not in counts:\n            counts.append({word: 1})\n        else:\n            counts[word] += 1\n    return counts",
+            "\nimport unittest\nclass TestTypeError2(unittest.TestCase):\n    def test_normal(self):\n        self.assertEqual(count_frequencies(['apple', 'banana', 'apple']), {'apple': 2, 'banana': 1})\n    def test_empty(self):\n        self.assertEqual(count_frequencies([]), {})\n")
+
+ # Type Error 3
+ write_task("type_errors", "type_error_3", "type_errors-3", "type_errors",
+            "Fix the function to format names. It incorrectly calls .upper() on an int ID.",
+            "def format_records(records):\n    formatted = []\n    for user_id, name in records:\n        formatted.append(f\"{user_id.upper()} - {name.upper()}\")\n    return formatted",
+            "\nimport unittest\nclass TestTypeError3(unittest.TestCase):\n    def test_normal(self):\n        self.assertEqual(format_records([(1, 'alice'), (2, 'bob')]), ['1 - ALICE', '2 - BOB'])\n")
+
+
+ # Security Bug 1
+ write_task("security_bugs", "security_bug_1", "security_bugs-1", "security_bugs",
+            "Fix the function to parse JSON safely without using eval().",
+            "import json\ndef parse_user_data(data_string):\n    return eval(data_string)",
+            "\nimport unittest\nimport inspect\nclass TestSecurity1(unittest.TestCase):\n    def test_normal(self):\n        self.assertEqual(parse_user_data('{\"name\": \"alice\"}'), {\"name\": \"alice\"})\n    def test_security(self):\n        source = inspect.getsource(parse_user_data)\n        self.assertNotIn(\"eval(\", source)\n")
+
+ # Security Bug 2
+ write_task("security_bugs", "security_bug_2", "security_bugs-2", "security_bugs",
+            "Remove the hardcoded secret token and load it from the os.environ dictionary as 'API_TOKEN'.",
+            "import os\ndef get_api_token():\n    token = \"secret_12345\"\n    return token",
+            "\nimport unittest\nimport inspect\nimport os\nclass TestSecurity2(unittest.TestCase):\n    def test_normal(self):\n        os.environ['API_TOKEN'] = 'my_secure_token'\n        self.assertEqual(get_api_token(), 'my_secure_token')\n    def test_security(self):\n        source = inspect.getsource(get_api_token)\n        self.assertNotIn(\"secret_12345\", source)\n")
+
+ # Security Bug 3
+ write_task("security_bugs", "security_bug_3", "security_bugs-3", "security_bugs",
+            "Fix the ping command to avoid shell injection. Use a list of arguments and shell=False.",
+            "import subprocess\ndef ping_host(host):\n    return subprocess.check_output(f\"ping -c 1 {host}\", shell=True)",
+            "\nimport unittest\nimport inspect\nclass TestSecurity3(unittest.TestCase):\n    def test_security(self):\n        source = inspect.getsource(ping_host)\n        self.assertNotIn(\"shell=True\", source.replace(\" \", \"\"))\n        self.assertIn(\"[\", source)\n")
+
+ # Rewrite __init__.py
+ init_content = """from .easy import EASY_TASK
+ from .medium import MEDIUM_TASK
+ from .hard import HARD_TASK
+ from .type_errors.type_error_1 import TASK as TE1
+ from .type_errors.type_error_2 import TASK as TE2
+ from .type_errors.type_error_3 import TASK as TE3
+ from .security_bugs.security_bug_1 import TASK as SB1
+ from .security_bugs.security_bug_2 import TASK as SB2
+ from .security_bugs.security_bug_3 import TASK as SB3
+
+ ALL_TASKS = [EASY_TASK, MEDIUM_TASK, HARD_TASK, TE1, TE2, TE3, SB1, SB2, SB3]
+ """
+
+ with open(os.path.join(base_dir, "__init__.py"), "w", encoding="utf-8") as f:
+     f.write(init_content)
+
+ print("Tasks generated successfully!")
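Reviewer note on `create_tasks.py`: the script writes to the hard-coded path `e:/meta/tasks`, so it only works on a machine with that drive layout. It also builds each task module by pasting values between literal quotes, which would produce invalid Python if a description ever contained a double quote. The sketch below is one possible alternative, not the committed code. It takes the base directory as an argument and quotes every value with `repr()`. The helper names `render_task_module` and `write_task` (with this signature) are hypothetical; the `TaskInfo` field names are taken from the script above.

```python
import json
from pathlib import Path

# Field names as used by the TaskInfo(...) call in create_tasks.py.
FIELDS = (
    "task_id",
    "difficulty",
    "description",
    "buggy_code",
    "test_code",
    "optimal_time_seconds",
)


def render_task_module(task: dict) -> str:
    """Render a task module; repr() keeps quotes, newlines, and backslashes valid."""
    lines = ["from server.models import TaskInfo", "", "TASK = TaskInfo("]
    lines += [f"    {field}={task[field]!r}," for field in FIELDS]
    lines += [")", ""]
    return "\n".join(lines)


def write_task(base_dir, folder: str, name: str, task: dict) -> Path:
    """Write <name>.py and <name>.json under base_dir/folder; return the JSON path."""
    target = Path(base_dir) / folder
    target.mkdir(parents=True, exist_ok=True)
    (target / f"{name}.py").write_text(render_task_module(task), encoding="utf-8")
    json_path = target / f"{name}.json"
    json_path.write_text(json.dumps(task, indent=2), encoding="utf-8")
    return json_path
```

A caller could pass something like `Path(__file__).resolve().parent / "tasks"` as `base_dir`, assuming the script sits next to the `tasks/` directory, so the generator runs the same way on any checkout.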
frontend/package-lock.json CHANGED
@@ -9,7 +9,8 @@
9
  "version": "0.0.0",
10
  "dependencies": {
11
  "react": "^19.2.5",
12
- "react-dom": "^19.2.5"
 
13
  },
14
  "devDependencies": {
15
  "@eslint/js": "^9.39.4",
@@ -264,31 +265,6 @@
264
  "node": ">=6.9.0"
265
  }
266
  },
267
- "node_modules/@emnapi/core": {
268
- "version": "1.9.2",
269
- "resolved": "https://registry.npmjs.org/@emnapi/core/-/core-1.9.2.tgz",
270
- "integrity": "sha512-UC+ZhH3XtczQYfOlu3lNEkdW/p4dsJ1r/bP7H8+rhao3TTTMO1ATq/4DdIi23XuGoFY+Cz0JmCbdVl0hz9jZcA==",
271
- "dev": true,
272
- "license": "MIT",
273
- "optional": true,
274
- "peer": true,
275
- "dependencies": {
276
- "@emnapi/wasi-threads": "1.2.1",
277
- "tslib": "^2.4.0"
278
- }
279
- },
280
- "node_modules/@emnapi/runtime": {
281
- "version": "1.9.2",
282
- "resolved": "https://registry.npmjs.org/@emnapi/runtime/-/runtime-1.9.2.tgz",
283
- "integrity": "sha512-3U4+MIWHImeyu1wnmVygh5WlgfYDtyf0k8AbLhMFxOipihf6nrWC4syIm/SwEeec0mNSafiiNnMJwbza/Is6Lw==",
284
- "dev": true,
285
- "license": "MIT",
286
- "optional": true,
287
- "peer": true,
288
- "dependencies": {
289
- "tslib": "^2.4.0"
290
- }
291
- },
292
  "node_modules/@emnapi/wasi-threads": {
293
  "version": "1.2.1",
294
  "resolved": "https://registry.npmjs.org/@emnapi/wasi-threads/-/wasi-threads-1.2.1.tgz",
@@ -602,6 +578,42 @@
602
  "url": "https://github.com/sponsors/Boshen"
603
  }
604
  },
605
  "node_modules/@rolldown/binding-android-arm64": {
606
  "version": "1.0.0-rc.16",
607
  "resolved": "https://registry.npmjs.org/@rolldown/binding-android-arm64/-/binding-android-arm64-1.0.0-rc.16.tgz",
@@ -866,6 +878,18 @@
866
  "dev": true,
867
  "license": "MIT"
868
  },
 
 
 
 
 
 
 
 
 
 
 
 
869
  "node_modules/@tybys/wasm-util": {
870
  "version": "0.10.1",
871
  "resolved": "https://registry.npmjs.org/@tybys/wasm-util/-/wasm-util-0.10.1.tgz",
@@ -877,6 +901,69 @@
877
  "tslib": "^2.4.0"
878
  }
879
  },
880
  "node_modules/@types/estree": {
881
  "version": "1.0.8",
882
  "resolved": "https://registry.npmjs.org/@types/estree/-/estree-1.0.8.tgz",
@@ -895,7 +982,7 @@
895
  "version": "19.2.14",
896
  "resolved": "https://registry.npmjs.org/@types/react/-/react-19.2.14.tgz",
897
  "integrity": "sha512-ilcTH/UniCkMdtexkoCN0bI7pMcJDvmQFPvuPvmEaYA/NSfFTAgdUSLAoVjaRJm7+6PvcM+q1zYOwS4wTYMF9w==",
898
- "dev": true,
899
  "license": "MIT",
900
  "peer": true,
901
  "dependencies": {
@@ -912,6 +999,12 @@
912
  "@types/react": "^19.2.0"
913
  }
914
  },
 
 
 
 
 
 
915
  "node_modules/@vitejs/plugin-react": {
916
  "version": "6.0.1",
917
  "resolved": "https://registry.npmjs.org/@vitejs/plugin-react/-/plugin-react-6.0.1.tgz",
@@ -1116,6 +1209,15 @@
1116
  "url": "https://github.com/chalk/chalk?sponsor=1"
1117
  }
1118
  },
 
 
 
 
 
 
 
 
 
1119
  "node_modules/color-convert": {
1120
  "version": "2.0.1",
1121
  "resolved": "https://registry.npmjs.org/color-convert/-/color-convert-2.0.1.tgz",
@@ -1169,9 +1271,130 @@
1169
  "version": "3.2.3",
1170
  "resolved": "https://registry.npmjs.org/csstype/-/csstype-3.2.3.tgz",
1171
  "integrity": "sha512-z1HGKcYy2xA8AGQfwrn0PAy+PB7X/GSj3UVJW9qKyn43xWa+gl5nXmU4qqLMRzWVLFC8KusUX8T/0kCiOYpAIQ==",
1172
- "dev": true,
1173
  "license": "MIT"
1174
  },
1175
  "node_modules/debug": {
1176
  "version": "4.4.3",
1177
  "resolved": "https://registry.npmjs.org/debug/-/debug-4.4.3.tgz",
@@ -1190,6 +1413,12 @@
1190
  }
1191
  }
1192
  },
 
 
 
 
 
 
1193
  "node_modules/deep-is": {
1194
  "version": "0.1.4",
1195
  "resolved": "https://registry.npmjs.org/deep-is/-/deep-is-0.1.4.tgz",
@@ -1214,6 +1443,16 @@
1214
  "dev": true,
1215
  "license": "ISC"
1216
  },
 
 
 
 
 
 
 
 
 
 
1217
  "node_modules/escalade": {
1218
  "version": "3.2.0",
1219
  "resolved": "https://registry.npmjs.org/escalade/-/escalade-3.2.0.tgz",
@@ -1422,6 +1661,12 @@
1422
  "node": ">=0.10.0"
1423
  }
1424
  },
 
 
 
 
 
 
1425
  "node_modules/fast-deep-equal": {
1426
  "version": "3.1.3",
1427
  "resolved": "https://registry.npmjs.org/fast-deep-equal/-/fast-deep-equal-3.1.3.tgz",
@@ -1600,6 +1845,16 @@
1600
  "node": ">= 4"
1601
  }
1602
  },
 
 
 
 
 
 
 
 
 
 
1603
  "node_modules/import-fresh": {
1604
  "version": "3.3.1",
1605
  "resolved": "https://registry.npmjs.org/import-fresh/-/import-fresh-3.3.1.tgz",
@@ -1627,6 +1882,15 @@
1627
  "node": ">=0.8.19"
1628
  }
1629
  },
 
 
 
 
 
 
 
 
 
1630
  "node_modules/is-extglob": {
1631
  "version": "2.1.1",
1632
  "resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-2.1.1.tgz",
@@ -2263,6 +2527,7 @@
2263
  "resolved": "https://registry.npmjs.org/react-dom/-/react-dom-19.2.5.tgz",
2264
  "integrity": "sha512-J5bAZz+DXMMwW/wV3xzKke59Af6CHY7G4uYLN1OvBcKEsWOs4pQExj86BBKamxl/Ik5bx9whOrvBlSDfWzgSag==",
2265
  "license": "MIT",
 
2266
  "dependencies": {
2267
  "scheduler": "^0.27.0"
2268
  },
@@ -2270,6 +2535,89 @@
2270
  "react": "^19.2.5"
2271
  }
2272
  },
2273
  "node_modules/resolve-from": {
2274
  "version": "4.0.0",
2275
  "resolved": "https://registry.npmjs.org/resolve-from/-/resolve-from-4.0.0.tgz",
@@ -2396,6 +2744,12 @@
2396
  "node": ">=8"
2397
  }
2398
  },
 
 
 
 
 
 
2399
  "node_modules/tinyglobby": {
2400
  "version": "0.2.16",
2401
  "resolved": "https://registry.npmjs.org/tinyglobby/-/tinyglobby-0.2.16.tgz",
@@ -2475,6 +2829,37 @@
2475
  "punycode": "^2.1.0"
2476
  }
2477
  },
2478
  "node_modules/vite": {
2479
  "version": "8.0.9",
2480
  "resolved": "https://registry.npmjs.org/vite/-/vite-8.0.9.tgz",
 
9
  "version": "0.0.0",
10
  "dependencies": {
11
  "react": "^19.2.5",
12
+ "react-dom": "^19.2.5",
13
+ "recharts": "^3.8.1"
14
  },
15
  "devDependencies": {
16
  "@eslint/js": "^9.39.4",
 
265
  "node": ">=6.9.0"
266
  }
267
  },
268
  "node_modules/@emnapi/wasi-threads": {
269
  "version": "1.2.1",
270
  "resolved": "https://registry.npmjs.org/@emnapi/wasi-threads/-/wasi-threads-1.2.1.tgz",
 
578
  "url": "https://github.com/sponsors/Boshen"
579
  }
580
  },
581
+ "node_modules/@reduxjs/toolkit": {
582
+ "version": "2.11.2",
583
+ "resolved": "https://registry.npmjs.org/@reduxjs/toolkit/-/toolkit-2.11.2.tgz",
584
+ "integrity": "sha512-Kd6kAHTA6/nUpp8mySPqj3en3dm0tdMIgbttnQ1xFMVpufoj+ADi8pXLBsd4xzTRHQa7t/Jv8W5UnCuW4kuWMQ==",
585
+ "license": "MIT",
586
+ "dependencies": {
587
+ "@standard-schema/spec": "^1.0.0",
588
+ "@standard-schema/utils": "^0.3.0",
589
+ "immer": "^11.0.0",
590
+ "redux": "^5.0.1",
591
+ "redux-thunk": "^3.1.0",
592
+ "reselect": "^5.1.0"
593
+ },
594
+ "peerDependencies": {
595
+ "react": "^16.9.0 || ^17.0.0 || ^18 || ^19",
596
+ "react-redux": "^7.2.1 || ^8.1.3 || ^9.0.0"
597
+ },
598
+ "peerDependenciesMeta": {
599
+ "react": {
600
+ "optional": true
601
+ },
602
+ "react-redux": {
603
+ "optional": true
604
+ }
605
+ }
606
+ },
607
+ "node_modules/@reduxjs/toolkit/node_modules/immer": {
608
+ "version": "11.1.4",
609
+ "resolved": "https://registry.npmjs.org/immer/-/immer-11.1.4.tgz",
610
+ "integrity": "sha512-XREFCPo6ksxVzP4E0ekD5aMdf8WMwmdNaz6vuvxgI40UaEiu6q3p8X52aU6GdyvLY3XXX/8R7JOTXStz/nBbRw==",
611
+ "license": "MIT",
612
+ "funding": {
613
+ "type": "opencollective",
614
+ "url": "https://opencollective.com/immer"
615
+ }
616
+ },
617
  "node_modules/@rolldown/binding-android-arm64": {
618
  "version": "1.0.0-rc.16",
619
  "resolved": "https://registry.npmjs.org/@rolldown/binding-android-arm64/-/binding-android-arm64-1.0.0-rc.16.tgz",
 
878
  "dev": true,
879
  "license": "MIT"
880
  },
881
+ "node_modules/@standard-schema/spec": {
882
+ "version": "1.1.0",
883
+ "resolved": "https://registry.npmjs.org/@standard-schema/spec/-/spec-1.1.0.tgz",
884
+ "integrity": "sha512-l2aFy5jALhniG5HgqrD6jXLi/rUWrKvqN/qJx6yoJsgKhblVd+iqqU4RCXavm/jPityDo5TCvKMnpjKnOriy0w==",
885
+ "license": "MIT"
886
+ },
887
+ "node_modules/@standard-schema/utils": {
888
+ "version": "0.3.0",
889
+ "resolved": "https://registry.npmjs.org/@standard-schema/utils/-/utils-0.3.0.tgz",
890
+ "integrity": "sha512-e7Mew686owMaPJVNNLs55PUvgz371nKgwsc4vxE49zsODpJEnxgxRo2y/OKrqueavXgZNMDVj3DdHFlaSAeU8g==",
891
+ "license": "MIT"
892
+ },
893
  "node_modules/@tybys/wasm-util": {
894
  "version": "0.10.1",
895
  "resolved": "https://registry.npmjs.org/@tybys/wasm-util/-/wasm-util-0.10.1.tgz",
 
901
  "tslib": "^2.4.0"
902
  }
903
  },
904
+ "node_modules/@types/d3-array": {
905
+ "version": "3.2.2",
906
+ "resolved": "https://registry.npmjs.org/@types/d3-array/-/d3-array-3.2.2.tgz",
907
+ "integrity": "sha512-hOLWVbm7uRza0BYXpIIW5pxfrKe0W+D5lrFiAEYR+pb6w3N2SwSMaJbXdUfSEv+dT4MfHBLtn5js0LAWaO6otw==",
908
+ "license": "MIT"
909
+ },
910
+ "node_modules/@types/d3-color": {
911
+ "version": "3.1.3",
912
+ "resolved": "https://registry.npmjs.org/@types/d3-color/-/d3-color-3.1.3.tgz",
913
+ "integrity": "sha512-iO90scth9WAbmgv7ogoq57O9YpKmFBbmoEoCHDB2xMBY0+/KVrqAaCDyCE16dUspeOvIxFFRI+0sEtqDqy2b4A==",
914
+ "license": "MIT"
915
+ },
916
+ "node_modules/@types/d3-ease": {
917
+ "version": "3.0.2",
918
+ "resolved": "https://registry.npmjs.org/@types/d3-ease/-/d3-ease-3.0.2.tgz",
919
+ "integrity": "sha512-NcV1JjO5oDzoK26oMzbILE6HW7uVXOHLQvHshBUW4UMdZGfiY6v5BeQwh9a9tCzv+CeefZQHJt5SRgK154RtiA==",
920
+ "license": "MIT"
921
+ },
922
+ "node_modules/@types/d3-interpolate": {
923
+ "version": "3.0.4",
924
+ "resolved": "https://registry.npmjs.org/@types/d3-interpolate/-/d3-interpolate-3.0.4.tgz",
925
+ "integrity": "sha512-mgLPETlrpVV1YRJIglr4Ez47g7Yxjl1lj7YKsiMCb27VJH9W8NVM6Bb9d8kkpG/uAQS5AmbA48q2IAolKKo1MA==",
926
+ "license": "MIT",
927
+ "dependencies": {
928
+ "@types/d3-color": "*"
929
+ }
930
+ },
931
+ "node_modules/@types/d3-path": {
932
+ "version": "3.1.1",
933
+ "resolved": "https://registry.npmjs.org/@types/d3-path/-/d3-path-3.1.1.tgz",
934
+ "integrity": "sha512-VMZBYyQvbGmWyWVea0EHs/BwLgxc+MKi1zLDCONksozI4YJMcTt8ZEuIR4Sb1MMTE8MMW49v0IwI5+b7RmfWlg==",
935
+ "license": "MIT"
936
+ },
937
+ "node_modules/@types/d3-scale": {
938
+ "version": "4.0.9",
939
+ "resolved": "https://registry.npmjs.org/@types/d3-scale/-/d3-scale-4.0.9.tgz",
940
+ "integrity": "sha512-dLmtwB8zkAeO/juAMfnV+sItKjlsw2lKdZVVy6LRr0cBmegxSABiLEpGVmSJJ8O08i4+sGR6qQtb6WtuwJdvVw==",
941
+ "license": "MIT",
942
+ "dependencies": {
943
+ "@types/d3-time": "*"
944
+ }
945
+ },
946
+ "node_modules/@types/d3-shape": {
947
+ "version": "3.1.8",
948
+ "resolved": "https://registry.npmjs.org/@types/d3-shape/-/d3-shape-3.1.8.tgz",
949
+ "integrity": "sha512-lae0iWfcDeR7qt7rA88BNiqdvPS5pFVPpo5OfjElwNaT2yyekbM0C9vK+yqBqEmHr6lDkRnYNoTBYlAgJa7a4w==",
950
+ "license": "MIT",
951
+ "dependencies": {
952
+ "@types/d3-path": "*"
953
+ }
954
+ },
955
+ "node_modules/@types/d3-time": {
956
+ "version": "3.0.4",
957
+ "resolved": "https://registry.npmjs.org/@types/d3-time/-/d3-time-3.0.4.tgz",
958
+ "integrity": "sha512-yuzZug1nkAAaBlBBikKZTgzCeA+k1uy4ZFwWANOfKw5z5LRhV0gNA7gNkKm7HoK+HRN0wX3EkxGk0fpbWhmB7g==",
959
+ "license": "MIT"
960
+ },
961
+ "node_modules/@types/d3-timer": {
962
+ "version": "3.0.2",
963
+ "resolved": "https://registry.npmjs.org/@types/d3-timer/-/d3-timer-3.0.2.tgz",
964
+ "integrity": "sha512-Ps3T8E8dZDam6fUyNiMkekK3XUsaUEik+idO9/YjPtfj2qruF8tFBXS7XhtE4iIXBLxhmLjP3SXpLhVf21I9Lw==",
965
+ "license": "MIT"
966
+ },
967
  "node_modules/@types/estree": {
968
  "version": "1.0.8",
969
  "resolved": "https://registry.npmjs.org/@types/estree/-/estree-1.0.8.tgz",
 
982
  "version": "19.2.14",
983
  "resolved": "https://registry.npmjs.org/@types/react/-/react-19.2.14.tgz",
984
  "integrity": "sha512-ilcTH/UniCkMdtexkoCN0bI7pMcJDvmQFPvuPvmEaYA/NSfFTAgdUSLAoVjaRJm7+6PvcM+q1zYOwS4wTYMF9w==",
985
+ "devOptional": true,
986
  "license": "MIT",
987
  "peer": true,
988
  "dependencies": {
 
999
  "@types/react": "^19.2.0"
1000
  }
1001
  },
1002
+ "node_modules/@types/use-sync-external-store": {
1003
+ "version": "0.0.6",
1004
+ "resolved": "https://registry.npmjs.org/@types/use-sync-external-store/-/use-sync-external-store-0.0.6.tgz",
1005
+ "integrity": "sha512-zFDAD+tlpf2r4asuHEj0XH6pY6i0g5NeAHPn+15wk3BV6JA69eERFXC1gyGThDkVa1zCyKr5jox1+2LbV/AMLg==",
1006
+ "license": "MIT"
1007
+ },
1008
  "node_modules/@vitejs/plugin-react": {
1009
  "version": "6.0.1",
1010
  "resolved": "https://registry.npmjs.org/@vitejs/plugin-react/-/plugin-react-6.0.1.tgz",
 
1209
  "url": "https://github.com/chalk/chalk?sponsor=1"
1210
  }
1211
  },
1212
+ "node_modules/clsx": {
1213
+ "version": "2.1.1",
1214
+ "resolved": "https://registry.npmjs.org/clsx/-/clsx-2.1.1.tgz",
1215
+ "integrity": "sha512-eYm0QWBtUrBWZWG0d386OGAw16Z995PiOVo2B7bjWSbHedGl5e0ZWaq65kOGgUSNesEIDkB9ISbTg/JK9dhCZA==",
1216
+ "license": "MIT",
1217
+ "engines": {
1218
+ "node": ">=6"
1219
+ }
1220
+ },
1221
  "node_modules/color-convert": {
1222
  "version": "2.0.1",
1223
  "resolved": "https://registry.npmjs.org/color-convert/-/color-convert-2.0.1.tgz",
 
1271
  "version": "3.2.3",
1272
  "resolved": "https://registry.npmjs.org/csstype/-/csstype-3.2.3.tgz",
1273
  "integrity": "sha512-z1HGKcYy2xA8AGQfwrn0PAy+PB7X/GSj3UVJW9qKyn43xWa+gl5nXmU4qqLMRzWVLFC8KusUX8T/0kCiOYpAIQ==",
1274
+ "devOptional": true,
1275
  "license": "MIT"
1276
  },
1277
+ "node_modules/d3-array": {
1278
+ "version": "3.2.4",
1279
+ "resolved": "https://registry.npmjs.org/d3-array/-/d3-array-3.2.4.tgz",
1280
+ "integrity": "sha512-tdQAmyA18i4J7wprpYq8ClcxZy3SC31QMeByyCFyRt7BVHdREQZ5lpzoe5mFEYZUWe+oq8HBvk9JjpibyEV4Jg==",
1281
+ "license": "ISC",
1282
+ "dependencies": {
1283
+ "internmap": "1 - 2"
1284
+ },
1285
+ "engines": {
1286
+ "node": ">=12"
1287
+ }
1288
+ },
1289
+ "node_modules/d3-color": {
1290
+ "version": "3.1.0",
1291
+ "resolved": "https://registry.npmjs.org/d3-color/-/d3-color-3.1.0.tgz",
1292
+ "integrity": "sha512-zg/chbXyeBtMQ1LbD/WSoW2DpC3I0mpmPdW+ynRTj/x2DAWYrIY7qeZIHidozwV24m4iavr15lNwIwLxRmOxhA==",
1293
+ "license": "ISC",
1294
+ "engines": {
1295
+ "node": ">=12"
1296
+ }
1297
+ },
1298
+ "node_modules/d3-ease": {
1299
+ "version": "3.0.1",
1300
+ "resolved": "https://registry.npmjs.org/d3-ease/-/d3-ease-3.0.1.tgz",
1301
+ "integrity": "sha512-wR/XK3D3XcLIZwpbvQwQ5fK+8Ykds1ip7A2Txe0yxncXSdq1L9skcG7blcedkOX+ZcgxGAmLX1FrRGbADwzi0w==",
1302
+ "license": "BSD-3-Clause",
1303
+ "engines": {
1304
+ "node": ">=12"
1305
+ }
1306
+ },
1307
+ "node_modules/d3-format": {
1308
+ "version": "3.1.2",
1309
+ "resolved": "https://registry.npmjs.org/d3-format/-/d3-format-3.1.2.tgz",
1310
+ "integrity": "sha512-AJDdYOdnyRDV5b6ArilzCPPwc1ejkHcoyFarqlPqT7zRYjhavcT3uSrqcMvsgh2CgoPbK3RCwyHaVyxYcP2Arg==",
1311
+ "license": "ISC",
1312
+ "engines": {
1313
+ "node": ">=12"
1314
+ }
1315
+ },
1316
+ "node_modules/d3-interpolate": {
1317
+ "version": "3.0.1",
1318
+ "resolved": "https://registry.npmjs.org/d3-interpolate/-/d3-interpolate-3.0.1.tgz",
1319
+ "integrity": "sha512-3bYs1rOD33uo8aqJfKP3JWPAibgw8Zm2+L9vBKEHJ2Rg+viTR7o5Mmv5mZcieN+FRYaAOWX5SJATX6k1PWz72g==",
1320
+ "license": "ISC",
1321
+ "dependencies": {
1322
+ "d3-color": "1 - 3"
1323
+ },
1324
+ "engines": {
1325
+ "node": ">=12"
1326
+ }
1327
+ },
1328
+ "node_modules/d3-path": {
1329
+ "version": "3.1.0",
1330
+ "resolved": "https://registry.npmjs.org/d3-path/-/d3-path-3.1.0.tgz",
1331
+ "integrity": "sha512-p3KP5HCf/bvjBSSKuXid6Zqijx7wIfNW+J/maPs+iwR35at5JCbLUT0LzF1cnjbCHWhqzQTIN2Jpe8pRebIEFQ==",
1332
+ "license": "ISC",
1333
+ "engines": {
1334
+ "node": ">=12"
1335
+ }
1336
+ },
1337
+ "node_modules/d3-scale": {
1338
+ "version": "4.0.2",
1339
+ "resolved": "https://registry.npmjs.org/d3-scale/-/d3-scale-4.0.2.tgz",
1340
+ "integrity": "sha512-GZW464g1SH7ag3Y7hXjf8RoUuAFIqklOAq3MRl4OaWabTFJY9PN/E1YklhXLh+OQ3fM9yS2nOkCoS+WLZ6kvxQ==",
1341
+ "license": "ISC",
1342
+ "dependencies": {
1343
+ "d3-array": "2.10.0 - 3",
1344
+ "d3-format": "1 - 3",
1345
+ "d3-interpolate": "1.2.0 - 3",
1346
+ "d3-time": "2.1.1 - 3",
1347
+ "d3-time-format": "2 - 4"
1348
+ },
1349
+ "engines": {
1350
+ "node": ">=12"
1351
+ }
1352
+ },
1353
+ "node_modules/d3-shape": {
1354
+ "version": "3.2.0",
1355
+ "resolved": "https://registry.npmjs.org/d3-shape/-/d3-shape-3.2.0.tgz",
1356
+ "integrity": "sha512-SaLBuwGm3MOViRq2ABk3eLoxwZELpH6zhl3FbAoJ7Vm1gofKx6El1Ib5z23NUEhF9AsGl7y+dzLe5Cw2AArGTA==",
1357
+ "license": "ISC",
1358
+ "dependencies": {
1359
+ "d3-path": "^3.1.0"
1360
+ },
1361
+ "engines": {
1362
+ "node": ">=12"
1363
+ }
1364
+ },
1365
+ "node_modules/d3-time": {
1366
+ "version": "3.1.0",
1367
+ "resolved": "https://registry.npmjs.org/d3-time/-/d3-time-3.1.0.tgz",
1368
+ "integrity": "sha512-VqKjzBLejbSMT4IgbmVgDjpkYrNWUYJnbCGo874u7MMKIWsILRX+OpX/gTk8MqjpT1A/c6HY2dCA77ZN0lkQ2Q==",
1369
+ "license": "ISC",
1370
+ "dependencies": {
1371
+ "d3-array": "2 - 3"
1372
+ },
1373
+ "engines": {
1374
+ "node": ">=12"
1375
+ }
1376
+ },
1377
+ "node_modules/d3-time-format": {
1378
+ "version": "4.1.0",
1379
+ "resolved": "https://registry.npmjs.org/d3-time-format/-/d3-time-format-4.1.0.tgz",
1380
+ "integrity": "sha512-dJxPBlzC7NugB2PDLwo9Q8JiTR3M3e4/XANkreKSUxF8vvXKqm1Yfq4Q5dl8budlunRVlUUaDUgFt7eA8D6NLg==",
1381
+ "license": "ISC",
1382
+ "dependencies": {
1383
+ "d3-time": "1 - 3"
1384
+ },
1385
+ "engines": {
1386
+ "node": ">=12"
1387
+ }
1388
+ },
1389
+ "node_modules/d3-timer": {
1390
+ "version": "3.0.1",
1391
+ "resolved": "https://registry.npmjs.org/d3-timer/-/d3-timer-3.0.1.tgz",
1392
+ "integrity": "sha512-ndfJ/JxxMd3nw31uyKoY2naivF+r29V+Lc0svZxe1JvvIRmi8hUsrMvdOwgS1o6uBHmiz91geQ0ylPP0aj1VUA==",
1393
+ "license": "ISC",
1394
+ "engines": {
1395
+ "node": ">=12"
1396
+ }
1397
+ },
1398
  "node_modules/debug": {
1399
  "version": "4.4.3",
1400
  "resolved": "https://registry.npmjs.org/debug/-/debug-4.4.3.tgz",
 
1413
  }
1414
  }
1415
  },
1416
+ "node_modules/decimal.js-light": {
1417
+ "version": "2.5.1",
1418
+ "resolved": "https://registry.npmjs.org/decimal.js-light/-/decimal.js-light-2.5.1.tgz",
1419
+ "integrity": "sha512-qIMFpTMZmny+MMIitAB6D7iVPEorVw6YQRWkvarTkT4tBeSLLiHzcwj6q0MmYSFCiVpiqPJTJEYIrpcPzVEIvg==",
1420
+ "license": "MIT"
1421
+ },
1422
  "node_modules/deep-is": {
1423
  "version": "0.1.4",
1424
  "resolved": "https://registry.npmjs.org/deep-is/-/deep-is-0.1.4.tgz",
 
1443
  "dev": true,
1444
  "license": "ISC"
1445
  },
1446
+ "node_modules/es-toolkit": {
1447
+ "version": "1.46.0",
1448
+ "resolved": "https://registry.npmjs.org/es-toolkit/-/es-toolkit-1.46.0.tgz",
1449
+ "integrity": "sha512-IToJ6ct9OLl5zz6WsC/1vZEwfSZ7Myil+ygl5Tf30Xjn9AEkzNB4kqp2G7VUJKF1DtTx/ra5M5KLlXvzOg51BA==",
1450
+ "license": "MIT",
1451
+ "workspaces": [
1452
+ "docs",
1453
+ "benchmarks"
1454
+ ]
1455
+ },
1456
  "node_modules/escalade": {
1457
  "version": "3.2.0",
1458
  "resolved": "https://registry.npmjs.org/escalade/-/escalade-3.2.0.tgz",
 
1661
  "node": ">=0.10.0"
1662
  }
1663
  },
1664
+ "node_modules/eventemitter3": {
1665
+ "version": "5.0.4",
1666
+ "resolved": "https://registry.npmjs.org/eventemitter3/-/eventemitter3-5.0.4.tgz",
1667
+ "integrity": "sha512-mlsTRyGaPBjPedk6Bvw+aqbsXDtoAyAzm5MO7JgU+yVRyMQ5O8bD4Kcci7BS85f93veegeCPkL8R4GLClnjLFw==",
1668
+ "license": "MIT"
1669
+ },
1670
  "node_modules/fast-deep-equal": {
1671
  "version": "3.1.3",
1672
  "resolved": "https://registry.npmjs.org/fast-deep-equal/-/fast-deep-equal-3.1.3.tgz",
 
1845
  "node": ">= 4"
1846
  }
1847
  },
1848
+ "node_modules/immer": {
1849
+ "version": "10.2.0",
1850
+ "resolved": "https://registry.npmjs.org/immer/-/immer-10.2.0.tgz",
1851
+ "integrity": "sha512-d/+XTN3zfODyjr89gM3mPq1WNX2B8pYsu7eORitdwyA2sBubnTl3laYlBk4sXY5FUa5qTZGBDPJICVbvqzjlbw==",
1852
+ "license": "MIT",
1853
+ "funding": {
1854
+ "type": "opencollective",
1855
+ "url": "https://opencollective.com/immer"
1856
+ }
1857
+ },
1858
  "node_modules/import-fresh": {
1859
  "version": "3.3.1",
1860
  "resolved": "https://registry.npmjs.org/import-fresh/-/import-fresh-3.3.1.tgz",
 
1882
  "node": ">=0.8.19"
1883
  }
1884
  },
1885
+ "node_modules/internmap": {
1886
+ "version": "2.0.3",
1887
+ "resolved": "https://registry.npmjs.org/internmap/-/internmap-2.0.3.tgz",
1888
+ "integrity": "sha512-5Hh7Y1wQbvY5ooGgPbDaL5iYLAPzMTUrjMulskHLH6wnv/A+1q5rgEaiuqEjB+oxGXIVZs1FF+R/KPN3ZSQYYg==",
1889
+ "license": "ISC",
1890
+ "engines": {
1891
+ "node": ">=12"
1892
+ }
1893
+ },
1894
  "node_modules/is-extglob": {
1895
  "version": "2.1.1",
1896
  "resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-2.1.1.tgz",
 
2527
  "resolved": "https://registry.npmjs.org/react-dom/-/react-dom-19.2.5.tgz",
2528
  "integrity": "sha512-J5bAZz+DXMMwW/wV3xzKke59Af6CHY7G4uYLN1OvBcKEsWOs4pQExj86BBKamxl/Ik5bx9whOrvBlSDfWzgSag==",
2529
  "license": "MIT",
2530
+ "peer": true,
2531
  "dependencies": {
2532
  "scheduler": "^0.27.0"
2533
  },
 
2535
  "react": "^19.2.5"
2536
  }
2537
  },
2538
+ "node_modules/react-is": {
2539
+ "version": "19.2.5",
2540
+ "resolved": "https://registry.npmjs.org/react-is/-/react-is-19.2.5.tgz",
2541
+ "integrity": "sha512-Dn0t8IQhCmeIT3wu+Apm1/YVsJXsGWi6k4sPdnBIdqMVtHtv0IGi6dcpNpNkNac0zB2uUAqNX3MHzN8c+z2rwQ==",
2542
+ "license": "MIT",
2543
+ "peer": true
2544
+ },
2545
+ "node_modules/react-redux": {
2546
+ "version": "9.2.0",
2547
+ "resolved": "https://registry.npmjs.org/react-redux/-/react-redux-9.2.0.tgz",
2548
+ "integrity": "sha512-ROY9fvHhwOD9ySfrF0wmvu//bKCQ6AeZZq1nJNtbDC+kk5DuSuNX/n6YWYF/SYy7bSba4D4FSz8DJeKY/S/r+g==",
2549
+ "license": "MIT",
2550
+ "peer": true,
2551
+ "dependencies": {
2552
+ "@types/use-sync-external-store": "^0.0.6",
2553
+ "use-sync-external-store": "^1.4.0"
2554
+ },
2555
+ "peerDependencies": {
2556
+ "@types/react": "^18.2.25 || ^19",
2557
+ "react": "^18.0 || ^19",
2558
+ "redux": "^5.0.0"
2559
+ },
2560
+ "peerDependenciesMeta": {
2561
+ "@types/react": {
2562
+ "optional": true
2563
+ },
2564
+ "redux": {
2565
+ "optional": true
2566
+ }
2567
+ }
2568
+ },
2569
+ "node_modules/recharts": {
2570
+ "version": "3.8.1",
2571
+ "resolved": "https://registry.npmjs.org/recharts/-/recharts-3.8.1.tgz",
2572
+ "integrity": "sha512-mwzmO1s9sFL0TduUpwndxCUNoXsBw3u3E/0+A+cLcrSfQitSG62L32N69GhqUrrT5qKcAE3pCGVINC6pqkBBQg==",
2573
+ "license": "MIT",
2574
+ "workspaces": [
2575
+ "www"
2576
+ ],
2577
+ "dependencies": {
2578
+ "@reduxjs/toolkit": "^1.9.0 || 2.x.x",
2579
+ "clsx": "^2.1.1",
2580
+ "decimal.js-light": "^2.5.1",
2581
+ "es-toolkit": "^1.39.3",
2582
+ "eventemitter3": "^5.0.1",
2583
+ "immer": "^10.1.1",
2584
+ "react-redux": "8.x.x || 9.x.x",
2585
+ "reselect": "5.1.1",
2586
+ "tiny-invariant": "^1.3.3",
2587
+ "use-sync-external-store": "^1.2.2",
2588
+ "victory-vendor": "^37.0.2"
2589
+ },
2590
+ "engines": {
2591
+ "node": ">=18"
2592
+ },
2593
+ "peerDependencies": {
2594
+ "react": "^16.8.0 || ^17.0.0 || ^18.0.0 || ^19.0.0",
2595
+ "react-dom": "^16.0.0 || ^17.0.0 || ^18.0.0 || ^19.0.0",
2596
+ "react-is": "^16.8.0 || ^17.0.0 || ^18.0.0 || ^19.0.0"
2597
+ }
2598
+ },
2599
+ "node_modules/redux": {
2600
+ "version": "5.0.1",
2601
+ "resolved": "https://registry.npmjs.org/redux/-/redux-5.0.1.tgz",
2602
+ "integrity": "sha512-M9/ELqF6fy8FwmkpnF0S3YKOqMyoWJ4+CS5Efg2ct3oY9daQvd/Pc71FpGZsVsbl3Cpb+IIcjBDUnnyBdQbq4w==",
2603
+ "license": "MIT",
2604
+ "peer": true
2605
+ },
2606
+ "node_modules/redux-thunk": {
2607
+ "version": "3.1.0",
2608
+ "resolved": "https://registry.npmjs.org/redux-thunk/-/redux-thunk-3.1.0.tgz",
2609
+ "integrity": "sha512-NW2r5T6ksUKXCabzhL9z+h206HQw/NJkcLm1GPImRQ8IzfXwRGqjVhKJGauHirT0DAuyy6hjdnMZaRoAcy0Klw==",
2610
+ "license": "MIT",
2611
+ "peerDependencies": {
2612
+ "redux": "^5.0.0"
2613
+ }
2614
+ },
2615
+ "node_modules/reselect": {
2616
+ "version": "5.1.1",
2617
+ "resolved": "https://registry.npmjs.org/reselect/-/reselect-5.1.1.tgz",
2618
+ "integrity": "sha512-K/BG6eIky/SBpzfHZv/dd+9JBFiS4SWV7FIujVyJRux6e45+73RaUHXLmIR1f7WOMaQ0U1km6qwklRQxpJJY0w==",
2619
+ "license": "MIT"
2620
+ },
2621
  "node_modules/resolve-from": {
2622
  "version": "4.0.0",
2623
  "resolved": "https://registry.npmjs.org/resolve-from/-/resolve-from-4.0.0.tgz",
 
2744
  "node": ">=8"
2745
  }
2746
  },
2747
+ "node_modules/tiny-invariant": {
2748
+ "version": "1.3.3",
2749
+ "resolved": "https://registry.npmjs.org/tiny-invariant/-/tiny-invariant-1.3.3.tgz",
2750
+ "integrity": "sha512-+FbBPE1o9QAYvviau/qC5SE3caw21q3xkvWKBtja5vgqOWIHHJ3ioaq1VPfn/Szqctz2bU/oYeKd9/z5BL+PVg==",
2751
+ "license": "MIT"
2752
+ },
2753
  "node_modules/tinyglobby": {
2754
  "version": "0.2.16",
2755
  "resolved": "https://registry.npmjs.org/tinyglobby/-/tinyglobby-0.2.16.tgz",
 
2829
  "punycode": "^2.1.0"
2830
  }
2831
  },
2832
+ "node_modules/use-sync-external-store": {
2833
+ "version": "1.6.0",
2834
+ "resolved": "https://registry.npmjs.org/use-sync-external-store/-/use-sync-external-store-1.6.0.tgz",
2835
+ "integrity": "sha512-Pp6GSwGP/NrPIrxVFAIkOQeyw8lFenOHijQWkUTrDvrF4ALqylP2C/KCkeS9dpUM3KvYRQhna5vt7IL95+ZQ9w==",
2836
+ "license": "MIT",
2837
+ "peerDependencies": {
2838
+ "react": "^16.8.0 || ^17.0.0 || ^18.0.0 || ^19.0.0"
2839
+ }
2840
+ },
2841
+ "node_modules/victory-vendor": {
2842
+ "version": "37.3.6",
2843
+ "resolved": "https://registry.npmjs.org/victory-vendor/-/victory-vendor-37.3.6.tgz",
2844
+ "integrity": "sha512-SbPDPdDBYp+5MJHhBCAyI7wKM3d5ivekigc2Dk2s7pgbZ9wIgIBYGVw4zGHBml/qTFbexrofXW6Gu4noGxrOwQ==",
2845
+ "license": "MIT AND ISC",
2846
+ "dependencies": {
2847
+ "@types/d3-array": "^3.0.3",
2848
+ "@types/d3-ease": "^3.0.0",
2849
+ "@types/d3-interpolate": "^3.0.1",
2850
+ "@types/d3-scale": "^4.0.2",
2851
+ "@types/d3-shape": "^3.1.0",
2852
+ "@types/d3-time": "^3.0.0",
2853
+ "@types/d3-timer": "^3.0.0",
2854
+ "d3-array": "^3.1.6",
2855
+ "d3-ease": "^3.0.1",
2856
+ "d3-interpolate": "^3.0.1",
2857
+ "d3-scale": "^4.0.2",
2858
+ "d3-shape": "^3.1.0",
2859
+ "d3-time": "^3.0.0",
2860
+ "d3-timer": "^3.0.1"
2861
+ }
2862
+ },
2863
  "node_modules/vite": {
2864
  "version": "8.0.9",
2865
  "resolved": "https://registry.npmjs.org/vite/-/vite-8.0.9.tgz",
frontend/package.json CHANGED
@@ -11,7 +11,8 @@
   },
   "dependencies": {
     "react": "^19.2.5",
-    "react-dom": "^19.2.5"
+    "react-dom": "^19.2.5",
+    "recharts": "^3.8.1"
   },
   "devDependencies": {
     "@eslint/js": "^9.39.4",
frontend/src/CodeArenaRL.jsx CHANGED
@@ -1,5 +1,5 @@
1
-
2
  import React, { useState, useEffect, useRef, useCallback } from "react";
 
3
 
4
  /* ─────────────────────────────────────────────
5
  GOOGLE FONTS
@@ -129,6 +129,12 @@ const GlobalStyles = () => (
129
  TASKS (mirrors server tasks — display only)
130
  ───────────────────────────────────────────── */
131
  const TASKS = {
 
 
 
 
 
 
132
  "easy-1": {
133
  id: "easy-1", label: "Easy", name: "Fix average_list()", difficulty: "easy",
134
  description: "Fix syntax errors: missing colon after def and uses length() instead of len().",
@@ -176,37 +182,26 @@ function AnsiLine({ text }) {
176
  }
177
 
178
  /* ─────────────────────────────────────────────
179
- REWARD CHART
180
  ───────────────────────────────────────────── */
181
  function RewardChart({ rewards }) {
182
- const W = 260, H = 100, PAD = 20;
183
- const pts = rewards.map((r, i) => ({
184
- x: PAD + (i / Math.max(4, 1)) * (W - PAD * 2),
185
- y: PAD + (1 - r) * (H - PAD * 2),
186
- r,
187
- }));
188
- const pathD = pts.length > 1 ? pts.reduce((a, p, i) => i === 0 ? `M${p.x},${p.y}` : a + ` L${p.x},${p.y}`, "") : "";
189
- const areaD = pts.length > 1 ? `${pathD} L${pts[pts.length - 1].x},${H - PAD} L${pts[0].x},${H - PAD} Z` : "";
190
  return (
191
- <svg width="100%" viewBox={`0 0 ${W} ${H}`}>
192
- <defs>
193
- <linearGradient id="rg" x1="0" y1="0" x2="0" y2="1">
194
- <stop offset="0%" stopColor="#00ff88" stopOpacity="0.3" />
195
- <stop offset="100%" stopColor="#00ff88" stopOpacity="0" />
196
- </linearGradient>
197
- </defs>
198
- {[0, 0.5, 1].map(v => {
199
- const y = PAD + (1 - v) * (H - PAD * 2);
200
- return <line key={v} x1={PAD} y1={y} x2={W - PAD} y2={y} stroke="#1e293b" strokeWidth="1" strokeDasharray="3,3" />;
201
- })}
202
- {[1, 2, 3, 4, 5].map(s => (
203
- <text key={s} x={PAD + ((s - 1) / 4) * (W - PAD * 2)} y={H - 4}
204
- fill="#334155" fontSize="8" textAnchor="middle" fontFamily="JetBrains Mono">{s}</text>
205
- ))}
206
- {areaD && <path d={areaD} fill="url(#rg)" />}
207
- {pathD && <path d={pathD} fill="none" stroke="#00ff88" strokeWidth="2" strokeLinecap="round" strokeLinejoin="round" />}
208
- {pts.map((p, i) => <circle key={i} cx={p.x} cy={p.y} r="4" fill="#0a0e1a" stroke={rewardColor(p.r)} strokeWidth="2" />)}
209
- </svg>
210
  );
211
  }
212
 
@@ -226,6 +221,9 @@ export default function CodeArenaRL() {
226
 
227
  /* ── Task & episode state ── */
228
  const [selectedTask, setSelectedTask] = useState("easy-1");
 
 
 
229
  const [envState, setEnvState] = useState(null); // observation from server
230
  const [uiMode, setUiMode] = useState("idle"); // idle|resetting|agent_thinking|executing|done
231
  const [episodeLog, setEpisodeLog] = useState([]);
@@ -305,7 +303,7 @@ export default function CodeArenaRL() {
305
  });
306
  if (!res.ok) throw new Error(`/reset failed: ${res.status}`);
307
  const data = await res.json();
308
- return data.observation; // { buggy_code, error_log, test_results, previous_attempts }
309
  }, [envUrl]);
310
 
311
  const envStep = useCallback(async (proposedFix) => {
@@ -463,6 +461,9 @@ export default function CodeArenaRL() {
463
  setManualCode(""); setTokenEst(0);
464
  setCollapsedEntries(new Set());
465
  setErrorBanner("");
 
 
 
466
  }, []);
467
 
468
  /* ──────────────────────────────────────────
@@ -512,6 +513,7 @@ export default function CodeArenaRL() {
512
 
513
  const { observation: newObs, reward, done } = stepResult;
514
  const meta = stepResult.info?.execution_metadata || {};
 
515
  const passed = meta.test_passed ?? 0;
516
  const total = meta.test_total ?? task.hints.length + 1;
517
  const newStep = currentStepCount + 1;
@@ -526,11 +528,19 @@ export default function CodeArenaRL() {
526
  setStepCount(newStep);
527
  setRewards(prev => [...prev, reward]);
528
  setIsDone(done);
 
 
 
 
 
529
 
530
  const logEntry = {
531
  step: newStep,
532
  code_submitted: fixedCode,
533
  reward, done, passed, total,
 
 
 
534
  error_log: newObs?.error_log || "",
535
  test_results: newObs?.test_results || "",
536
  timestamp: new Date().toISOString(),
@@ -570,9 +580,9 @@ export default function CodeArenaRL() {
570
  runningRef.current = true;
571
  setUiMode("resetting");
572
 
573
- let initialObs;
574
  try {
575
- initialObs = await envReset(selectedTask);
576
  } catch (err) {
577
  setErrorBanner(`🌐 OpenEnv /reset Error: ${err.message}`);
578
  setUiMode("idle");
@@ -580,6 +590,9 @@ export default function CodeArenaRL() {
580
  return;
581
  }
582
 
 
 
 
583
  setEnvState(initialObs);
584
  setTimeout(() => runStep(initialObs, 0), 400);
585
  }, [ollamaStatus, envStatus, manualMode, resetEpisode, envReset, selectedTask, runStep]);
@@ -865,7 +878,14 @@ export default function CodeArenaRL() {
865
  <div className="panel">
866
  <div className="panel-header">
867
  <span style={{ color: "#ff4455" }}>⚠</span>&nbsp;Buggy Code
868
- <span style={{ marginLeft: "auto" }}><span className={`badge badge-${task.difficulty}`}>{task.id}</span></span>
 
 
 
 
 
 
 
869
  </div>
870
  <div style={{ padding: 14 }}>
871
  <pre className="code-block" style={{ color: "#f8c8c8", maxHeight: 170, overflowY: "auto" }}>
@@ -1009,6 +1029,30 @@ export default function CodeArenaRL() {
1009
  </div>
1010
  )}
1011
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1012
  {/* Episode Log */}
1013
  <div className="panel" style={{ flex: 1, display: "flex", flexDirection: "column" }}>
1014
  <div className="panel-header" style={{ justifyContent: "space-between" }}>
 
 
1
  import React, { useState, useEffect, useRef, useCallback } from "react";
2
+ import { LineChart, Line, XAxis, YAxis, Tooltip, ResponsiveContainer, ReferenceLine } from "recharts";
3
 
4
  /* ─────────────────────────────────────────────
5
  GOOGLE FONTS
 
129
  TASKS (mirrors server tasks — display only)
130
  ───────────────────────────────────────────── */
131
  const TASKS = {
132
+ "auto": {
133
+ id: "auto", label: "Auto", name: "Adaptive Curriculum", difficulty: "info",
134
+ description: "Automatically selects difficulty based on recent performance history.",
135
+ hints: ["If avg < 0.4 -> Easy", "If avg <= 0.75 -> Medium", "Else -> Hard"],
136
+ buggy_code: "# Click Start Episode to fetch task",
137
+ },
138
  "easy-1": {
139
  id: "easy-1", label: "Easy", name: "Fix average_list()", difficulty: "easy",
140
  description: "Fix syntax errors: missing colon after def and uses length() instead of len().",
 
182
  }
183
 
184
  /* ─────────────────────────────────────────────
185
+ REWARD CHART (Recharts)
186
  ───────────────────────────────────────────── */
187
  function RewardChart({ rewards }) {
188
+ const data = rewards.map((r, i) => ({ step: i + 1, reward: r }));
189
+ for (let i = data.length + 1; i <= 5; i++) {
190
+ data.push({ step: i, reward: null });
191
+ }
 
 
 
 
192
  return (
193
+ <div style={{ width: "100%", height: 120 }}>
194
+ <ResponsiveContainer width="100%" height="100%">
195
+ <LineChart data={data} margin={{ top: 10, right: 10, left: -20, bottom: 0 }}>
196
+ <XAxis dataKey="step" stroke="#334155" tick={{ fill: "#334155", fontSize: 10, fontFamily: "'JetBrains Mono',monospace" }} />
197
+ <YAxis domain={[0, 1]} ticks={[0, 0.5, 1]} stroke="#334155" tick={{ fill: "#334155", fontSize: 10, fontFamily: "'JetBrains Mono',monospace" }} />
198
+ <ReferenceLine y={0.5} stroke="#334155" strokeDasharray="3 3" />
199
+ <ReferenceLine y={1.0} stroke="#334155" strokeDasharray="3 3" />
200
+ <Tooltip contentStyle={{ backgroundColor: "#0f172a", border: "1px solid #1e293b", borderRadius: 4, fontFamily: "'JetBrains Mono',monospace", fontSize: 10 }} itemStyle={{ color: "#00ff88" }} />
201
+ <Line type="monotone" dataKey="reward" stroke="#00ff88" strokeWidth={2} dot={{ fill: "#0a0e1a", stroke: "#00ff88", strokeWidth: 2, r: 4 }} isAnimationActive={true} />
202
+ </LineChart>
203
+ </ResponsiveContainer>
204
+ </div>
 
 
 
 
 
 
 
205
  );
206
  }
207
 
 
221
 
222
  /* ── Task & episode state ── */
223
  const [selectedTask, setSelectedTask] = useState("easy-1");
224
+ const [currentEnvTask, setCurrentEnvTask] = useState("");
225
+ const [currentEnvDifficulty, setCurrentEnvDifficulty] = useState("");
226
+ const [currentRewardComponents, setCurrentRewardComponents] = useState({ compile_score: 0, test_ratio: 0, efficiency_score: 0 });
227
  const [envState, setEnvState] = useState(null); // observation from server
228
  const [uiMode, setUiMode] = useState("idle"); // idle|resetting|agent_thinking|executing|done
229
  const [episodeLog, setEpisodeLog] = useState([]);
 
303
  });
304
  if (!res.ok) throw new Error(`/reset failed: ${res.status}`);
305
  const data = await res.json();
306
+ return data; // { observation, info }
307
  }, [envUrl]);
308
 
309
  const envStep = useCallback(async (proposedFix) => {
 
461
  setManualCode(""); setTokenEst(0);
462
  setCollapsedEntries(new Set());
463
  setErrorBanner("");
464
+ setCurrentEnvTask("");
465
+ setCurrentEnvDifficulty("");
466
+ setCurrentRewardComponents({ compile_score: 0, test_ratio: 0, efficiency_score: 0 });
467
  }, []);
468
 
469
  /* ──────────────────────────────────────────
 
513
 
514
  const { observation: newObs, reward, done } = stepResult;
515
  const meta = stepResult.info?.execution_metadata || {};
516
+ const rc = stepResult.info?.reward_components || {};
517
  const passed = meta.test_passed ?? 0;
518
  const total = meta.test_total ?? task.hints.length + 1;
519
  const newStep = currentStepCount + 1;
 
528
  setStepCount(newStep);
529
  setRewards(prev => [...prev, reward]);
530
  setIsDone(done);
531
+ setCurrentRewardComponents({
532
+ compile_score: rc.compile_score || 0,
533
+ test_ratio: rc.test_ratio || 0,
534
+ efficiency_score: rc.efficiency || 0,
535
+ });
536
 
537
  const logEntry = {
538
  step: newStep,
539
  code_submitted: fixedCode,
540
  reward, done, passed, total,
541
+ compile_score: rc.compile_score || 0,
542
+ test_ratio: rc.test_ratio || 0,
543
+ efficiency_score: rc.efficiency || 0,
544
  error_log: newObs?.error_log || "",
545
  test_results: newObs?.test_results || "",
546
  timestamp: new Date().toISOString(),
 
580
  runningRef.current = true;
581
  setUiMode("resetting");
582
 
583
+ let initialResp;
584
  try {
585
+ initialResp = await envReset(selectedTask);
586
  } catch (err) {
587
  setErrorBanner(`🌐 OpenEnv /reset Error: ${err.message}`);
588
  setUiMode("idle");
 
590
  return;
591
  }
592
 
593
+ const initialObs = initialResp.observation;
594
+ setCurrentEnvTask(initialResp.info?.task_id || selectedTask);
595
+ setCurrentEnvDifficulty(initialResp.info?.difficulty || "");
596
  setEnvState(initialObs);
597
  setTimeout(() => runStep(initialObs, 0), 400);
598
  }, [ollamaStatus, envStatus, manualMode, resetEpisode, envReset, selectedTask, runStep]);
 
878
  <div className="panel">
879
  <div className="panel-header">
880
  <span style={{ color: "#ff4455" }}>⚠</span>&nbsp;Buggy Code
881
+ <span style={{ marginLeft: "auto", display: "flex", gap: 6 }}>
882
+ {currentEnvDifficulty && (
883
+ <span className={`badge badge-${currentEnvDifficulty.toLowerCase()}`}>
884
+ {currentEnvDifficulty}
885
+ </span>
886
+ )}
887
+ <span className={`badge badge-${task.difficulty}`}>{currentEnvTask || task.id}</span>
888
+ </span>
889
  </div>
890
  <div style={{ padding: 14 }}>
891
  <pre className="code-block" style={{ color: "#f8c8c8", maxHeight: 170, overflowY: "auto" }}>
 
1029
  </div>
1030
  )}
1031
 
1032
+ {/* Live Reward Components */}
1033
+ {stepCount > 0 && (
1034
+ <div className="panel fade-in">
1035
+ <div className="panel-header">🏅 &nbsp;Reward Components</div>
1036
+ <div style={{ padding: "12px 14px", display: "flex", flexDirection: "column", gap: 12 }}>
1037
+ {[
1038
+ { label: "Compile Score", val: currentRewardComponents.compile_score },
1039
+ { label: "Test Pass Ratio", val: currentRewardComponents.test_ratio },
1040
+ { label: "Efficiency", val: currentRewardComponents.efficiency_score },
1041
+ ].map(c => (
1042
+ <div key={c.label}>
1043
+ <div style={{ display: "flex", justifyContent: "space-between", fontSize: 10, fontFamily: "'JetBrains Mono',monospace", color: "#64748b", marginBottom: 4 }}>
1044
+ <span>{c.label}</span>
1045
+ <span style={{ color: rewardColor(c.val) }}>{c.val.toFixed(2)}</span>
1046
+ </div>
1047
+ <div className="reward-bar-outer" style={{ marginTop: 0, height: 4 }}>
1048
+ <div className="reward-bar-inner" style={{ width: `${c.val * 100}%`, background: `linear-gradient(90deg, ${rewardColor(0)}, ${rewardColor(c.val)})` }} />
1049
+ </div>
1050
+ </div>
1051
+ ))}
1052
+ </div>
1053
+ </div>
1054
+ )}
1055
+
1056
  {/* Episode Log */}
1057
  <div className="panel" style={{ flex: 1, display: "flex", flexDirection: "column" }}>
1058
  <div className="panel-header" style={{ justifyContent: "space-between" }}>
inference.py CHANGED
@@ -4,21 +4,27 @@ Rewritten for strict OpenEnv parsing.
 """

 import os
+import argparse
 import httpx
+from datetime import datetime
 from openai import OpenAI

-def run_task(task_id: str):
+def run_task(task_id: str, backend: str):
     # Retrieve environment variables as instructed
     base_url = os.environ.get("API_BASE_URL")
     api_key = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY")
     model_name = os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")

-    # We pass base_url explicitly. If os.environ["API_BASE_URL"] was strictly intended,
-    # it is fine since OpenAI client accepts None for default.
-    client = OpenAI(
-        base_url=base_url,
-        api_key=api_key or "NO_KEY_PROVIDED"
-    )
+    hf_pipeline = None
+    client = None
+    if backend == "hf":
+        from transformers import pipeline
+        hf_pipeline = pipeline("text-generation", model=model_name)
+    else:
+        client = OpenAI(
+            base_url=base_url,
+            api_key=api_key or "NO_KEY_PROVIDED"
+        )

     # 1. Print the [START] line
     print(f"[START] task={task_id} env=codearena-rl-benchmark model={model_name}")
@@ -58,14 +64,19 @@ def run_task(task_id: str):

         # 3b/c. Call the LLM
         try:
-            completion = client.chat.completions.create(
-                model=model_name,
-                messages=[
-                    {"role": "system", "content": system_prompt},
-                    {"role": "user", "content": user_prompt}
-                ]
-            )
-            proposed_fix = completion.choices[0].message.content
+            if backend == "hf":
+                prompt = f"{system_prompt}\n\n{user_prompt}"
+                output = hf_pipeline(prompt, max_new_tokens=512, return_full_text=False)
+                proposed_fix = output[0]["generated_text"]
+            else:
+                completion = client.chat.completions.create(
+                    model=model_name,
+                    messages=[
+                        {"role": "system", "content": system_prompt},
+                        {"role": "user", "content": user_prompt}
+                    ]
+                )
+                proposed_fix = completion.choices[0].message.content
         except Exception as e:
             error_msg = str(e).replace("\n", " ").replace("\r", "")
             # If the LLM call fails, use this fallback fix
@@ -106,6 +117,22 @@ def run_task(task_id: str):
         print(f"[STEP] step={step} action={action_summary} reward={reward:.2f} done={done_str} error={error_msg}")

     # 4. Print [END]
+    timestamp = datetime.now().isoformat()
+    compile_score, test_ratio, efficiency_score = 0.0, 0.0, 0.0
+    if "info" in obs_json and "reward_components" in obs_json["info"]:
+        rc = obs_json["info"]["reward_components"]
+        compile_score = rc.get("compile_score", 0.0)
+        test_ratio = rc.get("test_ratio", 0.0)
+        efficiency_score = rc.get("efficiency", 0.0)
+
+    final_reward = rewards[-1] if rewards else 0.0
+    csv_path = "rewards_log.csv"
+    write_headers = not os.path.exists(csv_path)
+    with open(csv_path, "a", encoding="utf-8") as f:
+        if write_headers:
+            f.write("timestamp,task_id,step,reward,compile_score,test_ratio,efficiency_score\n")
+        f.write(f"{timestamp},{task_id},{step},{final_reward},{compile_score},{test_ratio},{efficiency_score}\n")
+
     success = any(r > 0.5 for r in rewards)
     success_str = "true" if success else "false"
     rewards_str = ",".join([f"{r:.2f}" for r in rewards])
@@ -113,12 +140,16 @@ def run_task(task_id: str):
     print(f"[END] success={success_str} steps={step} score={score:.2f} rewards={rewards_str}")

 def main():
+    parser = argparse.ArgumentParser(description="CodeArena RL Inference")
+    parser.add_argument("--backend", type=str, choices=["openai", "hf"], default="openai", help="Backend to use for LLM generation.")
+    args = parser.parse_args()
+
     target_task = os.environ.get("CODEARENA_TASK")
     if target_task:
-        run_task(target_task)
+        run_task(target_task, args.backend)
     else:
         for t in ["easy", "medium", "hard"]:
-            run_task(t)
+            run_task(t, args.backend)

 if __name__ == "__main__":
     main()
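Review note on the `[END]` logging hunk in inference.py: the new code appends one row per episode to `rewards_log.csv` by formatting a comma-separated string by hand. A minimal sketch of the same append-with-header pattern using the standard `csv` module follows; it would also quote a task ID that happens to contain a comma. The names `append_reward_row` and `FIELDS` are hypothetical and do not appear in the commit.

```python
import csv
import os
from datetime import datetime

# Column order copied from the header string written in inference.py.
FIELDS = ["timestamp", "task_id", "step", "reward",
          "compile_score", "test_ratio", "efficiency_score"]


def append_reward_row(csv_path, task_id, step, reward, components):
    """Append one episode row; write the header only when the file is new."""
    write_header = not os.path.exists(csv_path)
    with open(csv_path, "a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now().isoformat(),
            "task_id": task_id,
            "step": step,
            "reward": reward,
            "compile_score": components.get("compile_score", 0.0),
            "test_ratio": components.get("test_ratio", 0.0),
            # The diff reads this component from the key "efficiency".
            "efficiency_score": components.get("efficiency", 0.0),
        })
```

Calling `append_reward_row("rewards_log.csv", "easy", 3, 0.5, {"compile_score": 1.0})` twice should leave a file with one header line and two data rows, which is the shape `plot_rewards.py` reads.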
openenv.yaml CHANGED
@@ -1,7 +1,7 @@
 name: codearena-rl-benchmark
 description: "RL Benchmark for Autonomous Code Repair — iterative debugging with execution feedback"
 version: "1.0.0"
-entrypoint: server.env:CodeArenaEnv
+entrypoint: server.app:CodeArenaEnv

 runtime:
   language: python
@@ -12,6 +12,19 @@ api:
   step: /step
   state: /state

+observation_space:
+  type: json
+  schema:
+    buggy_code: string
+    error_log: string
+    test_results: string
+    previous_attempts: list[string]
+
+action_space:
+  type: json
+  schema:
+    proposed_fix: string
+
 tasks:
   - id: easy
     path: tasks/easy.json
@@ -22,6 +35,12 @@ tasks:
   - id: hard
     path: tasks/hard.json
     grader: server.grader:grade
+  - id: type_errors
+    path: tasks/type_errors/type_error_1.json
+    grader: server.grader:grade
+  - id: security_bugs
+    path: tasks/security_bugs/security_bug_1.json
+    grader: server.grader:grade

 limits:
   step_timeout_seconds: 2
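Review note on the `observation_space` and `action_space` blocks added to openenv.yaml: the sketch below shows one way a client could check a payload against those field names and types using plain Python. `OBSERVATION_SCHEMA`, `ACTION_SCHEMA` and `validate` are hypothetical helpers written for illustration; they are not part of OpenEnv or of this commit.

```python
# Field names and types copied from the schema blocks in openenv.yaml.
OBSERVATION_SCHEMA = {
    "buggy_code": str,
    "error_log": str,
    "test_results": str,
    "previous_attempts": list,  # declared as list[string] in the YAML
}
ACTION_SCHEMA = {"proposed_fix": str}


def validate(payload, schema):
    """Return a list of problems; an empty list means the payload matches."""
    problems = []
    for key, expected in schema.items():
        if key not in payload:
            problems.append(f"missing field: {key}")
        elif not isinstance(payload[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
        elif expected is list and not all(isinstance(x, str) for x in payload[key]):
            problems.append(f"{key}: expected a list of strings")
    return problems
```

For example, `validate({"proposed_fix": "def f(): pass"}, ACTION_SCHEMA)` returns an empty list, while an observation without `error_log` yields a single `missing field` entry.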
plot_rewards.py ADDED
@@ -0,0 +1,53 @@
+import pandas as pd
+import matplotlib.pyplot as plt
+import os
+
+def main():
+    os.makedirs('results', exist_ok=True)
+
+    if not os.path.exists('rewards_log.csv'):
+        print("No rewards_log.csv found. Run inference first.")
+        return
+
+    try:
+        df = pd.read_csv('rewards_log.csv')
+    except Exception as e:
+        print(f"Error reading CSV: {e}")
+        return
+
+    if df.empty:
+        print("rewards_log.csv is empty.")
+        return
+
+    # Plot 1: Reward Curve over Training Steps (using index as training step)
+    plt.figure(figsize=(10, 6))
+    plt.plot(df.index, df['reward'], alpha=0.3, label='Episode Reward')
+
+    # 10-step rolling average
+    rolling_avg = df['reward'].rolling(window=10, min_periods=1).mean()
+    plt.plot(df.index, rolling_avg, color='red', linewidth=2, label='10-step Rolling Average')
+
+    plt.xlabel('Training Step')
+    plt.ylabel('Episode Reward (0-1)')
+    plt.title('Reward Curve')
+    plt.legend()
+    plt.grid(True, alpha=0.3)
+    plt.savefig('results/reward_curve.png')
+    plt.close()
+
+    # Plot 2: Average Reward per Task ID
+    plt.figure(figsize=(10, 6))
+    avg_per_task = df.groupby('task_id')['reward'].mean().sort_values()
+    avg_per_task.plot(kind='barh', color='skyblue')
+    plt.xlabel('Average Episode Reward (0-1)')
+    plt.ylabel('Task ID')
+    plt.title('Average Reward by Task ID')
+    plt.grid(axis='x', alpha=0.3)
+    plt.tight_layout()
+    plt.savefig('results/reward_by_task.png')
+    plt.close()
+
+    print("Plots saved to results/ directory.")
+
+if __name__ == "__main__":
+    main()
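Review note on the rolling average in plot_rewards.py: `rolling(window=10, min_periods=1).mean()` is a trailing mean over at most the last 10 rows, and `min_periods=1` means the first rows are averaged over however many rows exist so far instead of being left empty. The pure-Python sketch below is meant to mirror that behavior for data without missing values; `rolling_mean` is a hypothetical helper, not code from the commit.

```python
def rolling_mean(values, window=10):
    """Trailing mean over up to `window` items, defined from the first item on."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)   # clamp the window at the start of the data
        chunk = values[start:i + 1]      # current item plus up to window-1 predecessors
        out.append(sum(chunk) / len(chunk))
    return out
```

With `window=2`, the input `[1, 2, 3, 4]` gives `[1.0, 1.5, 2.5, 3.5]`: the first value is averaged alone, every later value with its predecessor.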
server/app.py CHANGED
@@ -42,8 +42,21 @@ class CodeArenaEnv:
42
  self.is_done = False
43
  self.step_count = 0
44
  self.max_steps = 5
 
45
 
46
  def reset(self, task_id: str = "easy") -> CodeArenaObservation:
 
 
 
 
 
 
 
 
 
 
 
 
47
  # Priority: exact task_id match → difficulty match → random
48
  if task_id in TASK_ID_MAP:
49
  self.current_task = TASK_ID_MAP[task_id]
@@ -71,7 +84,13 @@ class CodeArenaEnv:
71
  timeout=max(self.current_task.optimal_time_seconds * 10, 2.0),
72
  )
73
 
74
- reward = calculate_reward(exec_result, self.current_task)
 
 
 
 
 
 
75
 
76
  self.previous_attempts.append(action.proposed_fix)
77
  self.last_error_log = exec_result.runtime_errors
@@ -79,14 +98,18 @@ class CodeArenaEnv:
79
  f"{exec_result.test_passed}/{exec_result.test_total} tests passed."
80
  )
81
 
82
- if reward > 0.99 or self.step_count >= self.max_steps:
83
  self.is_done = True
 
 
 
84
 
85
  info = {
86
  "execution_metadata": exec_result.model_dump(),
87
  "task_id": self.current_task.task_id,
 
88
  }
89
- return self._state(), reward, self.is_done, info
90
 
91
  def _state(self) -> CodeArenaObservation:
92
  if not self.current_task:
@@ -129,6 +152,10 @@ def api_reset(body: ResetRequest = ResetRequest()):
129
  "status": "success",
130
  "message": "Environment reset successfully",
131
  "observation": obs.model_dump(),
 
 
 
 
132
  }
133
  except Exception:
134
  traceback.print_exc()
 
42
  self.is_done = False
43
  self.step_count = 0
44
  self.max_steps = 5
45
+ self.episode_rewards_history: list[float] = []
46
 
47
  def reset(self, task_id: str = "easy") -> CodeArenaObservation:
48
+ if task_id == "auto":
49
+ if not self.episode_rewards_history:
50
+ task_id = "easy"
51
+ else:
52
+ avg_reward = sum(self.episode_rewards_history) / len(self.episode_rewards_history)
53
+ if avg_reward < 0.4:
54
+ task_id = "easy"
55
+ elif avg_reward <= 0.75:
56
+ task_id = "medium"
57
+ else:
58
+ task_id = "hard"
59
+
60
  # Priority: exact task_id match → difficulty match → random
61
  if task_id in TASK_ID_MAP:
62
  self.current_task = TASK_ID_MAP[task_id]
 
84
  timeout=max(self.current_task.optimal_time_seconds * 10, 2.0),
85
  )
86
 
87
+ base_reward, reward_components = calculate_reward(exec_result, self.current_task, action.proposed_fix)
88
+
89
+ step_penalty = 0.02 * self.step_count
90
+ novelty_penalty = 0.1 if action.proposed_fix in self.previous_attempts else 0.0
91
+
92
+ final_reward = base_reward - step_penalty - novelty_penalty
93
+ final_reward = max(0.001, min(0.999, float(final_reward)))
94
 
95
  self.previous_attempts.append(action.proposed_fix)
96
  self.last_error_log = exec_result.runtime_errors
 
98
  f"{exec_result.test_passed}/{exec_result.test_total} tests passed."
99
  )
100
 
101
+ if final_reward > 0.99 or self.step_count >= self.max_steps:
102
  self.is_done = True
103
+ self.episode_rewards_history.append(final_reward)
104
+ if len(self.episode_rewards_history) > 5:
105
+ self.episode_rewards_history.pop(0)
106
 
107
  info = {
108
  "execution_metadata": exec_result.model_dump(),
109
  "task_id": self.current_task.task_id,
110
+ "reward_components": reward_components
111
  }
112
+ return self._state(), final_reward, self.is_done, info
113
 
114
  def _state(self) -> CodeArenaObservation:
115
  if not self.current_task:
 
152
  "status": "success",
153
  "message": "Environment reset successfully",
154
  "observation": obs.model_dump(),
155
+ "info": {
156
+ "task_id": _env.current_task.task_id if _env.current_task else "",
157
+ "difficulty": _env.current_task.difficulty if _env.current_task else ""
158
+ }
159
  }
160
  except Exception:
161
  traceback.print_exc()
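Note on the `auto` branch added to `reset()` above: the difficulty is chosen from the mean of the most recent episode-final rewards, with at most five kept. The rule can be sketched in isolation as follows. `pick_difficulty`, `EASY_BELOW`, `MEDIUM_UP_TO` and `HISTORY_SIZE` are hypothetical names introduced for illustration and do not appear in the commit; the thresholds are copied from the hunk. A `deque(maxlen=5)` is used here in place of the manual `pop(0)`, which is an assumption about an equivalent structure, not the commit's code.

```python
from collections import deque

# Thresholds copied from the diff; all names below are illustrative only.
EASY_BELOW = 0.4      # mean reward below this keeps the agent on "easy"
MEDIUM_UP_TO = 0.75   # mean reward up to and including this selects "medium"
HISTORY_SIZE = 5      # number of episode-final rewards that are remembered


def pick_difficulty(history):
    """Map the mean of recent episode rewards to a difficulty label."""
    if not history:
        return "easy"  # no evidence yet, start at the bottom
    avg = sum(history) / len(history)
    if avg < EASY_BELOW:
        return "easy"
    if avg <= MEDIUM_UP_TO:
        return "medium"
    return "hard"


# A bounded deque drops the oldest reward automatically once it is full.
history = deque(maxlen=HISTORY_SIZE)
for reward in (0.2, 0.5, 0.9, 0.9, 0.9, 0.9):
    history.append(reward)

next_level = pick_difficulty(history)
```

Because only the last five episodes count, a single early failure stops influencing the curriculum after five further episodes.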
server/env.py DELETED
@@ -1,116 +0,0 @@
1
- import random
2
- from fastapi import FastAPI, HTTPException
3
- from contextlib import asynccontextmanager
4
-
5
- from .models import CodeArenaObservation, CodeArenaAction, TaskInfo
6
- from .executor import run_code_with_tests
7
- from .grader import calculate_reward, safe_reward, force_valid_reward
8
- from tasks import ALL_TASKS
9
-
10
- class CodeArenaEnv:
11
- def __init__(self):
12
- self.tasks = ALL_TASKS
13
- self.current_task: TaskInfo = None
14
- self.previous_attempts = []
15
- self.last_error_log = ""
16
- self.last_test_results = ""
17
- self.is_done = False
18
- self.step_count = 0
19
- self.max_steps = 5
20
-
21
- def reset(self, task_id: str = None) -> CodeArenaObservation:
22
- if task_id:
23
- matched = [t for t in self.tasks if t.task_id == task_id]
24
- self.current_task = matched[0] if matched else random.choice(self.tasks)
25
- else:
26
- self.current_task = random.choice(self.tasks)
27
- self.previous_attempts = []
28
- self.last_error_log = ""
29
- self.last_test_results = ""
30
- self.is_done = False
31
- self.step_count = 0
32
- return self.state()
33
-
34
- def step(self, action: CodeArenaAction) -> tuple[CodeArenaObservation, float, bool, dict]:
35
- if self.is_done:
36
- raise ValueError("Environment is already done. Call reset().")
37
-
38
- self.step_count += 1
39
-
40
- # Execute the proposed fix with 10x optimal time as a hard timeout limit
41
- exec_result = run_code_with_tests(
42
- code=action.proposed_fix,
43
- test_code=self.current_task.test_code,
44
- timeout=max(self.current_task.optimal_time_seconds * 10, 2.0)
45
- )
46
-
47
- # Calculate Reward
48
- reward = safe_reward(calculate_reward(exec_result, self.current_task))
49
- reward = max(0.001, min(0.999, float(reward)))
50
-
51
- # Update State
52
- self.previous_attempts.append(action.proposed_fix)
53
- self.last_error_log = exec_result.runtime_errors
54
- self.last_test_results = f"{exec_result.test_passed}/{exec_result.test_total} tests passed."
55
-
56
- # Check termination condition
57
- if reward > 0.99 or self.step_count >= self.max_steps:
58
- self.is_done = True
59
-
60
- info = {
61
- "execution_metadata": exec_result.model_dump(),
62
- "task_id": self.current_task.task_id
63
- }
64
-
65
- return self.state(), reward, self.is_done, info
66
-
67
- def state(self) -> CodeArenaObservation:
68
- if not self.current_task:
69
- raise ValueError("Environment not initialized. Call reset() first.")
70
-
71
- return CodeArenaObservation(
72
- buggy_code=self.current_task.buggy_code,
73
- error_log=self.last_error_log,
74
- test_results=self.last_test_results,
75
- previous_attempts=self.previous_attempts,
76
- )
77
-
78
- # Initialize a global environment instance for the FastAPI wrapper
79
- _env = CodeArenaEnv()
80
-
81
- @asynccontextmanager
82
- async def lifespan(app: FastAPI):
83
- _env.reset()
84
- yield
85
-
86
- app = FastAPI(lifespan=lifespan, title="CodeArena RL Environment")
87
-
88
- @app.post("/reset")
89
- def api_reset(body: dict = None):
90
- task_id = (body or {}).get("task_id")
91
- obs = _env.reset(task_id=task_id)
92
- return {"message": "Environment reset successfully", "observation": obs.model_dump()}
93
-
94
- @app.post("/step")
95
- def api_step(action: CodeArenaAction):
96
- try:
97
- obs, reward, done, info = _env.step(action)
98
- # Safety fallback before force_valid_reward
99
- if reward is None:
100
- reward = 0.5
101
- return {
102
- "observation": obs.model_dump(),
103
- "reward": force_valid_reward(reward),
104
- "done": done,
105
- "info": info
106
- }
107
- except ValueError as e:
108
- raise HTTPException(status_code=400, detail=str(e))
109
-
110
- @app.get("/state")
111
- def api_state():
112
- try:
113
- obs = _env.state()
114
- return {"observation": obs.model_dump()}
115
- except ValueError as e:
116
- raise HTTPException(status_code=400, detail=str(e))
server/grader.py CHANGED
@@ -1,6 +1,8 @@
1
  from .models import ExecutionResult, TaskInfo
2
 
3
-
4
  def force_valid_reward(value) -> float:
5
  """Hard guarantee: reward is strictly in (0, 1) — never 0 or 1, no exceptions."""
6
  try:
@@ -16,30 +18,85 @@ def force_valid_reward(value) -> float:
16
 
17
  return r
18
 
19
-
20
  def safe_reward(reward) -> float:
21
  """Clamp reward to open interval (0, 1) via force_valid_reward."""
22
  if reward is None:
23
  reward = 0.5
24
  return force_valid_reward(reward)
25
 
26
-
27
  def normalize_reward(passed: int, total: int) -> float:
28
  if total == 0:
29
  return 0.5
30
  raw = passed / total
31
  return force_valid_reward(raw)
32
 
 
33
 
34
- def calculate_reward(exec_result: ExecutionResult, task_info: TaskInfo) -> float:
35
- reward = normalize_reward(exec_result.test_passed, exec_result.test_total)
36
- return force_valid_reward(reward)
37
 
38
 
39
  def grade(*args, **kwargs) -> float:
40
  try:
41
- if len(args) == 2:
42
- return calculate_reward(args[0], args[1])
43
  return 0.5
44
  except Exception:
45
  return 0.5
 
1
+ import os
2
+ import json
3
+ from openai import OpenAI
4
  from .models import ExecutionResult, TaskInfo
5
 
 
6
  def force_valid_reward(value) -> float:
7
  """Hard guarantee: reward is strictly in (0, 1) — never 0 or 1, no exceptions."""
8
  try:
 
18
 
19
  return r
20
 
 
21
  def safe_reward(reward) -> float:
22
  """Clamp reward to open interval (0, 1) via force_valid_reward."""
23
  if reward is None:
24
  reward = 0.5
25
  return force_valid_reward(reward)
26
 
 
27
  def normalize_reward(passed: int, total: int) -> float:
28
  if total == 0:
29
  return 0.5
30
  raw = passed / total
31
  return force_valid_reward(raw)
32
 
33
+ _LLM_CACHE = {}
34
 
35
+ def get_llm_quality_score(proposed_fix: str) -> dict:
36
+ if proposed_fix in _LLM_CACHE:
37
+ return _LLM_CACHE[proposed_fix]
38
+
39
+ try:
40
+ client = OpenAI()
41
+ response = client.chat.completions.create(
42
+ model=os.environ.get("JUDGE_MODEL", "gpt-4o-mini"),
43
+ messages=[
44
+ {"role": "system", "content": "You are a code judge. Evaluate the provided Python code on a scale of 0.0 to 1.0 for three metrics: code_quality, security, and correctness. Respond with JSON format strictly matching: {\"code_quality\": 0.0, \"security\": 0.0, \"correctness\": 0.0}"},
45
+ {"role": "user", "content": proposed_fix}
46
+ ],
47
+ response_format={"type": "json_object"}
48
+ )
49
+ result = json.loads(response.choices[0].message.content)
50
+ _LLM_CACHE[proposed_fix] = result
51
+ return result
52
+ except Exception as e:
53
+ print(f"LLM judge error: {e}")
54
+ fallback = {"code_quality": 0.5, "security": 0.5, "correctness": 0.5}
55
+ _LLM_CACHE[proposed_fix] = fallback
56
+ return fallback
57
+
58
+ def calculate_reward_components(exec_result: ExecutionResult, task_info: TaskInfo, proposed_fix: str) -> dict:
59
+ compile_score = 1.0 if not exec_result.runtime_errors else 0.0
60
+
61
+ test_ratio = 0.0
62
+ if exec_result.test_total > 0:
63
+ test_ratio = exec_result.test_passed / exec_result.test_total
64
+
65
+ efficiency = 0.0
66
+ if test_ratio == 1.0:
67
+ if exec_result.execution_time_seconds <= task_info.optimal_time_seconds:
68
+ efficiency = 1.0
69
+ else:
70
+ ratio = exec_result.execution_time_seconds / max(0.001, task_info.optimal_time_seconds)
71
+ efficiency = max(0.0, 1.0 - (ratio - 1.0) / 2.0)
72
+
73
+ llm_scores = get_llm_quality_score(proposed_fix)
74
+
75
+ return {
76
+ "compile_score": compile_score,
77
+ "test_ratio": test_ratio,
78
+ "efficiency": efficiency,
79
+ "llm_correctness": float(llm_scores.get("correctness", 0.5)),
80
+ "llm_security": float(llm_scores.get("security", 0.5)),
81
+ "llm_quality": float(llm_scores.get("code_quality", 0.5))
82
+ }
83
 
84
+ def calculate_reward(exec_result: ExecutionResult, task_info: TaskInfo, proposed_fix: str) -> tuple[float, dict]:
85
+ comps = calculate_reward_components(exec_result, task_info, proposed_fix)
86
+ base_reward = (
87
+ 0.25 * comps["compile_score"] +
88
+ 0.30 * comps["test_ratio"] +
89
+ 0.15 * comps["efficiency"] +
90
+ 0.15 * comps["llm_correctness"] +
91
+ 0.10 * comps["llm_security"] +
92
+ 0.05 * comps["llm_quality"]
93
+ )
94
+ return base_reward, comps
95
 
96
  def grade(*args, **kwargs) -> float:
97
  try:
98
+ if len(args) == 3:
99
+ return calculate_reward(args[0], args[1], args[2])[0]
100
  return 0.5
101
  except Exception:
102
  return 0.5
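Note on the new reward: `calculate_reward` is a weighted sum of six components, and the environment's `step()` then subtracts a per-step penalty and a repeat-submission penalty before clamping. The arithmetic can be checked with the standalone sketch below. The weights, the efficiency formula and the penalties are copied from the hunks in this commit; the names `WEIGHTS`, `efficiency_score`, `base_reward` and `shaped_reward` are hypothetical and exist only in this sketch.

```python
# Weights copied from calculate_reward in the diff; they sum to 1.0.
WEIGHTS = {
    "compile_score": 0.25,
    "test_ratio": 0.30,
    "efficiency": 0.15,
    "llm_correctness": 0.15,
    "llm_security": 0.10,
    "llm_quality": 0.05,
}


def efficiency_score(elapsed, optimal):
    """Full credit at or under the optimal time, falling linearly to 0 at 3x."""
    if elapsed <= optimal:
        return 1.0
    ratio = elapsed / max(0.001, optimal)
    return max(0.0, 1.0 - (ratio - 1.0) / 2.0)


def base_reward(components):
    """Weighted sum of the six components, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def shaped_reward(base, step_count, is_repeat):
    """Apply the step and repeat penalties from step(), then clamp."""
    penalty = 0.02 * step_count + (0.1 if is_repeat else 0.0)
    return max(0.001, min(0.999, base - penalty))


# A fix that compiles, passes every test at optimal speed, but whose
# LLM-judge call failed and fell back to 0.5 on all three judge metrics.
judge_failed = {
    "compile_score": 1.0, "test_ratio": 1.0, "efficiency": 1.0,
    "llm_correctness": 0.5, "llm_security": 0.5, "llm_quality": 0.5,
}
reward_without_judge = base_reward(judge_failed)
```

Two consequences seem worth flagging, both hedged on how the surrounding code behaves. First, a judge failure caps an otherwise perfect fix at about 0.85. Second, if `step_count` is incremented before scoring (as it was in the deleted `server/env.py`), the best possible first-step reward is about 0.98, so the `final_reward > 0.99` early-exit condition would appear to be unreachable and episodes would end only at `max_steps`.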
tasks/__init__.py CHANGED
@@ -1,5 +1,11 @@
1
  from .easy import EASY_TASK
2
  from .medium import MEDIUM_TASK
3
  from .hard import HARD_TASK
4
 
5
- ALL_TASKS = [EASY_TASK, MEDIUM_TASK, HARD_TASK]
 
1
  from .easy import EASY_TASK
2
  from .medium import MEDIUM_TASK
3
  from .hard import HARD_TASK
4
+ from .type_errors.type_error_1 import TASK as TE1
5
+ from .type_errors.type_error_2 import TASK as TE2
6
+ from .type_errors.type_error_3 import TASK as TE3
7
+ from .security_bugs.security_bug_1 import TASK as SB1
8
+ from .security_bugs.security_bug_2 import TASK as SB2
9
+ from .security_bugs.security_bug_3 import TASK as SB3
10
 
11
+ ALL_TASKS = [EASY_TASK, MEDIUM_TASK, HARD_TASK, TE1, TE2, TE3, SB1, SB2, SB3]
tasks/security_bugs/security_bug_1.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "task_id": "security_bugs-1",
3
+ "difficulty": "security_bugs",
4
+ "description": "Fix the function to parse JSON safely without using eval().",
5
+ "buggy_code": "import json\ndef parse_user_data(data_string):\n return eval(data_string)",
6
+ "test_code": "\nimport unittest\nimport inspect\nclass TestSecurity1(unittest.TestCase):\n def test_normal(self):\n self.assertEqual(parse_user_data('{\"name\": \"alice\"}'), {\"name\": \"alice\"})\n def test_security(self):\n source = inspect.getsource(parse_user_data)\n self.assertNotIn(\"eval(\", source)\n",
7
+ "optimal_time_seconds": 0.05
8
+ }
tasks/security_bugs/security_bug_1.py ADDED
@@ -0,0 +1,21 @@
1
+ from server.models import TaskInfo
2
+
3
+ TASK = TaskInfo(
4
+ task_id="security_bugs-1",
5
+ difficulty="security_bugs",
6
+ description="Fix the function to parse JSON safely without using eval().",
7
+ buggy_code="""import json
8
+ def parse_user_data(data_string):
9
+ return eval(data_string)""",
10
+ test_code="""
11
+ import unittest
12
+ import inspect
13
+ class TestSecurity1(unittest.TestCase):
14
+ def test_normal(self):
15
+ self.assertEqual(parse_user_data('{"name": "alice"}'), {"name": "alice"})
16
+ def test_security(self):
17
+ source = inspect.getsource(parse_user_data)
18
+ self.assertNotIn("eval(", source)
19
+ """,
20
+ optimal_time_seconds=0.05
21
+ )
tasks/security_bugs/security_bug_2.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "task_id": "security_bugs-2",
3
+ "difficulty": "security_bugs",
4
+ "description": "Remove the hardcoded secret token and load it from the os.environ dictionary as 'API_TOKEN'.",
5
+ "buggy_code": "import os\ndef get_api_token():\n token = \"secret_12345\"\n return token",
6
+ "test_code": "\nimport unittest\nimport inspect\nimport os\nclass TestSecurity2(unittest.TestCase):\n def test_normal(self):\n os.environ['API_TOKEN'] = 'my_secure_token'\n self.assertEqual(get_api_token(), 'my_secure_token')\n def test_security(self):\n source = inspect.getsource(get_api_token)\n self.assertNotIn(\"secret_12345\", source)\n",
7
+ "optimal_time_seconds": 0.05
8
+ }
tasks/security_bugs/security_bug_2.py ADDED
@@ -0,0 +1,24 @@
1
+ from server.models import TaskInfo
2
+
3
+ TASK = TaskInfo(
4
+ task_id="security_bugs-2",
5
+ difficulty="security_bugs",
6
+ description="Remove the hardcoded secret token and load it from the os.environ dictionary as 'API_TOKEN'.",
7
+ buggy_code="""import os
8
+ def get_api_token():
9
+ token = "secret_12345"
10
+ return token""",
11
+ test_code="""
12
+ import unittest
13
+ import inspect
14
+ import os
15
+ class TestSecurity2(unittest.TestCase):
16
+ def test_normal(self):
17
+ os.environ['API_TOKEN'] = 'my_secure_token'
18
+ self.assertEqual(get_api_token(), 'my_secure_token')
19
+ def test_security(self):
20
+ source = inspect.getsource(get_api_token)
21
+ self.assertNotIn("secret_12345", source)
22
+ """,
23
+ optimal_time_seconds=0.05
24
+ )
tasks/security_bugs/security_bug_3.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "task_id": "security_bugs-3",
3
+ "difficulty": "security_bugs",
4
+ "description": "Fix the ping command to avoid shell injection. Use a list of arguments and shell=False.",
5
+ "buggy_code": "import subprocess\ndef ping_host(host):\n return subprocess.check_output(f\"ping -c 1 {host}\", shell=True)",
6
+ "test_code": "\nimport unittest\nimport inspect\nclass TestSecurity3(unittest.TestCase):\n def test_security(self):\n source = inspect.getsource(ping_host)\n self.assertNotIn(\"shell=True\", source.replace(\" \", \"\"))\n self.assertIn(\"[\", source)\n",
7
+ "optimal_time_seconds": 0.05
8
+ }
tasks/security_bugs/security_bug_3.py ADDED
@@ -0,0 +1,20 @@
1
+ from server.models import TaskInfo
2
+
3
+ TASK = TaskInfo(
4
+ task_id="security_bugs-3",
5
+ difficulty="security_bugs",
6
+ description="Fix the ping command to avoid shell injection. Use a list of arguments and shell=False.",
7
+ buggy_code="""import subprocess
8
+ def ping_host(host):
9
+ return subprocess.check_output(f"ping -c 1 {host}", shell=True)""",
10
+ test_code="""
11
+ import unittest
12
+ import inspect
13
+ class TestSecurity3(unittest.TestCase):
14
+ def test_security(self):
15
+ source = inspect.getsource(ping_host)
16
+ self.assertNotIn("shell=True", source.replace(" ", ""))
17
+ self.assertIn("[", source)
18
+ """,
19
+ optimal_time_seconds=0.05
20
+ )
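Note on the three `security_bugs` tasks: their tests combine a behavioural check with a source-text check via `inspect.getsource`, so a candidate fix must both work and avoid the flagged substring anywhere in the function body, comments included. The functions below are illustrative candidate fixes written for this note, not solutions shipped in the commit. They assume a Unix-style `ping` that accepts `-c`, as the buggy code does.

```python
import json
import os
import subprocess


# security_bugs-1: parse with the json module instead of evaluating the string.
def parse_user_data(data_string):
    return json.loads(data_string)


# security_bugs-2: read the token from the environment at call time.
def get_api_token():
    return os.environ["API_TOKEN"]


# security_bugs-3: pass an argument list so no shell ever interprets `host`.
def ping_host(host):
    return subprocess.check_output(["ping", "-c", "1", host], shell=False)
```

The test for `security_bugs-3` only inspects the source and never runs `ping_host`, so it confirms the shape of the call rather than its behaviour.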
tasks/type_errors/type_error_1.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "task_id": "type_errors-1",
3
+ "difficulty": "type_errors",
4
+ "description": "Fix the function to sum a list of numbers that might be passed as strings. It currently tries to add int and str.",
5
+ "buggy_code": "def sum_all(items):\n total = 0\n for item in items:\n total = total + item\n return total",
6
+ "test_code": "\nimport unittest\nclass TestTypeError1(unittest.TestCase):\n def test_normal(self):\n self.assertEqual(sum_all([1, 2, 3]), 6)\n def test_strings(self):\n self.assertEqual(sum_all(['1', '2', '3']), 6)\n def test_mixed(self):\n self.assertEqual(sum_all([1, '2', 3]), 6)\n",
7
+ "optimal_time_seconds": 0.05
8
+ }
tasks/type_errors/type_error_1.py ADDED
@@ -0,0 +1,23 @@
1
+ from server.models import TaskInfo
2
+
3
+ TASK = TaskInfo(
4
+ task_id="type_errors-1",
5
+ difficulty="type_errors",
6
+ description="Fix the function to sum a list of numbers that might be passed as strings. It currently tries to add int and str.",
7
+ buggy_code="""def sum_all(items):
8
+ total = 0
9
+ for item in items:
10
+ total = total + item
11
+ return total""",
12
+ test_code="""
13
+ import unittest
14
+ class TestTypeError1(unittest.TestCase):
15
+ def test_normal(self):
16
+ self.assertEqual(sum_all([1, 2, 3]), 6)
17
+ def test_strings(self):
18
+ self.assertEqual(sum_all(['1', '2', '3']), 6)
19
+ def test_mixed(self):
20
+ self.assertEqual(sum_all([1, '2', 3]), 6)
21
+ """,
22
+ optimal_time_seconds=0.05
23
+ )
tasks/type_errors/type_error_2.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "task_id": "type_errors-2",
3
+ "difficulty": "type_errors",
4
+ "description": "Fix the function to count frequencies. It incorrectly calls .append() on a dict.",
5
+ "buggy_code": "def count_frequencies(words):\n counts = {}\n for word in words:\n if word not in counts:\n counts.append({word: 1})\n else:\n counts[word] += 1\n return counts",
6
+ "test_code": "\nimport unittest\nclass TestTypeError2(unittest.TestCase):\n def test_normal(self):\n self.assertEqual(count_frequencies(['apple', 'banana', 'apple']), {'apple': 2, 'banana': 1})\n def test_empty(self):\n self.assertEqual(count_frequencies([]), {})\n",
7
+ "optimal_time_seconds": 0.05
8
+ }
tasks/type_errors/type_error_2.py ADDED
@@ -0,0 +1,24 @@
1
+ from server.models import TaskInfo
2
+
3
+ TASK = TaskInfo(
4
+ task_id="type_errors-2",
5
+ difficulty="type_errors",
6
+ description="Fix the function to count frequencies. It incorrectly calls .append() on a dict.",
7
+ buggy_code="""def count_frequencies(words):
8
+ counts = {}
9
+ for word in words:
10
+ if word not in counts:
11
+ counts.append({word: 1})
12
+ else:
13
+ counts[word] += 1
14
+ return counts""",
15
+ test_code="""
16
+ import unittest
17
+ class TestTypeError2(unittest.TestCase):
18
+ def test_normal(self):
19
+ self.assertEqual(count_frequencies(['apple', 'banana', 'apple']), {'apple': 2, 'banana': 1})
20
+ def test_empty(self):
21
+ self.assertEqual(count_frequencies([]), {})
22
+ """,
23
+ optimal_time_seconds=0.05
24
+ )
tasks/type_errors/type_error_3.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "task_id": "type_errors-3",
3
+ "difficulty": "type_errors",
4
+ "description": "Fix the function to format names. It incorrectly calls .upper() on an int ID.",
5
+ "buggy_code": "def format_records(records):\n formatted = []\n for user_id, name in records:\n formatted.append(f\"{user_id.upper()} - {name.upper()}\")\n return formatted",
6
+ "test_code": "\nimport unittest\nclass TestTypeError3(unittest.TestCase):\n def test_normal(self):\n self.assertEqual(format_records([(1, 'alice'), (2, 'bob')]), ['1 - ALICE', '2 - BOB'])\n",
7
+ "optimal_time_seconds": 0.05
8
+ }
tasks/type_errors/type_error_3.py ADDED
@@ -0,0 +1,19 @@
1
+ from server.models import TaskInfo
2
+
3
+ TASK = TaskInfo(
4
+ task_id="type_errors-3",
5
+ difficulty="type_errors",
6
+ description="Fix the function to format names. It incorrectly calls .upper() on an int ID.",
7
+ buggy_code="""def format_records(records):
8
+ formatted = []
9
+ for user_id, name in records:
10
+ formatted.append(f"{user_id.upper()} - {name.upper()}")
11
+ return formatted""",
12
+ test_code="""
13
+ import unittest
14
+ class TestTypeError3(unittest.TestCase):
15
+ def test_normal(self):
16
+ self.assertEqual(format_records([(1, 'alice'), (2, 'bob')]), ['1 - ALICE', '2 - BOB'])
17
+ """,
18
+ optimal_time_seconds=0.05
19
+ )
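Note on the three `type_errors` tasks: each buggy function fails with a `TypeError` or `AttributeError` on the inputs used by its tests. The functions below are illustrative candidate fixes written for this note, not part of the commit. `sum_all` assumes the string inputs are integer-valued, which holds for the task's tests but would not cover values such as `'1.5'`.

```python
# type_errors-1: coerce every item to int before adding.
def sum_all(items):
    total = 0
    for item in items:
        total += int(item)
    return total


# type_errors-2: dicts have no append(); assign by key instead.
def count_frequencies(words):
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# type_errors-3: only the name is a string; format the int ID as-is.
def format_records(records):
    return [f"{user_id} - {name.upper()}" for user_id, name in records]
```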
train_grpo.ipynb ADDED
@@ -0,0 +1,138 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# GRPO Training with CodeArena RL Benchmark\n",
8
+ "\n",
9
+ "This notebook demonstrates how to connect our custom `codearena-rl-benchmark` environment to HuggingFace's `trl.GRPOTrainer`."
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "code",
14
+ "execution_count": null,
15
+ "metadata": {},
16
+ "outputs": [],
17
+ "source": [
18
+ "!pip install trl transformers datasets openenv-py httpx\n",
19
+ "!git clone https://github.com/havinashpatil/meta.git\n",
20
+ "!cd meta && pip install -r requirements.txt"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "code",
25
+ "execution_count": null,
26
+ "metadata": {},
27
+ "outputs": [],
28
+ "source": [
29
+ "import torch\n",
30
+ "from datasets import Dataset\n",
31
+ "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
32
+ "from trl import GRPOConfig, GRPOTrainer\n",
33
+ "import httpx\n",
34
+ "\n",
35
+ "# Start the backend server in the background (Colab trick)\n",
36
+ "import subprocess\n",
37
+ "import time\n",
38
+ "subprocess.Popen([\"uvicorn\", \"server.app:app\", \"--port\", \"7860\", \"--app-dir\", \"meta\"])\n",
39
+ "time.sleep(5) # Wait for server to start"
40
+ ]
41
+ },
42
+ {
43
+ "cell_type": "code",
44
+ "execution_count": null,
45
+ "metadata": {},
46
+ "outputs": [],
47
+ "source": [
48
+ "def codearena_reward_func(prompts, completions, **kwargs):\n",
49
+ " \"\"\"\n",
50
+ " Reward function that queries the CodeArena OpenEnv server.\n",
51
+ " For each proposed fix in `completions`, we step the environment.\n",
52
+ " \"\"\"\n",
53
+ " rewards = []\n",
54
+ " for completion in completions:\n",
55
+ " # Clean the generated code\n",
56
+ " proposed_fix = (completion if isinstance(completion, str) else completion[0].get('content', '')).strip()\n",
57
+ " if proposed_fix.startswith('```python'):\n",
58
+ " proposed_fix = proposed_fix[9:].replace('```', '').strip()\n",
59
+ " \n",
60
+ " try:\n",
61
+ " # Step the environment\n",
62
+ " res = httpx.post(\n",
63
+ " \"http://localhost:7860/step\",\n",
64
+ " json={\"proposed_fix\": proposed_fix},\n",
65
+ " timeout=10.0\n",
66
+ " )\n",
67
+ " res.raise_for_status()\n",
68
+ " reward = res.json().get('reward', 0.0)\n",
69
+ " rewards.append(reward)\n",
70
+ " except Exception as e:\n",
71
+ " print(f\"Env Error: {e}\")\n",
72
+ " rewards.append(0.0)\n",
73
+ " \n",
74
+ " return rewards"
75
+ ]
76
+ },
77
+ {
78
+ "cell_type": "code",
79
+ "execution_count": null,
80
+ "metadata": {},
81
+ "outputs": [],
82
+ "source": [
83
+ "# Load Model\n",
84
+ "model_name = \"Qwen/Qwen2.5-Coder-1.5B\"\n",
85
+ "model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map=\"auto\")\n",
86
+ "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
87
+ "tokenizer.pad_token = tokenizer.eos_token\n",
88
+ "\n",
89
+ "# Sample training dataset (prompts extracted from tasks)\n",
90
+ "# In a real setup, you'd reset the env for each prompt to get the initial buggy_code.\n",
91
+ "dataset = Dataset.from_dict({\n",
92
+ " \"prompt\": [\n",
93
+ " \"Fix this Python code:\\ndef average_list(numbers)\\n if length(numbers) == 0:\\n return 0\\n return sum(numbers) / length(numbers)\"\n",
94
+ " ]\n",
95
+ "})\n",
96
+ "\n",
97
+ "# Initialize GRPO Trainer\n",
98
+ "training_args = GRPOConfig(\n",
99
+ " output_dir=\"./codearena-grpo\",\n",
100
+ " learning_rate=1e-5,\n",
101
+ " max_steps=50,\n",
102
+ " per_device_train_batch_size=2,\n",
103
+ " gradient_accumulation_steps=2,\n",
104
+ ")\n",
105
+ "\n",
106
+ "trainer = GRPOTrainer(\n",
107
+ " model=model,\n",
108
+ " reward_funcs=codearena_reward_func,\n",
109
+ " args=training_args,\n",
110
+ " train_dataset=dataset,\n",
111
+ ")\n",
112
+ "\n",
113
+ "trainer.train()"
114
+ ]
115
+ }
116
+ ],
117
+ "metadata": {
118
+ "kernelspec": {
119
+ "display_name": "Python 3",
120
+ "language": "python",
121
+ "name": "python3"
122
+ },
123
+ "language_info": {
124
+ "codemirror_mode": {
125
+ "name": "ipython",
126
+ "version": 3
127
+ },
128
+ "file_extension": ".py",
129
+ "mimetype": "text/x-python",
130
+ "name": "python",
131
+ "nbconvert_exporter": "python",
132
+ "pygments_lexer": "ipython3",
133
+ "version": "3.10.12"
134
+ }
135
+ },
136
+ "nbformat": 4,
137
+ "nbformat_minor": 4
138
+ }
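Note on the notebook's reward function: the dataset uses a plain-string prompt, in which case each completion is expected to arrive as a string, while conversational prompts yield a list of message dicts. The cleaning step also only strips a fence that opens the completion. A more tolerant extraction step might look like the sketch below. `completion_text` and `extract_code` are hypothetical helpers, not part of the commit or of `trl`, and the fence marker is built with string repetition purely to keep this block easy to embed.

```python
import re

FENCE = "`" * 3  # the Markdown code-fence marker

# Matches the first fenced block, with or without a "python"/"py" tag.
_FENCED_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def completion_text(completion):
    """Return the raw text of a completion in either supported shape."""
    if isinstance(completion, str):
        return completion
    # Conversational shape: a list of {"role": ..., "content": ...} dicts.
    return completion[0].get("content", "")


def extract_code(completion):
    """Return the first fenced block if present, else the stripped text."""
    text = completion_text(completion).strip()
    match = _FENCED_BLOCK.search(text)
    return match.group(1).strip() if match else text
```

The result of `extract_code` can then be sent as `proposed_fix` to `/step`. Since the server ends an episode after `max_steps`, a training loop would presumably also need to call `/reset` between groups of completions; the notebook's own comment points in the same direction.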