name: code-review-env
version: "1.0.0"
description: >
  An OpenEnv-compliant AI training environment that simulates professional
  Python code review. Agents learn to identify bugs, security vulnerabilities,
  performance issues, style problems, and documentation gaps across three
  progressively harder tasks.
tags:
  - openenv
  - code-review
  - python
  - security
  - software-engineering
author: imaginephoenix / rawgenn.tech
license: MIT

environment:
  class: CodeReviewEnv
  module: env.environment
  entrypoint: app.py
  framework: fastapi

observation_space:
  type: object
  description: >
    What the agent sees each step. Contains the code snippet to review,
    task instructions, all previously submitted comments, and optional
    feedback from the last step.
  fields:
    task_id:
      type: string
      description: Identifier of the active task
    step:
      type: integer
      description: Current step number (0-indexed)
    snippet:
      type: object
      description: Python source code to review
      fields:
        file_name: { type: string }
        source: { type: string, description: "Full Python source with line numbers" }
        language: { type: string, const: "python" }
    instructions:
      type: string
      description: Review instructions and scope for this task
    previous_comments:
      type: array
      description: All review comments submitted in prior steps
    feedback:
      type: string
      nullable: true
      description: Environment feedback on the most recent action
    done:
      type: boolean

action_space:
  type: object
  description: >
    What the agent submits. A list of review comments (each with line,
    category, severity, message, optional suggestion) plus an optional
    overall summary and a submit flag.
  fields:
    comments:
      type: array
      items:
        type: object
        fields:
          line: { type: integer, nullable: true, description: "1-indexed line number" }
          category:
            type: string
            enum: [bug, security, performance, style, documentation]
          severity:
            type: string
            enum: [low, medium, high, critical]
          message: { type: string, minLength: 5, maxLength: 500 }
          suggestion: { type: string, nullable: true, maxLength: 500 }
    summary:
      type: string
      nullable: true
      description: "Required for task_3_hard; optional otherwise"
    submit:
      type: boolean
      description: "Set true to finalise the review and trigger the grader"

reward:
  type: float
  range: [-1.0, 1.0]
  description: >
    Shaped reward with partial progress signals. Incremental positive reward
    for each new valid comment added (proportional to issue severity). On
    submit: final grader score mapped to [-0.2, 1.0]. Penalties for false
    positives, missed criticals, and spamming low-quality comments.

tasks:
  - id: task_1_easy
    title: "Bug Detection & Style Review"
    difficulty: easy
    categories: [bug, style]
    max_steps: 5
    passing_threshold: 0.55
    description: >
      Review calculator.py (31 lines) for division-by-zero bugs, off-by-one
      errors, empty-collection crashes, and Python style anti-patterns.
  - id: task_2_medium
    title: "Security & Performance Audit"
    difficulty: medium
    categories: [security, performance]
    max_steps: 7
    passing_threshold: 0.60
    description: >
      Audit user_service.py (55 lines) for SQL injection, broken MD5 password
      hashing, unbounded DB queries, and connection churn. Missed critical
      security issues carry heavy penalties.
  - id: task_3_hard
    title: "Comprehensive Code Review"
    difficulty: hard
    categories: [bug, security, performance, style, documentation]
    max_steps: 10
    passing_threshold: 0.65
    description: >
      Full production-grade review of data_pipeline.py (49 lines). Covers
      all five categories including shell injection, unsafe pickle
      deserialization, ZeroDivisionError, and missing docstrings. An overall
      written summary is required.

api_endpoints:
  - path: /reset
    method: POST
    description: Start or restart an episode
  - path: /step
    method: POST
    description: Submit an action
  - path: /state
    method: GET
    description: Get full serialisable state
  - path: /tasks
    method: GET
    description: List all available tasks
  - path: /health
    method: GET
    description: Health check

baseline:
  model: gpt-4o
  script: baseline_agent.py
  expected_scores:
    task_1_easy: ~0.75
    task_2_medium: ~0.65
    task_3_hard: ~0.55

docker:
  base_image: python:3.11-slim
  port: 7860
  build: docker build -t code-review-env .
  run: docker run -p 7860:7860 code-review-env

huggingface:
  space_sdk: docker
  tags: [openenv, code-review, ai-agent, evaluation]
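The action_space section above can be checked on the client side before an action is posted to /step. The following is a minimal sketch of such a check, not the environment's own validation: the function names `validate_comment` and `validate_action` are hypothetical, the rule that `line` must be at least 1 is inferred from the "1-indexed" description, and enforcing the task_3_hard summary only when `submit` is true is an assumption.

```python
from typing import Any

# Enum values copied from the action_space section of the manifest.
CATEGORIES = ("bug", "security", "performance", "style", "documentation")
SEVERITIES = ("low", "medium", "high", "critical")


def validate_comment(comment: dict[str, Any]) -> list[str]:
    """Return the schema violations found in one review comment (hypothetical helper)."""
    errors = []

    line = comment.get("line")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if line is not None and (
        isinstance(line, bool) or not isinstance(line, int) or line < 1
    ):
        errors.append("line must be null or an integer >= 1")

    if comment.get("category") not in CATEGORIES:
        errors.append("category must be one of: " + ", ".join(CATEGORIES))
    if comment.get("severity") not in SEVERITIES:
        errors.append("severity must be one of: " + ", ".join(SEVERITIES))

    message = comment.get("message")
    if not isinstance(message, str) or not 5 <= len(message) <= 500:
        errors.append("message must be a string of 5 to 500 characters")

    suggestion = comment.get("suggestion")
    if suggestion is not None and (
        not isinstance(suggestion, str) or len(suggestion) > 500
    ):
        errors.append("suggestion must be null or a string of at most 500 characters")

    return errors


def validate_action(action: dict[str, Any], task_id: str) -> list[str]:
    """Return the schema violations found in a whole action (hypothetical helper)."""
    errors = []

    comments = action.get("comments", [])
    if not isinstance(comments, list):
        return ["comments must be an array"]
    for index, comment in enumerate(comments):
        if not isinstance(comment, dict):
            errors.append(f"comments[{index}]: must be an object")
            continue
        errors.extend(f"comments[{index}]: {e}" for e in validate_comment(comment))

    submit = action.get("submit", False)
    if not isinstance(submit, bool):
        errors.append("submit must be a boolean")

    # Assumption: the summary requirement is only enforced on the final action.
    if task_id == "task_3_hard" and submit is True and not action.get("summary"):
        errors.append("summary is required for task_3_hard on submit")

    return errors


if __name__ == "__main__":
    # Illustrative action; the comment text is made up for this example.
    example = {
        "comments": [
            {
                "line": 12,
                "category": "bug",
                "severity": "high",
                "message": "Division by zero when the divisor is 0",
                "suggestion": None,
            }
        ],
        "summary": None,
        "submit": True,
    }
    for problem in validate_action(example, "task_3_hard"):
        print(problem)
```

Returning a list of messages rather than raising on the first failure lets an agent loop repair every problem in one pass before it spends a step.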