# TASKS.md: Environment Task Definition
> Senior OpenEnv Engineer Rule: **Real-world task. No toys. No games.**
---
## 🎯 Chosen Environment: Code Review Environment
**Name:** `code_review_env`
**Domain:** Software Engineering / Developer Tooling
**Task:** The agent acts as a code reviewer. Given a code diff or snippet, the agent must produce a structured, high-quality review identifying bugs, style issues, and improvement suggestions.
---
## Why This Is Real-World ✅
- Code review is a high-value, daily engineering task
- Clear, measurable correctness signals (bug found / not found, severity match)
- Rich feedback loop: agent learns what good reviews look like
- Direct production utility: can be deployed in CI/CD pipelines
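
The correctness signals listed above (bug found or not, severity match) could be folded into a single scalar reward. The following is a minimal sketch under stated assumptions, not the environment's actual scoring: the function name `score_review`, the exact-match F1 over issue labels, and the 0.8/0.2 weighting are all hypothetical choices made for illustration.

```python
from typing import List


def score_review(
    predicted_issues: List[str],
    true_issues: List[str],
    predicted_severity: str,
    true_severity: str,
    severity_weight: float = 0.2,  # assumed weighting, not taken from the spec
) -> float:
    """Hypothetical reward: F1 over normalized issue labels plus a severity bonus."""
    predicted = {issue.strip().lower() for issue in predicted_issues}
    truth = {issue.strip().lower() for issue in true_issues}

    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)

    severity_match = 1.0 if predicted_severity == true_severity else 0.0
    return (1 - severity_weight) * f1 + severity_weight * severity_match
```

Exact string matching is a deliberate simplification here; free-text issue descriptions would likely need fuzzy or label-based matching instead.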
---
## Episode Structure
```
reset()
  │
  └── Agent receives: code snippet + task context
        (e.g., language, PR description, critical path flag)
step(action)
  │
  ├── Agent sends: structured review
  │     (issues: List[Issue], summary: str, severity: Severity)
  └── Environment returns:
        reward (float), feedback (str), done (bool)
```
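
The `reset()` / `step(action)` cycle above can be illustrated with a small in-memory stand-in. This is only a sketch: `ToyCodeReviewEnv` and `Review` are hypothetical names, the class does not use the OpenEnv server API, and the reward (fraction of known issues matched) is an assumed placeholder.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Review:
    """Hypothetical structured review sent by the agent."""
    issues: List[str]
    summary: str
    severity: str


class ToyCodeReviewEnv:
    """Hypothetical in-memory stand-in; not the OpenEnv server API."""

    def __init__(self, snippets: List[Tuple[str, List[str]]]):
        self.snippets = snippets  # (code, ground-truth issue labels)
        self.index = 0

    def reset(self) -> str:
        """Start a new episode and return the first code snippet."""
        self.index = 0
        return self.snippets[0][0]

    def step(self, review: Review) -> Tuple[float, str, bool]:
        """Score the review for the current snippet, then advance."""
        truth = set(self.snippets[self.index][1])
        found = truth & set(review.issues)
        reward = len(found) / len(truth) if truth else 0.0
        feedback = f"matched {len(found)} of {len(truth)} known issues"
        self.index += 1
        done = self.index >= len(self.snippets)
        return reward, feedback, done


env = ToyCodeReviewEnv([("for i in range(len(xs) + 1): print(xs[i])", ["off-by-one"])])
snippet = env.reset()
reward, feedback, done = env.step(
    Review(issues=["off-by-one"], summary="Loop overruns the list.", severity="medium")
)
print(reward, feedback, done)  # prints: 1.0 matched 1 of 1 known issues True
```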
---
## Action Space
```python
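from dataclasses import dataclass
from typing import Any, Dict, List

# Action is the OpenEnv base class; import it from your installed OpenEnv package.

@dataclass
class CodeReviewAction(Action):
    issues: List[str]          # List of identified issues
    summary: str               # Overall review summary
    severity: str              # "low" | "medium" | "high" | "critical"
    metadata: Dict[str, Any]   # Optional extra context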
```
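
Because `severity` is a plain `str` restricted to four values, the environment may want to validate it before scoring. A minimal sketch, assuming a hypothetical `Severity` enum and `parse_severity` helper that are not part of the definition above:

```python
from enum import Enum


class Severity(str, Enum):
    """Hypothetical enum covering the four allowed severity strings."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


def parse_severity(raw: str) -> Severity:
    """Normalize and validate a severity string; raises ValueError if unknown."""
    return Severity(raw.strip().lower())
```

Rejecting unknown values early keeps a malformed action from being silently scored as a severity mismatch.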
---
## Observation Space
```python
from dataclasses import dataclass
from typing import List

# Observation is the OpenEnv base class; import it from your installed OpenEnv package.

@dataclass
class CodeReviewObservation(Observation):
    done: bool
    reward: float
    code_snippet: str               # Code to review (current step)
    language: str                   # e.g., "python", "javascript"
    context: str                    # PR description or task context
    ground_truth_issues: List[str]  # Hidden during training rollout
    feedback: str                   # Human-readable feedback on last action
    step_number: int
```
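
Since `ground_truth_issues` is meant to stay hidden during training rollouts, one option is to blank it before the observation reaches the agent. The sketch below assumes a hypothetical `ObservationSketch` stand-in with a subset of the fields and a hypothetical `redact_for_agent` helper; how the real environment hides the labels is not specified here.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass
class ObservationSketch:
    """Hypothetical stand-in carrying a subset of the observation fields."""
    code_snippet: str
    language: str
    context: str
    ground_truth_issues: List[str] = field(default_factory=list)


def redact_for_agent(obs: ObservationSketch) -> ObservationSketch:
    """Return a copy with the ground-truth labels removed; the input is untouched."""
    return replace(obs, ground_truth_issues=[])
```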
---
## State
```python
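from dataclasses import dataclass
from typing import Optional

# State is the OpenEnv base class; import it from your installed OpenEnv package.

@dataclass
class CodeReviewState(State):
    episode_id: Optional[str]
    step_count: int
    total_snippets: int       # How many snippets in this episode
    cumulative_reward: float
    language: str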
```
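
The state fields suggest a simple per-step update: increment `step_count`, add to `cumulative_reward`, and finish once every snippet has been reviewed. A minimal sketch, assuming a hypothetical `StateSketch` stand-in and `advance` helper; the termination rule (`step_count >= total_snippets`) is an assumption, not something the definition above states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StateSketch:
    """Hypothetical stand-in mirroring the fields of CodeReviewState."""
    episode_id: Optional[str]
    step_count: int
    total_snippets: int
    cumulative_reward: float
    language: str


def advance(state: StateSketch, reward: float) -> bool:
    """Record one step's reward; return True when the episode is finished."""
    state.step_count += 1
    state.cumulative_reward += reward
    return state.step_count >= state.total_snippets
```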
---
## Episode Flow
| Step | Agent Receives | Agent Sends | Env Returns |
|------|---------------|-------------|-------------|
| 1 | Code snippet #1 + context | Structured review | Reward + feedback |
| 2 | Code snippet #2 (harder) | Structured review | Reward + feedback |
| … | … | … | … |
| N | Final snippet | Final review | Terminal reward, done=True |
---
## Data Sources
- [CodeSearchNet](https://github.com/github/CodeSearchNet): multi-language code samples
- Synthetic bug injection (off-by-one, null dereference, SQL injection, etc.)
- Human-curated review gold standards (severity labels)
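
Synthetic bug injection can be as simple as a targeted source rewrite that also records the label of the bug it introduced. The sketch below covers only the off-by-one case for Python snippets; the helper name `inject_off_by_one` and the regex-based approach are assumptions for illustration, and a production injector would more likely operate on a syntax tree.

```python
import re
from typing import Optional, Tuple

# Matches `range(len(name))` where name is a simple identifier.
_RANGE_LEN = re.compile(r"range\(len\((\w+)\)\)")


def inject_off_by_one(source: str) -> Optional[Tuple[str, str]]:
    """Rewrite the first `range(len(x))` to overrun by one; None if no match."""
    mutated, count = _RANGE_LEN.subn(r"range(len(\1) + 1)", source, count=1)
    if count == 0:
        return None
    return mutated, "off-by-one"
```

The returned label can serve as the ground-truth issue for the mutated snippet.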
---
## Difficulty Levels
| Level | Description |
|-------|-------------|
| Easy | Obvious syntax error or unused variable |
| Medium | Logic bug, missing edge case handling |
| Hard | Security vulnerability, concurrency issue |
| Critical | Data corruption / memory leak pattern |
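
Since later snippets in an episode are described as harder, the levels in this table could drive snippet ordering. A minimal sketch, assuming a hypothetical `order_by_difficulty` helper and numeric ranks that simply follow the row order of the table:

```python
from typing import List, Tuple

# Order taken from the table above; the numeric ranks are an assumption.
DIFFICULTY_RANK = {"easy": 0, "medium": 1, "hard": 2, "critical": 3}


def order_by_difficulty(snippets: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (snippet_id, level) pairs from easiest to hardest; stable within a level."""
    return sorted(snippets, key=lambda item: DIFFICULTY_RANK[item[1].lower()])
```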