# TASKS.md — Environment Task Definition

> Senior OpenEnv Engineer Rule: **Real-world task. No toys. No games.**

---

## 🎯 Chosen Environment: Code Review Environment

**Name:** `code_review_env`  
**Domain:** Software Engineering / Developer Tooling  
**Task:** The agent acts as a code reviewer: given a code diff or snippet, it must produce a structured, high-quality review that identifies bugs, style issues, and improvement suggestions.

---

## Why This Is Real-World ✅
- Code review is a high-value, daily engineering task
- Clear, measurable correctness signals (bug found / not found, severity match)
- Rich feedback loop: agent learns what good reviews look like
- Direct production utility — can be deployed in CI/CD pipelines
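The "severity match" signal above suggests a natural reward shape. A minimal sketch, assuming the reward is F1 over issue labels plus a bonus for matching the overall severity; the 0.8/0.2 weighting and the `review_reward` name are illustrative, not part of the spec:

```python
def review_reward(predicted: set[str], expected: set[str],
                  severity_match: bool) -> float:
    """Score a review: F1 over issue labels plus a bonus for the
    correct overall severity. Weights here are illustrative."""
    if not expected:
        # Clean snippet: reward silence, penalize false alarms.
        return 1.0 if not predicted else 0.0
    tp = len(predicted & expected)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return 0.8 * f1 + (0.2 if severity_match else 0.0)
```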

---

## Episode Structure

```
reset()
  │
  └── Agent receives: code snippet + task context
        (e.g., language, PR description, critical path flag)

step(action)
  │
  └── Agent sends: structured review
        (issues: List[Issue], summary: str, severity: Severity)

  └── Environment returns:
        reward (float), feedback (str), done (bool)
```
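The loop above can be exercised end to end with a throwaway stub; `StubCodeReviewEnv` is a hypothetical stand-in for the real environment, included only to show the `reset()`/`step()` contract:

```python
from dataclasses import dataclass, field

@dataclass
class StubCodeReviewEnv:
    """Toy in-memory stand-in for code_review_env, showing only the
    reset()/step() shape sketched above (no real grading)."""
    snippets: list = field(default_factory=lambda: ["def f(): pass"])
    _i: int = 0

    def reset(self) -> dict:
        self._i = 0
        return {"code_snippet": self.snippets[0], "done": False}

    def step(self, action: dict) -> dict:
        self._i += 1
        done = self._i >= len(self.snippets)
        # The real env would grade the review; the stub returns a flat reward.
        return {"reward": 1.0, "feedback": "ok", "done": done}

env = StubCodeReviewEnv()
obs = env.reset()
result = env.step({"issues": [], "summary": "LGTM", "severity": "low"})
```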

---

## Action Space

```python
@dataclass
class CodeReviewAction(Action):
    issues: List[str]          # List of identified issues
    summary: str               # Overall review summary
    severity: str              # "low" | "medium" | "high" | "critical"
    metadata: Dict[str, Any]   # Optional extra context
```
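Since the dataclass above depends on the OpenEnv `Action` base class, here is a standalone sketch of how an agent might construct one; the severity validation in `__post_init__` is an assumed addition, not part of the spec:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

VALID_SEVERITIES = {"low", "medium", "high", "critical"}

@dataclass
class CodeReviewAction:  # in the env this would subclass the OpenEnv Action
    issues: List[str]
    summary: str
    severity: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        # Assumed guard: reject severities outside the documented enum.
        if self.severity not in VALID_SEVERITIES:
            raise ValueError(f"invalid severity: {self.severity!r}")

action = CodeReviewAction(
    issues=["possible SQL injection in query builder"],
    summary="Parameterize the query before merging.",
    severity="high",
)
```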

---

## Observation Space

```python
@dataclass
class CodeReviewObservation(Observation):
    done: bool
    reward: float
    code_snippet: str          # Code to review (current step)
    language: str              # e.g., "python", "javascript"
    context: str               # PR description or task context
    ground_truth_issues: List[str]   # Hidden from the agent; used by the env for grading
    feedback: str              # Human-readable feedback on last action
    step_number: int
```
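Because `ground_truth_issues` must stay hidden from the agent, the server has to strip it before serializing the observation. A minimal sketch; the `mask_observation` helper is hypothetical:

```python
def mask_observation(obs: dict) -> dict:
    """Return a copy of the observation that is safe to send to the
    agent: the hidden ground-truth labels are removed, everything
    else is kept as-is."""
    return {k: v for k, v in obs.items() if k != "ground_truth_issues"}

raw = {
    "code_snippet": "for i in range(len(xs)): print(xs[i])",
    "language": "python",
    "ground_truth_issues": ["non-idiomatic index loop"],
}
visible = mask_observation(raw)
```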

---

## State

```python
@dataclass
class CodeReviewState(State):
    episode_id: Optional[str]
    step_count: int
    total_snippets: int        # How many snippets in this episode
    cumulative_reward: float
    language: str
```

---

## Episode Flow

| Step | Agent Receives | Agent Sends | Env Returns |
|------|---------------|-------------|-------------|
| 1    | Code snippet #1 + context | Structured review | Reward + feedback |
| 2    | Code snippet #2 (harder) | Structured review | Reward + feedback |
| … | … | … | … |
| N    | Final snippet | Final review | Terminal reward, done=True |

---

## Data Sources
- [CodeSearchNet](https://github.com/github/CodeSearchNet) — multi-language code samples
- Synthetic bug injection (off-by-one, null dereference, SQL injection, etc.)
- Human-curated review gold standards (severity labels)
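Synthetic bug injection can be as simple as a string-level rewrite that records its own ground-truth label. A sketch for one bug class (off-by-one); the mutation rule is an illustrative assumption:

```python
def inject_off_by_one(code: str) -> tuple[str, list[str]]:
    """Inject an off-by-one bug by shrinking a range(len(...)) loop's
    bounds. Returns the mutated code plus ground-truth issue labels."""
    if "range(len(" in code:
        buggy = code.replace("range(len(", "range(1, len(", 1)
        return buggy, ["off-by-one: loop skips the first element"]
    # No injection site found: return the code unchanged, no labels.
    return code, []

clean = "for i in range(len(xs)):\n    total += xs[i]"
buggy, labels = inject_off_by_one(clean)
```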

---

## Difficulty Levels
| Level | Description |
|-------|-------------|
| Easy | Obvious syntax error or unused variable |
| Medium | Logic bug, missing edge case handling |
| Hard | Security vulnerability, concurrency issue |
| Critical | Data corruption / memory leak pattern |
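Difficulty can also scale the reward so that harder finds pay more; the multipliers below are illustrative assumptions, not part of the spec:

```python
# Assumed per-level multipliers; tune against the actual reward scale.
DIFFICULTY_WEIGHT = {"easy": 1.0, "medium": 1.5, "hard": 2.0, "critical": 3.0}

def weighted_reward(base: float, difficulty: str) -> float:
    """Scale a base review score by snippet difficulty; unknown
    levels fall back to a neutral 1.0 multiplier."""
    return base * DIFFICULTY_WEIGHT.get(difficulty, 1.0)
```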