# NOT_TO_DO.md: Bug Triage Environment Anti-Patterns

> Violations here = disqualification or severe score penalty.

---

## Disqualification Risks

1. **Hardcoded grader scores**: The grader must evaluate actual agent output against ground truth. If every episode returns the same score, DQ.
2. **Missing endpoints**: `/tasks`, `/grader`, `/baseline` are mandatory hackathon endpoints. Missing any = DQ.
3. **Non-running baseline**: `baseline.py` must execute without error. A crash on import or at runtime = DQ.
4. **Docker build failure**: The `Dockerfile` must build cleanly. Always test `docker build` before submission.
5. **`/reset` returns error**: This is the first thing evaluators test.
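
Risk 1 is easiest to avoid when the score is computed from a field-by-field comparison rather than returned as a constant. The following is a minimal sketch under stated assumptions: the function name `grade` and the `developer` key are hypothetical and are not taken from the project's code.

```python
def grade(action: dict, ground_truth: dict) -> float:
    """Return the fraction of ground-truth fields the agent matched.

    Hypothetical helper: the real grader may weight fields differently.
    """
    if not ground_truth:
        # Nothing to compare against, so no credit can be given.
        return 0.0
    matched = sum(
        1 for key, expected in ground_truth.items() if action.get(key) == expected
    )
    score = matched / len(ground_truth)
    # Keep the result inside the [0.0, 1.0] range the spec requires.
    return max(0.0, min(1.0, score))
```

Because the score depends on `action`, two different agent outputs for the same bug can no longer receive identical scores by construction.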

---

## Technical Anti-Patterns

| Don't | Why | Do Instead |
|-------|-----|-----------|
| Expose `ground_truth` in observation | Agent sees answers → invalid training | Only return bug report fields |
| Use hardcoded bug selection | Same bug every episode → DQ | `random.choice(self._bugs)` |
| Return reward outside [-0.5, 1.0] | Breaks GRPO training stability | Clamp: `max(-0.5, min(1.0, reward))` |
| Grader score outside [0.0, 1.0] | Violates spec | `max(0.0, min(1.0, score))` |
| Skip Pydantic validation | Runtime crashes in production | All models use `BaseModel` |
| Use mutable default args | Shared state bugs | `Field(default_factory=...)` |
| Forget `model_config = {"use_enum_values": True}` | Enums serialize as objects instead of strings | Set on BugTriageAction |
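
The first three rows of the table can be illustrated together. This is a sketch only: the `BUGS` records, their field names other than `ground_truth`, and the helper names are assumptions made for illustration, not the environment's actual data or API.

```python
import random

# Hypothetical bug records; the real dataset and its fields may differ.
BUGS = [
    {"title": "App crashes on login", "ground_truth": {"bug_type": "crash"}},
    {"title": "Button overlaps footer", "ground_truth": {"bug_type": "ui"}},
]


def pick_bug(bugs, rng=random):
    """Select a bug at random so episodes do not repeat the same report."""
    return rng.choice(bugs)


def to_observation(bug: dict) -> dict:
    """Copy the bug report without the answer key before showing it to the agent."""
    return {key: value for key, value in bug.items() if key != "ground_truth"}


def clamp_reward(reward: float) -> float:
    """Clamp the training reward to the [-0.5, 1.0] range."""
    return max(-0.5, min(1.0, reward))
```

Passing a seeded `random.Random` instance as `rng` keeps tests reproducible while leaving production episodes random.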

---

## Bug Triage Specific Don'ts

| Don't | Why |
|-------|-----|
| Accept any string for `bug_type` | Use the `BugType` enum; only 6 values are valid |
| Accept any developer name | Validate against `DEVELOPERS` list |
| Mix up reward vs grader | Reward = training signal. Grader = eval score. Different ranges. |
| Build a classifier model | We're building an RL *environment*, not a model |
| Train during hackathon eval | Environment must serve episodes, not train |
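
The first two rows amount to rejecting anything outside a closed set. A minimal sketch follows, with loud assumptions: only `crash` appears elsewhere in this document, so the other five `BugType` members, the contents of `DEVELOPERS`, and the `validate_action` helper are placeholders rather than the project's real values.

```python
from enum import Enum


class BugType(str, Enum):
    # Only "crash" is attested in this document; the rest are placeholders.
    CRASH = "crash"
    UI = "ui"
    PERFORMANCE = "performance"
    SECURITY = "security"
    DATA_LOSS = "data_loss"
    COMPATIBILITY = "compatibility"


# Hypothetical roster; the real DEVELOPERS list lives in the project.
DEVELOPERS = ["alice", "bob", "carol"]


def validate_action(bug_type: str, developer: str) -> tuple[BugType, str]:
    """Reject bug types outside the enum and developers outside the roster."""
    try:
        parsed = BugType(bug_type)
    except ValueError:
        raise ValueError(f"invalid bug_type: {bug_type!r}") from None
    if developer not in DEVELOPERS:
        raise ValueError(f"unknown developer: {developer!r}")
    return parsed, developer
```

In a Pydantic model the same effect would normally come from typing the field as the enum, so this standalone function only shows the rule being enforced.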

---

## Pre-Submission Validation (Run Before Submitting)

```bash
# 1. Health check
curl http://localhost:8000/health
# → {"status":"healthy"}

# 2. Reset all 3 tasks
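curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id":"task_1"}'
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id":"task_2"}'
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id":"task_3"}'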

# 3. Tasks endpoint
curl http://localhost:8000/tasks
# → 3 tasks with action schemas

# 4. Full episode (reset → step → grader)
EP=$(curl -s -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id":"task_1"}' | python3 -c "import sys,json; print(json.load(sys.stdin)['episode_id'])")
curl -X POST http://localhost:8000/step -H "Content-Type: application/json" -d "{\"episode_id\":\"$EP\",\"action\":{\"task_id\":\"task_1\",\"bug_type\":\"crash\"}}"
curl -X POST http://localhost:8000/grader -H "Content-Type: application/json" -d "{\"episode_id\":\"$EP\",\"task_id\":\"task_1\"}"

# 5. Baseline
GEMINI_API_KEY="..." python -m bug_triage_env.baseline --all-tasks --episodes 3

# 6. Docker build
docker build -f bug_triage_env/server/Dockerfile -t bug-triage-env .
docker run -d -p 8000:8000 bug-triage-env
curl http://localhost:8000/health
```
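
The full-episode check in step 4 can also be scripted with the Python standard library, which avoids shell quoting of nested JSON. This is a sketch, not part of the project: the helper names `json_post`, `send`, and `run_episode` are hypothetical, and the payload shapes are copied from the curl commands above. It sets an explicit JSON `Content-Type` header on the assumption that the server expects one.

```python
import json
import urllib.request


def json_post(url: str, payload: dict) -> urllib.request.Request:
    """Build a POST request carrying a JSON body and a JSON content type."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def run_episode(base_url: str = "http://localhost:8000", task_id: str = "task_1") -> dict:
    """Run reset, one step, then grader; return the grader's JSON response."""
    reset = send(json_post(f"{base_url}/reset", {"task_id": task_id}))
    episode_id = reset["episode_id"]
    action = {"task_id": task_id, "bug_type": "crash"}
    send(json_post(f"{base_url}/step", {"episode_id": episode_id, "action": action}))
    return send(json_post(f"{base_url}/grader", {"episode_id": episode_id, "task_id": task_id}))
```

Calling `run_episode()` against a running server raises `urllib.error.HTTPError` if any endpoint returns an error status, so a clean return is a quick signal that all three endpoints respond.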