Commit 47bc4be by Siteshcodes · 1 parent: 7eb0325

docs: comprehensive README with spec compliance checklist, log format, API examples

  1. README.md +151 -64
README.md CHANGED
@@ -9,68 +9,93 @@ tags:
  - openenv
  ---

- # Bug Triage Environment 🐛

  An OpenEnv reinforcement learning environment where an AI agent triages GitHub-style bug reports, assigning priority, labels, team ownership, and milestone, exactly as a senior engineer would.

- ## Why this environment?

  Every software team triages dozens of bug reports weekly. Getting prioritization wrong delays critical fixes and wastes engineering time. This environment trains and evaluates agents on real triage decision-making, with graders that reflect actual engineering judgment.
 
- ## Action space

  | Field | Type | Values |
  |-----------------|-----------|-------------------------------------------------|
- | `priority` | string | `P0` `P1` `P2` `P3` |
- | `labels` | list[str] | `bug` `performance` `security` `ux` `docs` … |
- | `assigned_team` | string | `backend` `frontend` `infra` `security` `devx` |
- | `milestone` | string | `hotfix` `v2.1` `backlog` |
- | `reasoning` | string | Free-form explanation |

- ## Observation space

  | Field | Type | Description |
  |--------------|-----------|------------------------------------------|
- | `bug_report` | BugReport | Title, body, author, comments |
- | `task_id` | string | Current difficulty: easy / medium / hard |
- | `score` | float | Cumulative score this episode |
- | `reward` | float | Reward from last action (0.05–0.95) |
  | `feedback` | string | Human-readable grader feedback |
  | `done` | bool | Episode complete flag |
 
  ## Tasks

- ### Task 1 - Easy (Priority labeling)
- Agent assigns a single P0–P3 priority to a bug report.
- - Grader: exact match = 0.95, one level off = 0.5, else 0.05
- - Grader weight: priority 100%
- - Reward range: 0.05–0.95

- ### Task 2 - Medium (Priority + labels + team)
- Agent assigns priority, category labels, and team routing.
- - Grader: priority 45% + label Jaccard similarity 40% + team routing 15%
- - Reward range: 0.05–0.95

- ### Task 3 - Hard (Full triage)
- Agent must assign priority, labels, team, and milestone. Security escalation failures are penalized.
- - Grader: priority 35% + labels 30% + team 20% + milestone 15%
- - Penalty: −0.15 for missing security escalation
- - Reward range: 0.05–0.95
 
- ## Reward function

- Rewards are provided at every step (not just end of episode):
- - Partial credit for close-but-not-exact priority (0.5 vs 0.05 vs 0.95)
- - Label overlap via Jaccard similarity (continuous signal)
- - Team routing accuracy (binary, but weighted)
- - Security escalation penalty discourages ignoring critical signals
- - All scores clamped strictly to (0.05, 0.95)
 
  ## Setup

- ### Run locally
  ```bash
- git clone https://huggingface.co/spaces/Siteshcodes/bug-triage-env
  cd bug-triage-env
  pip install -r server/requirements.txt
  uvicorn server.app:app --host 0.0.0.0 --port 7860
@@ -82,9 +107,9 @@ docker build -t bug-triage-env .
  docker run -p 7860:7860 bug-triage-env
  ```
 
- ### Run inference (hackathon submission script)
  ```bash
- pip install openai openenv-core
  export API_BASE_URL=https://router.huggingface.co/v1
  export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct
  export HF_TOKEN=your_hf_token_here
@@ -92,27 +117,31 @@ export ENV_BASE_URL=https://siteshcodes-bug-triage-env.hf.space
  python inference.py
  ```
 
- ### Run baseline (development script)
- ```bash
- pip install groq openenv-core
- export GROQ_API_KEY=your_key_here
- python baseline.py
- ```

- Get a free Groq API key at [console.groq.com](https://console.groq.com).
 
- ## Baseline scores

  Evaluated with `meta-llama/Llama-3.3-70B-Instruct` via HuggingFace router (temperature=0):

- | Task | Score |
- |------------|-------|
- | Easy | 0.950 |
- | Medium | 0.500 |
- | Hard | 0.850 |
- | **Avg** | **0.767** |

- Scores vary per run due to random bug sampling from a pool of 5 bugs per task.
 
  ## API Endpoints

@@ -125,20 +154,78 @@ Scores vary per run due to random bug sampling from a pool of 5 bugs per task.
  | GET | `/tasks` | List all tasks with grader info |
  | GET | `/tasks/{id}` | Get specific task metadata |
 
- ## Project structure

  ```
  bug-triage-env/
  ├── server/
- │   ├── app.py            # FastAPI + OpenEnv entrypoint
- │   ├── environment.py    # BugTriageEnvironment core logic
- │   ├── task.py           # Bug reports + graders
  │   └── requirements.txt
- ├── model.py              # Pydantic models
- ├── client.py             # HTTP client
- ├── baseline.py           # Groq development script
- ├── inference.py          # OpenAI client submission script
- ├── openenv.yaml          # OpenEnv spec metadata
- ├── Dockerfile
  └── README.md
- ```
 
  - openenv
  ---

+ # 🐛 Bug Triage Environment
+
+ > **OpenEnv RL environment for the Meta PyTorch Hackathon x Scaler School of Technology**

  An OpenEnv reinforcement learning environment where an AI agent triages GitHub-style bug reports, assigning priority, labels, team ownership, and milestone, exactly as a senior engineer would.

+ **Live:** [https://siteshcodes-bug-triage-env.hf.space](https://siteshcodes-bug-triage-env.hf.space)
+ **GitHub:** [https://github.com/Siteshcodes/bug-triage-env](https://github.com/Siteshcodes/bug-triage-env)
+
+ ---
+
+ ## Why This Environment?

  Every software team triages dozens of bug reports weekly. Getting prioritization wrong delays critical fixes and wastes engineering time. This environment trains and evaluates agents on real triage decision-making, with graders that reflect actual engineering judgment.
 
+ **Key features:**
+ - 🎯 Simulates a real-world engineering task (not a game or toy)
+ - 📊 3 tasks of increasing difficulty with deterministic graders
+ - 🔄 Meaningful partial-credit reward function
+ - 🛡️ Security escalation penalty for missed critical vulnerabilities
+ - 📦 Full OpenEnv spec compliance: `step()` / `reset()` / `state()`
+
+ ---
+
+ ## Action Space
 
  | Field | Type | Values |
  |-----------------|-----------|-------------------------------------------------|
+ | `priority` | string | `P0` · `P1` · `P2` · `P3` |
+ | `labels` | list[str] | `bug` · `performance` · `security` · `ux` · `data-integrity` · `payments` … |
+ | `assigned_team` | string | `backend` · `frontend` · `infra` · `security` · `devx` |
+ | `milestone` | string | `hotfix` · `v2.1` · `backlog` |
+ | `reasoning` | string | Free-form explanation of triage decision |

+ ## Observation Space
 
  | Field | Type | Description |
  |--------------|-----------|------------------------------------------|
+ | `bug_report` | BugReport | Title, body, author, labels_hint, comments |
+ | `task_id` | string | Current difficulty: `easy` / `medium` / `hard` |
+ | `score` | float | Score from grader (0.0–1.0) |
+ | `reward` | float | Reward from last action (0.0–1.0) |
  | `feedback` | string | Human-readable grader feedback |
  | `done` | bool | Episode complete flag |

+ ---
+
  ## Tasks
 
+ ### Task 1 - Easy: Priority Assignment
+ Assign a single P0–P3 priority to a bug report.
+ - **Grader:** `server.task:priority_match`
+ - **Scoring:** exact match → 0.95, one level off → 0.50, else → 0.05
+ - **Weight:** priority 100%
+ - **Reward range:** (0.0, 1.0), strictly exclusive
+
+ ### Task 2 - Medium: Priority + Labels + Team
+ Assign priority, category labels, and team routing.
+ - **Grader:** `server.task:priority_label_team`
+ - **Scoring:** priority 45% + label Jaccard similarity 40% + team routing 15%
+ - **Reward range:** (0.0, 1.0), strictly exclusive
+
+ ### Task 3 - Hard: Full Triage
+ Assign priority, labels, team, and milestone. Security escalation failures are penalized.
+ - **Grader:** `server.task:full_triage`
+ - **Scoring:** priority 35% + labels 30% + team 20% + milestone 15%
+ - **Penalty:** −0.15 for missing security escalation (e.g., SQL injection assigned to `backend` instead of `security`)
+ - **Reward range:** (0.0, 1.0), strictly exclusive
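The Task 1 scoring rule can be sketched as a small function. This is a minimal illustration, not the actual `server.task:priority_match` grader: the name `score_priority` and the treatment of unrecognized priority strings as a miss are assumptions made for the example.

```python
# Priority levels in order of urgency, as listed in the action space.
PRIORITY_LEVELS = ["P0", "P1", "P2", "P3"]


def score_priority(predicted: str, expected: str) -> float:
    """Hypothetical helper mirroring the documented Task 1 scoring rule."""
    if predicted not in PRIORITY_LEVELS:
        # Assumption: an unrecognized priority is scored like a full miss.
        return 0.05
    distance = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(expected))
    if distance == 0:
        return 0.95  # exact match
    if distance == 1:
        return 0.50  # one level off
    return 0.05      # two or more levels off
```

For example, `score_priority("P1", "P0")` gives the one-level-off partial credit of 0.50.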
 
+ ---

+ ## Reward Function

+ Rewards provide meaningful partial-credit signals at every step:
+ - **Priority:** Close-but-wrong answers get partial credit (0.95 for an exact match, 0.50 for one level off, 0.05 for two or more levels off)
+ - **Labels:** Jaccard similarity between predicted and expected label sets (continuous signal)
+ - **Team routing:** Binary accuracy, weighted per task difficulty
+ - **Security escalation:** Hard penalty (−0.15) discourages ignoring critical security signals
+ - **Clamping:** All scores strictly within (0.0, 1.0), never exactly 0 or 1
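A rough sketch of how these components might combine for the hard task, using the documented 35/30/20/15 weights and the −0.15 penalty. Everything beyond those numbers is an assumption: the function names are hypothetical, team and milestone are scored as 1.0 or 0.0, the penalty fires when the expected team is `security` and the prediction is not, and the clamp bounds 0.05 and 0.95 are borrowed from the priority scoring values. The actual `server.task:full_triage` grader may differ.

```python
def jaccard(predicted: set, expected: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two label sets."""
    union = predicted | expected
    if not union:
        return 1.0  # assumption: two empty label sets count as a perfect match
    return len(predicted & expected) / len(union)


def clamp(score: float, low: float = 0.05, high: float = 0.95) -> float:
    """Keep the score strictly inside (0.0, 1.0); the bounds are assumed."""
    return max(low, min(high, score))


def full_triage_score(priority_score, predicted_labels, expected_labels,
                      predicted_team, expected_team,
                      predicted_milestone, expected_milestone):
    """Hypothetical weighted sum for the hard task."""
    team_ok = 1.0 if predicted_team == expected_team else 0.0
    milestone_ok = 1.0 if predicted_milestone == expected_milestone else 0.0
    raw = (0.35 * priority_score
           + 0.30 * jaccard(set(predicted_labels), set(expected_labels))
           + 0.20 * team_ok
           + 0.15 * milestone_ok)
    # Assumed trigger for the documented security escalation penalty.
    if expected_team == "security" and predicted_team != "security":
        raw -= 0.15
    return clamp(raw)
```

Under these assumptions a fully correct answer sums to 0.9825 and is clamped to 0.95, while the same answer routed to `backend` instead of `security` loses both the team weight and the penalty.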
 
+ ---

  ## Setup

+ ### Run Locally
  ```bash
+ git clone https://github.com/Siteshcodes/bug-triage-env.git
  cd bug-triage-env
  pip install -r server/requirements.txt
  uvicorn server.app:app --host 0.0.0.0 --port 7860

  docker run -p 7860:7860 bug-triage-env
  ```
 
+ ### Run Inference (Hackathon Submission Script)
  ```bash
+ pip install openai openenv-core requests pydantic
  export API_BASE_URL=https://router.huggingface.co/v1
  export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct
  export HF_TOKEN=your_hf_token_here

  python inference.py
  ```
 
+ ### Environment Variables

+ | Variable | Description | Required |
+ |----------------|--------------------------------------|----------|
+ | `API_BASE_URL` | LLM API endpoint | Yes |
+ | `MODEL_NAME` | Model identifier for inference | Yes |
+ | `HF_TOKEN` | Hugging Face / API key | Yes |
+ | `ENV_BASE_URL` | Bug Triage environment URL | Optional |
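One way a script could read these variables is sketched below. The helper `load_config` is hypothetical, and falling back to the live Space URL when `ENV_BASE_URL` is unset is an assumption; `inference.py` may handle defaults differently.

```python
import os

# Assumption: the live Space URL is a sensible default for the optional variable.
DEFAULT_ENV_BASE_URL = "https://siteshcodes-bug-triage-env.hf.space"
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_config(env=None):
    """Collect the variables from the table, failing fast on missing required ones."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return {
        "api_base_url": env["API_BASE_URL"],
        "model_name": env["MODEL_NAME"],
        "hf_token": env["HF_TOKEN"],
        "env_base_url": env.get("ENV_BASE_URL", DEFAULT_ENV_BASE_URL),
    }
```

Calling `load_config()` with no argument reads from the process environment; passing a plain dict makes the function easy to exercise in tests.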
 
+ ---
+
+ ## Baseline Scores

  Evaluated with `meta-llama/Llama-3.3-70B-Instruct` via HuggingFace router (temperature=0):

+ | Task | Difficulty | Score |
+ |------------|------------|-------|
+ | Easy | easy | 0.95 |
+ | Medium | medium | 0.50 |
+ | Hard | hard | 0.85 |
+ | **Average**| | **0.77** |
+
+ > Scores vary per run due to random bug sampling from a pool of 5 bugs per task.
 
+ ---

  ## API Endpoints


  | GET | `/tasks` | List all tasks with grader info |
  | GET | `/tasks/{id}` | Get specific task metadata |
 
+ ### Example: Reset + Step
+
+ ```bash
+ # Reset for easy task
+ curl -X POST https://siteshcodes-bug-triage-env.hf.space/reset \
+   -H "Content-Type: application/json" \
+   -d '{"task_id": "easy"}'
+
+ # Submit triage action
+ curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
+   -H "Content-Type: application/json" \
+   -d '{"action": {"priority": "P0", "labels": ["bug"], "assigned_team": "backend", "milestone": "hotfix", "reasoning": "App crash affecting all users"}}'
+ ```
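The same two calls can be made from Python using only the standard library. This is a sketch rather than the project's own client: the helper names are hypothetical, and because the response schema is not shown here, the parsed JSON is returned as-is.

```python
import json
import urllib.request

BASE_URL = "https://siteshcodes-bug-triage-env.hf.space"


def build_request(path, payload):
    """Build a JSON POST request for an environment endpoint."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, timeout=30):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_easy_episode():
    """Reset to the easy task, then submit one triage action (needs network access)."""
    post_json("/reset", {"task_id": "easy"})
    action = {
        "priority": "P0",
        "labels": ["bug"],
        "assigned_team": "backend",
        "milestone": "hotfix",
        "reasoning": "App crash affecting all users",
    }
    return post_json("/step", {"action": action})
```

Calling `run_easy_episode()` performs the reset and step shown in the curl example and returns whatever JSON the `/step` endpoint sends back.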
+
+ ---
+
+ ## Inference Log Format
+
+ The inference script emits structured logs per the OpenEnv spec:
+
+ ```
+ [START] task=easy env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+ [STEP] step=1 action=priority=P0,team=backend,milestone=hotfix reward=0.95 done=true error=null
+ [END] success=true steps=1 score=0.95 rewards=0.95
+
+ [START] task=medium env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+ [STEP] step=1 action=priority=P0,team=backend,milestone=hotfix reward=0.85 done=true error=null
+ [END] success=true steps=1 score=0.85 rewards=0.85
+
+ [START] task=hard env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+ [STEP] step=1 action=priority=P0,team=security,milestone=hotfix reward=0.72 done=true error=null
+ [END] success=true steps=1 score=0.72 rewards=0.72
+ ```
+
+ Each task gets its own `[START]` → `[STEP]` → `[END]` block.
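Logs in this shape are straightforward to parse. The sketch below is an illustration only; `parse_log_line` is a hypothetical helper, and it assumes that fields are separated by spaces and that values never contain spaces, which holds for the sample above.

```python
import re

# Matches a leading [START], [STEP] or [END] tag followed by key=value fields.
LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def parse_log_line(line):
    """Return (tag, fields) for a structured log line, or None for anything else."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    tag, rest = match.groups()
    fields = {}
    for token in rest.split():
        # Split on the first "=" only, so action=priority=P0,... stays intact.
        key, _, value = token.partition("=")
        fields[key] = value
    return tag, fields
```

For instance, parsing the first `[STEP]` line above yields the tag `STEP` with `fields["reward"]` equal to the string `"0.95"`; converting values to numbers or booleans is left to the caller.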
+
+ ---
+
+ ## Project Structure

  ```
  bug-triage-env/
  ├── server/
+ │   ├── app.py            # FastAPI + OpenEnv stateful endpoints
+ │   ├── environment.py    # BugTriageEnvironment (reset/step/state)
+ │   ├── task.py           # 15 bug reports + 3 graders
+ │   ├── __init__.py
  │   └── requirements.txt
+ ├── model.py              # Pydantic models (TriageAction, TriageObservation, TriageState)
+ ├── inference.py          # OpenAI client submission script (per-task logs)
+ ├── openenv.yaml          # OpenEnv spec manifest (3 tasks with graders)
+ ├── Dockerfile            # Docker container config
+ ├── pyproject.toml        # Package metadata
  └── README.md
+ ```
+
+ ---
+
+ ## OpenEnv Spec Compliance
+
+ | Requirement | Status |
+ |-------------------------------------|--------|
+ | Typed models (Action/Observation/State) | ✅ |
+ | `step()` / `reset()` / `state()` API | ✅ |
+ | `openenv.yaml` manifest | ✅ |
+ | 3+ tasks with graders (easy→hard) | ✅ |
+ | Reward range strictly (0.0, 1.0) | ✅ |
+ | Baseline inference with reproducible scores | ✅ |
+ | Dockerfile builds | ✅ |
+ | Deployed on HF Spaces | ✅ |
+ | Structured `[START]/[STEP]/[END]` logs | ✅ |
+
+ ---
+
+ *Built for the Meta PyTorch Hackathon x Scaler School of Technology, Round 1*