Siteshcodes committed on
Commit c8556a4 · unverified · 1 Parent(s): c700066

Revise task descriptions and update script references

Updated README.md to clarify task descriptions, update scoring details, and correct script names.

Files changed (1)
  1. README.md +44 -33
README.md CHANGED
@@ -25,43 +25,41 @@ actual engineering judgment.

  ## Action space

- | Field | Type | Values |
- |-----------------|-------------|---------------------------------------------|
- | `priority` | string | `P0` `P1` `P2` `P3` |
- | `labels` | list[str] | `bug` `performance` `security` `ux` `docs`… |
- | `assigned_team` | string | `backend` `frontend` `infra` `security` `devx` |
- | `milestone` | string | `hotfix` `v2.1` `backlog` |
- | `reasoning` | string | Free-form explanation |
+ | Field | Type | Values |
+ |---|---|---|
+ | `priority` | string | `P0` `P1` `P2` `P3` |
+ | `labels` | list[str] | `bug` `performance` `security` `ux` `docs`… |
+ | `assigned_team` | string | `backend` `frontend` `infra` `security` `devx` |
+ | `milestone` | string | `hotfix` `v2.1` `backlog` |
+ | `reasoning` | string | Free-form explanation |

  ## Observation space

- | Field | Type | Description |
- |--------------|-------------|------------------------------------------|
- | `bug_report` | BugReport | Title, body, author, comments |
- | `task_id` | string | Current difficulty: easy / medium / hard |
- | `score` | float | Cumulative score this episode |
- | `reward` | float | Reward from last action (0.0–1.0) |
- | `feedback` | string | Human-readable grader feedback |
- | `done` | bool | Episode complete flag |
+ | Field | Type | Description |
+ |---|---|---|
+ | `bug_report` | BugReport | Title, body, author, comments |
+ | `task_id` | string | Current difficulty: easy / medium / hard |
+ | `score` | float | Cumulative score this episode |
+ | `reward` | float | Reward from last action (0.0–1.0) |
+ | `feedback` | string | Human-readable grader feedback |
+ | `done` | bool | Episode complete flag |

  ## Tasks

  ### Task 1 — Easy (Priority labeling)
  Agent assigns a single P0–P3 priority to a bug report.
  - Grader: exact match = 1.0, one level off = 0.5, else 0.0
- - Expected baseline score: ~0.75
+ - Grader weight: priority 100%

  ### Task 2 — Medium (Priority + label classification)
- Agent assigns priority AND a set of category labels.
- - Grader: 50% priority score + 50% Jaccard label similarity
- - Expected baseline score: ~0.60
+ Agent assigns priority AND a set of category labels AND team routing.
+ - Grader: priority 45% + label Jaccard similarity 40% + team routing 15%

  ### Task 3 — Hard (Full triage)
  Agent must assign priority, labels, team, and milestone.
  Security escalation failures are penalized.
- - Grader: 40% priority + 35% labels + 25% team routing
+ - Grader: priority 35% + labels 30% + team 20% + milestone 15%
  - Penalty: −0.15 for missing security escalation
- - Expected baseline score: ~0.45

  ## Reward function

@@ -87,25 +85,37 @@ docker build -t bug-triage-env .
  docker run -p 7860:7860 bug-triage-env
  ```

- ### Run baseline
+ ### Run inference (hackathon submission script)
  ```bash
- pip install groq openenv-core websockets
+ pip install openai openenv-core
+ export API_BASE_URL=https://router.huggingface.co/v1
+ export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct
+ export HF_TOKEN=your_hf_token_here
+ export ENV_BASE_URL=https://siteshcodes-bug-triage-env.hf.space
+ python inference.py
+ ```
+
+ ### Run baseline (development script)
+ ```bash
+ pip install groq openenv-core
  export GROQ_API_KEY=your_key_here
  python baseline.py
  ```
-
  Get a free Groq API key at [console.groq.com](https://console.groq.com).

  ## Baseline scores

- Evaluated with `llama-3.3-70b-versatile` via Groq (temperature=0):
+ Evaluated with `meta-llama/Llama-3.3-70B-Instruct` via HuggingFace router
+ (temperature=0):
+
+ | Task | Score |
+ |---|---|
+ | Easy | 0.000 |
+ | Medium | 0.000 |
+ | Hard | 0.500 |
+ | **Avg** | **0.167** |

- | Task | Score |
- |--------|-------|
- | Easy | ~0.75 |
- | Medium | ~0.60 |
- | Hard | ~0.45 |
- | **Avg**| **~0.60** |
+ > Scores vary per run due to random bug sampling from a pool of 5 bugs per task.

  ## Project structure
  ```
@@ -117,8 +127,9 @@ bug-triage-env/
  │   └── requirements.txt
  ├── model.py # Dataclass models
  ├── client.py # WebSocket client
- ├── baseline.py # Groq inference script
+ ├── baseline.py # Groq development script
+ ├── inference.py # OpenAI client submission script
  ├── openenv.yaml # OpenEnv spec metadata
  ├── Dockerfile
  └── README.md
- ```
+ ```
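
Review note: the revised grader weights in the Tasks section can be sanity-checked with a small sketch. This is not the repository's grader. The function names (`priority_score`, `jaccard`, `grade`), the exact-match scoring for team and milestone, the handling of two empty label sets, the clamping to the 0.0–1.0 reward range, and the `missed_security_escalation` flag are all assumptions made for illustration. Only the weights, the priority rule, and the 0.15 penalty come from the README text in this commit.

```python
from typing import Iterable, Mapping

PRIORITIES = ["P0", "P1", "P2", "P3"]

# Per-task weights as listed in the updated README (the "+" lines of this diff).
WEIGHTS = {
    "easy": {"priority": 1.00},
    "medium": {"priority": 0.45, "labels": 0.40, "team": 0.15},
    "hard": {"priority": 0.35, "labels": 0.30, "team": 0.20, "milestone": 0.15},
}
SECURITY_PENALTY = 0.15  # README: penalty for missing security escalation


def priority_score(predicted: str, expected: str) -> float:
    """Exact match = 1.0, one level off = 0.5, else 0.0 (README rule)."""
    distance = abs(PRIORITIES.index(predicted) - PRIORITIES.index(expected))
    if distance == 0:
        return 1.0
    if distance == 1:
        return 0.5
    return 0.0


def jaccard(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Jaccard similarity |A & B| / |A | B| between two label sets."""
    a, b = set(predicted), set(expected)
    if not a and not b:
        return 1.0  # assumption: two empty label sets count as a perfect match
    return len(a & b) / len(a | b)


def grade(task_id: str, action: Mapping, truth: Mapping,
          missed_security_escalation: bool = False) -> float:
    """Weighted sum of component scores for one action (hypothetical helper)."""
    parts = {
        "priority": priority_score(action["priority"], truth["priority"]),
        "labels": jaccard(action.get("labels", []), truth.get("labels", [])),
        # Assumption: team and milestone are scored by exact string match.
        "team": float(action.get("assigned_team") == truth.get("assigned_team")),
        "milestone": float(action.get("milestone") == truth.get("milestone")),
    }
    total = sum(weight * parts[name] for name, weight in WEIGHTS[task_id].items())
    if task_id == "hard" and missed_security_escalation:
        total -= SECURITY_PENALTY
    # Assumption: clamp to the 0.0-1.0 reward range from the observation table.
    return max(0.0, min(1.0, total))
```

Under these assumptions, a medium-task action that is one priority level off, shares one of two labels with the reference, and routes to the correct team would score 0.45 × 0.5 + 0.40 × 0.5 + 0.15 × 1.0 = 0.575. The weights in each task sum to 100%, so a fully correct hard-task action that misses a security escalation would land at 0.85.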