RAHUL-13 committed on
Commit df5ec5d · verified · 1 Parent(s): 1f335e6

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +136 -0
README.md CHANGED
@@ -1,4 +1,140 @@
1
  ---
2
  title: Bug Report Structuring Env
3
+ emoji: "\U0001F41B"
4
+ colorFrom: red
5
+ colorTo: yellow
6
  sdk: docker
7
+ pinned: false
8
  ---
9
+
10
+ # Bug Report Structuring Environment
11
+
12
+ An **OpenEnv** environment that challenges LLM agents to convert messy, unstructured bug reports into well-organized, structured formats.
13
+
14
+ ## Overview
15
+
16
+ Bug reports in the wild are often poorly written: missing steps, ambiguous descriptions, wrong severity labels, and scattered technical details. This environment tests an LLM agent's ability to:
17
+
18
+ 1. **Extract** key information from noisy text
19
+ 2. **Classify** severity accurately based on impact
20
+ 3. **Structure** reproduction steps in a clear, actionable format
21
+ 4. **Identify** environment details (OS, browser, versions)
22
+ 5. **Handle** compound reports with multiple distinct issues
23
+
24
+ ## Tasks
25
+
26
+ | Task | Difficulty | Max Steps | Description |
27
+ |------|-----------|-----------|-------------|
28
+ | `easy` | 🟢 Easy | 3 | Single clear bug, all info present but messy |
29
+ | `medium` | 🟡 Medium | 4 | Multiple symptoms, ambiguity, partial info |
30
+ | `hard` | 🔴 Hard | 5 | Multiple distinct bugs, technical details |
31
+
32
+ ## API Endpoints
33
+
34
+ | Method | Endpoint | Description |
35
+ |--------|----------|-------------|
36
+ | `POST` | `/reset` | Start a new episode with `{"task_id": "easy\|medium\|hard"}` |
37
+ | `POST` | `/step` | Submit structured report, get score + feedback |
38
+ | `GET` | `/state` | Get current episode metadata |
39
+ | `GET` | `/health` | Health check |
40
+ | `GET` | `/docs` | Interactive API documentation |
41
+
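The endpoints above can be exercised with a small client. The following is a minimal sketch using only the Python standard library. `BASE_URL` and the helper names `reset_payload` and `post_json` are illustrative and not part of the project. The response fields are not specified here, so the sketch returns the decoded JSON unchanged.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: server running locally on the default port
TASK_IDS = ("easy", "medium", "hard")


def reset_payload(task_id: str) -> dict:
    """Build the /reset request body, rejecting unknown task ids early."""
    if task_id not in TASK_IDS:
        raise ValueError(f"task_id must be one of {TASK_IDS}, got {task_id!r}")
    return {"task_id": task_id}


def post_json(path: str, payload: dict, base_url: str = BASE_URL) -> dict:
    """POST a JSON body to the environment and return the decoded JSON reply."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running server):
#   observation = post_json("/reset", reset_payload("easy"))
#   result = post_json("/step", {...structured report...})
```

Validating `task_id` on the client side is a convenience only; the server remains the authority on which tasks exist.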
42
+ ## Action Schema
43
+
44
+ The agent submits a structured bug report as JSON:
45
+
46
+ ```json
47
+ {
48
+ "title": "Clear, concise bug title",
49
+ "steps_to_reproduce": "1. Step one\n2. Step two\n...",
50
+ "expected_behavior": "What should happen",
51
+ "actual_behavior": "What actually happens",
52
+ "severity": "low|medium|high|critical",
53
+ "environment": "OS, browser, version info",
54
+ "additional_notes": "Any other relevant details"
55
+ }
56
+ ```
57
+
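One way an agent could assemble and sanity-check this payload before sending it is sketched below. It uses a plain dataclass to stay dependency-free and is not the project's `models.py`. The class name `BugReportAction`, the helper `to_step_payload`, and the choice to treat `environment` and `additional_notes` as optional are all assumptions made for illustration.

```python
from dataclasses import asdict, dataclass

SEVERITIES = ("low", "medium", "high", "critical")


@dataclass
class BugReportAction:
    title: str
    steps_to_reproduce: str
    expected_behavior: str
    actual_behavior: str
    severity: str
    environment: str = ""        # assumption: optional
    additional_notes: str = ""   # assumption: optional

    def __post_init__(self) -> None:
        # Normalise before checking so " High " is accepted as "high".
        self.severity = self.severity.strip().lower()
        if self.severity not in SEVERITIES:
            raise ValueError(f"severity must be one of {SEVERITIES}")


def to_step_payload(action: BugReportAction) -> dict:
    """Convert the action into a JSON-serialisable dict with the seven schema keys."""
    return asdict(action)
```

Rejecting an unknown severity locally avoids spending one of the limited steps on a submission that cannot score on that dimension.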
58
+ ## Scoring
59
+
60
+ Reports are graded on 7 dimensions (each 0.0–1.0):
61
+
62
+ | Dimension | Weight | What's Evaluated |
63
+ |-----------|--------|------------------|
64
+ | Title | 15% | Clarity and descriptiveness |
65
+ | Steps to Reproduce | 25% | Completeness and specificity |
66
+ | Expected Behavior | 15% | Accuracy of expected state |
67
+ | Actual Behavior | 15% | Accuracy of reported symptoms |
68
+ | Severity | 15% | Correct classification |
69
+ | Environment | 10% | Platform/version extraction |
70
+ | Format | 5% | Structural completeness |
71
+
72
+ **Partial credit** is awarded based on keyword coverage, so you don't need a perfect match to earn points.
73
+
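The weights in the table sum to 100%, so the overall score can be read as a weighted average of the seven dimension scores. The sketch below illustrates that arithmetic together with a simple keyword-coverage measure. The function names and the substring-matching rule are assumptions for illustration; the actual logic in `graders.py` may differ.

```python
WEIGHTS = {
    "title": 0.15,
    "steps_to_reproduce": 0.25,
    "expected_behavior": 0.15,
    "actual_behavior": 0.15,
    "severity": 0.15,
    "environment": 0.10,
    "format": 0.05,
}


def keyword_coverage(text: str, keywords: list[str]) -> float:
    """Fraction of expected keywords found in the text (case-insensitive substring match)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in lowered)
    return hits / len(keywords)


def weighted_total(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0.0 to 1.0) using the weights above."""
    return sum(WEIGHTS[name] * dimension_scores.get(name, 0.0) for name in WEIGHTS)
```

For example, a report that scores 1.0 on every dimension except 0.5 on Steps to Reproduce loses 0.25 × 0.5 = 0.125 and ends at 0.875.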
74
+ ## Quick Start
75
+
76
+ ### Run Locally
77
+
78
+ ```bash
79
+ pip install -r requirements.txt
80
+ python app.py
81
+ # Server runs at http://localhost:7860
82
+ ```
83
+
84
+ ### Docker
85
+
86
+ ```bash
87
+ docker build -t bug-report-env .
88
+ docker run -p 7860:7860 bug-report-env
89
+ ```
90
+
91
+ ### Run Inference
92
+
93
+ ```bash
94
+ export API_BASE_URL="https://api-inference.huggingface.co/v1"
95
+ export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
96
+ export HF_TOKEN="hf_your_token_here"
97
+ export ENV_URL="https://your-space.hf.space"
98
+
99
+ python inference.py
100
+ ```
101
+
102
+ ## Project Structure
103
+
104
+ ```
105
+ ├── app.py            # FastAPI server with all endpoints
106
+ ├── environment.py    # Core environment logic (reset/step/state)
107
+ ├── models.py         # Pydantic request/response models
108
+ ├── tasks.py          # Task definitions with ground truth
109
+ ├── graders.py        # Deterministic grading logic
110
+ ├── inference.py      # LLM agent inference script
111
+ ├── openenv.yaml      # OpenEnv environment manifest
112
+ ├── Dockerfile        # Container definition for HF Spaces
113
+ ├── requirements.txt  # Python dependencies
114
+ └── README.md         # This file
115
+ ```
116
+
117
+ ## Environment Variables
118
+
119
+ | Variable | Description | Required |
120
+ |----------|-------------|----------|
121
+ | `API_BASE_URL` | LLM API base URL | For inference |
122
+ | `MODEL_NAME` | LLM model identifier | For inference |
123
+ | `HF_TOKEN` | Hugging Face token | For inference |
124
+ | `ENV_URL` | Deployed environment URL | For inference |
125
+ | `PORT` | Server port (default: 7860) | Optional |
126
+
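A script that consumes these variables might load them as sketched below. This is a hypothetical helper, not the contents of `inference.py`: the names `InferenceConfig`, `load_inference_config`, and `server_port` are invented for illustration, and failing fast on a missing variable is a design choice assumed here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    hf_token: str
    env_url: str


def load_inference_config(environ=os.environ) -> InferenceConfig:
    """Read the four inference variables, failing fast if any is missing or empty."""
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN", "ENV_URL")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return InferenceConfig(
        api_base_url=environ["API_BASE_URL"],
        model_name=environ["MODEL_NAME"],
        hf_token=environ["HF_TOKEN"],
        env_url=environ["ENV_URL"].rstrip("/"),  # avoid double slashes when joining paths
    )


def server_port(environ=os.environ) -> int:
    """Return the server port, falling back to the documented default of 7860."""
    return int(environ.get("PORT", "7860"))
```

Passing the environment as a parameter keeps the helpers easy to test with a plain dictionary.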
127
+ ## Deployment
128
+
129
+ This environment is designed for deployment on **Hugging Face Spaces** using the Docker SDK:
130
+
131
+ 1. Create a new Space on Hugging Face (Docker SDK)
132
+ 2. Push the project files
133
+ 3. The Space will build and serve automatically on port 7860
134
+
135
+ ## Technical Details
136
+
137
+ - **No external dependencies for grading**: Grading is fully deterministic and uses keyword matching, so no LLM is needed server-side
138
+ - **Concurrent sessions**: Supports multiple simultaneous agents
139
+ - **Reward shaping**: First step gets full score as reward; subsequent steps reward improvement only
140
+ - **Runtime**: Well under the 20-minute limit on 2 vCPU / 8 GB RAM
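
The reward-shaping rule in the list above can be sketched as follows. The README does not say what "improvement" is measured against, so this sketch assumes it means the gain over the best score seen so far in the episode, with a floor of zero. The class name `RewardTracker` is hypothetical, and `environment.py` may implement the rule differently.

```python
class RewardTracker:
    """Tracks the best score in an episode and turns scores into shaped rewards."""

    def __init__(self) -> None:
        self.best_score = None  # None until the first submission

    def reward(self, score: float) -> float:
        # First step: the full score is the reward.
        if self.best_score is None:
            self.best_score = score
            return score
        # Later steps: reward only the gain over the best previous score (assumption).
        improvement = max(0.0, score - self.best_score)
        self.best_score = max(self.best_score, score)
        return improvement
```

Under this assumption, resubmitting a weaker report earns nothing, and the rewards in an episode sum to the best score achieved.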