---
title: Bug Report Structuring Env
emoji: "\U0001F41B"
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
---
# Bug Report Structuring Environment
An **OpenEnv** environment that challenges LLM agents to convert messy, unstructured bug reports into well-organized, structured formats.
## Overview
Bug reports in the wild are often poorly written: missing steps, ambiguous descriptions, wrong severity labels, and scattered technical details. This environment tests an LLM agent's ability to:
1. **Extract** key information from noisy text
2. **Classify** severity accurately based on impact
3. **Structure** reproduction steps in a clear, actionable format
4. **Identify** environment details (OS, browser, versions)
5. **Handle** compound reports with multiple distinct issues
## Tasks
| Task | Difficulty | Max Steps | Description |
|------|-----------|-----------|-------------|
| `easy` | 🟢 Easy | 3 | Single clear bug, all info present but messy |
| `medium` | 🟡 Medium | 4 | Multiple symptoms, ambiguity, partial info |
| `hard` | 🔴 Hard | 5 | Multiple distinct bugs, technical details |
## API Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/reset` | Start a new episode with `{"task_id": "easy\|medium\|hard"}` |
| `POST` | `/step` | Submit structured report, get score + feedback |
| `GET` | `/state` | Get current episode metadata |
| `GET` | `/health` | Health check |
| `GET` | `/docs` | Interactive API documentation |
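As a rough illustration of how these endpoints fit together, the sketch below drives one episode: it calls `/reset`, then submits reports to `/step` until the observation reports `done` or the step budget runs out. It assumes responses are the flat observation objects shown under Observation Space. `post_json`, `run_episode`, and the `make_action` callback are hypothetical names, not part of this project.

```python
import json
import urllib.request
from typing import Callable


def post_json(base_url: str, path: str, body: dict) -> dict:
    """Hypothetical helper: POST a JSON body and decode the JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(
    post: Callable[[str, dict], dict],
    task_id: str,
    make_action: Callable[[dict], dict],
) -> list:
    """Reset, then step until the episode ends. Returns the per-step scores."""
    observation = post("/reset", {"task_id": task_id})
    # Assumption: `max_steps` is present at the top level of the observation;
    # fall back to 5, the largest budget in the task table.
    max_steps = observation.get("max_steps", 5)
    scores = []
    while not observation.get("done", False) and len(scores) < max_steps:
        action = make_action(observation)
        observation = post("/step", {"action": action})
        scores.append(observation["score"])
    return scores
```

Against a local server, `post` could be `lambda path, body: post_json("http://localhost:7860", path, body)`. Passing the transport in as a callable keeps the loop testable without a running environment.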
## Action Space
The agent submits a structured bug report as a JSON object via `POST /step`:
```json
{
"action": {
"title": "Clear, concise bug title",
"steps_to_reproduce": "1. Step one\n2. Step two\n...",
"expected_behavior": "What should happen",
"actual_behavior": "What actually happens",
"severity": "low|medium|high|critical",
"environment": "OS, browser, version info",
"additional_notes": "Any other relevant details"
}
}
```
| Field | Type | Description |
|-------|------|-------------|
| `title` | string | Clear, concise summary of the bug |
| `steps_to_reproduce` | string | Numbered step-by-step reproduction instructions |
| `expected_behavior` | string | What the correct behavior should be |
| `actual_behavior` | string | What actually happens (the bug) |
| `severity` | string | One of: `low`, `medium`, `high`, `critical` |
| `environment` | string | OS, browser, version, platform details |
| `additional_notes` | string | Any other relevant information |
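A client may want to validate an action before sending it. The following is a minimal sketch using a plain dataclass; the project's own models live in `models.py` and are not reproduced here. `BugReportAction` and `to_request_body` are hypothetical names, and because the table does not say which fields are optional, the sketch requires all seven.

```python
from dataclasses import asdict, dataclass

ALLOWED_SEVERITIES = ("low", "medium", "high", "critical")


@dataclass
class BugReportAction:
    """Hypothetical client-side mirror of the action fields listed above."""

    title: str
    steps_to_reproduce: str
    expected_behavior: str
    actual_behavior: str
    severity: str
    environment: str
    additional_notes: str

    def __post_init__(self) -> None:
        # Normalise before validating so " High " is accepted as "high".
        self.severity = self.severity.strip().lower()
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"severity must be one of {ALLOWED_SEVERITIES}")

    def to_request_body(self) -> dict:
        # Wrap the fields under "action", matching the JSON shape shown above.
        return {"action": asdict(self)}
```

Rejecting an invalid `severity` locally avoids spending one of the limited steps on a report that cannot earn the severity credit.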
## Observation Space
After each `reset()` or `step()`, the environment returns an observation:
```json
{
"raw_report": "The messy, unstructured bug report text...",
"feedback": "Grading feedback explaining the score",
"score": 0.85,
"field_scores": {
"title": 1.0,
"steps_to_reproduce": 0.75,
"expected_behavior": 0.5,
"actual_behavior": 0.8,
"severity": 1.0,
"environment": 1.0,
"format": 0.83
},
"done": false,
"reward": 0.85,
"step_count": 1,
"task_id": "easy",
"max_steps": 3
}
```
| Field | Type | Description |
|-------|------|-------------|
| `raw_report` | string | The original messy bug report to structure |
| `feedback` | string | Human-readable grading feedback |
| `score` | float | Overall score from 0.0 to 1.0 |
| `field_scores` | dict | Per-field scores (0.0 to 1.0 each) |
| `done` | bool | Whether the episode is complete |
| `reward` | float | Reward signal for this step |
| `step_count` | int | Current step number |
| `task_id` | string | Current task identifier |
| `max_steps` | int | Maximum steps allowed |
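Since each episode allows several steps, an agent can use `field_scores` to decide what to revise next. The sketch below picks out the weakest fields from an observation; `weakest_fields` and its `threshold` default are hypothetical choices for illustration, not something the environment defines.

```python
def weakest_fields(observation: dict, threshold: float = 0.8) -> list:
    """Return the names of fields scoring below `threshold`, lowest first."""
    field_scores = observation.get("field_scores", {})
    low = [(score, name) for name, score in field_scores.items() if score < threshold]
    # Sorting (score, name) pairs orders by score, then by name for ties.
    return [name for score, name in sorted(low)]


sample = {
    "field_scores": {
        "title": 1.0,
        "steps_to_reproduce": 0.75,
        "expected_behavior": 0.5,
        "actual_behavior": 0.8,
        "severity": 1.0,
        "environment": 1.0,
        "format": 0.83,
    }
}
print(weakest_fields(sample))  # ['expected_behavior', 'steps_to_reproduce']
```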
## Scoring
Reports are graded on 7 dimensions, each scored from 0.0 to 1.0:
| Dimension | Weight | What's Evaluated |
|-----------|--------|------------------|
| Title | 15% | Clarity and descriptiveness |
| Steps to Reproduce | 25% | Completeness and specificity |
| Expected Behavior | 15% | Accuracy of expected state |
| Actual Behavior | 15% | Accuracy of reported symptoms |
| Severity | 15% | Correct classification |
| Environment | 10% | Platform/version extraction |
| Format | 5% | Structural completeness |
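The weights above sum to 100%, which suggests the overall score is a weighted sum of the per-dimension scores. The sketch below works through that arithmetic under that assumption; the actual aggregation lives in `graders.py` and may differ. `WEIGHTS` and `overall_score` are hypothetical names, and the keys follow the `field_scores` keys from the sample observation.

```python
# Weights copied from the scoring table, expressed as fractions.
WEIGHTS = {
    "title": 0.15,
    "steps_to_reproduce": 0.25,
    "expected_behavior": 0.15,
    "actual_behavior": 0.15,
    "severity": 0.15,
    "environment": 0.10,
    "format": 0.05,
}


def overall_score(field_scores: dict) -> float:
    """Weighted sum of per-dimension scores; missing dimensions count as 0.0."""
    return sum(weight * field_scores.get(name, 0.0) for name, weight in WEIGHTS.items())
```

Applied to the `field_scores` in the sample observation, this sum comes to roughly 0.82, so the sample's `score` of 0.85 is best read as illustrative.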
**Partial credit** is awarded based on keyword coverage, so you don't need a perfect match to earn points.
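One simple way to read "keyword coverage" is the fraction of expected keywords that appear in the submitted text. The sketch below implements that reading with case-insensitive whole-word matching. Both the matching rule and the name `keyword_coverage` are assumptions; the grader's real logic is in `graders.py`.

```python
import re


def keyword_coverage(text: str, keywords: list) -> float:
    """Fraction of `keywords` found as whole words in `text` (case-insensitive)."""
    if not keywords:
        return 0.0
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    hits = sum(1 for keyword in keywords if keyword.lower() in tokens)
    return hits / len(keywords)


coverage = keyword_coverage(
    "App crashes on login with Chrome",
    ["crashes", "login", "timeout", "chrome"],
)
print(coverage)  # 0.75
```

Under this rule a field that mentions three of four expected keywords earns 0.75 rather than zero, which is the kind of partial credit described above.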
## Quick Start
### Run Locally
```bash
pip install -r requirements.txt
python app.py
# Server runs at http://localhost:7860
```
### Docker
```bash
docker build -t bug-report-env .
docker run -p 7860:7860 bug-report-env
```
### Run Inference
```bash
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token_here"
export ENV_URL="https://your-space.hf.space"
python inference.py
```
## Project Structure
```
├── app.py              # FastAPI server with all endpoints
├── environment.py      # Core environment logic (reset/step/state)
├── models.py           # Pydantic request/response models
├── tasks.py            # Task definitions with ground truth
├── graders.py          # Deterministic grading logic
├── inference.py        # LLM agent inference script
├── openenv.yaml        # OpenEnv environment manifest
├── Dockerfile          # Container definition for HF Spaces
├── requirements.txt    # Python dependencies
└── README.md           # This file
```
## Environment Variables
| Variable | Description | Required |
|----------|-------------|----------|
| `API_BASE_URL` | LLM API base URL | For inference |
| `MODEL_NAME` | LLM model identifier | For inference |
| `HF_TOKEN` | Hugging Face token | For inference |
| `ENV_URL` | Deployed environment URL | For inference |
| `PORT` | Server port (default: 7860) | Optional |
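For illustration, the variables in the table could be read as follows. This is a sketch, not the code in `app.py` or `inference.py`; `load_inference_settings` and `server_port` are hypothetical helpers, and failing fast on a missing variable is a design choice assumed here.

```python
import os
from typing import Mapping, Optional

INFERENCE_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN", "ENV_URL")


def load_inference_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the variables needed for inference, failing fast if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in INFERENCE_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in INFERENCE_VARS}


def server_port(env: Optional[Mapping[str, str]] = None) -> int:
    """Read PORT, falling back to the documented default of 7860."""
    env = os.environ if env is None else env
    return int(env.get("PORT", "7860"))
```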
## Deployment
This environment is designed for deployment on **Hugging Face Spaces** using Docker SDK:
1. Create a new Space on Hugging Face (Docker SDK)
2. Push the project files
3. The Space will build and serve automatically on port 7860
## Technical Details
- **No external dependencies**: Grading is fully deterministic and uses keyword matching, so no LLM is needed server-side
- **Concurrent sessions**: Supports multiple simultaneous agents
- **Reward shaping**: First step gets full score as reward; subsequent steps reward improvement only
- **Runtime**: Well under the 20-minute limit on 2 vCPU / 8GB RAM
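The reward-shaping rule in the list above can be sketched as below. The text does not say whether improvement is measured against the previous step or the best score so far, nor whether regressions are penalised. This sketch assumes the best score so far and clips the reward at zero; `shaped_rewards` is a hypothetical name.

```python
def shaped_rewards(scores: list) -> list:
    """Turn per-step scores into rewards: full score first, then improvement only."""
    rewards = []
    best = None
    for score in scores:
        if best is None:
            # First step: the reward is the full score.
            rewards.append(score)
            best = score
        else:
            # Assumption: only gains over the best score so far are rewarded.
            rewards.append(max(0.0, score - best))
            best = max(best, score)
    return rewards


print(shaped_rewards([0.6, 0.5, 0.85]))  # roughly [0.6, 0.0, 0.25]
```

With this shaping, resubmitting an unchanged or worse report earns nothing, so the rewards over an episode sum to the best score reached.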