
title: Bug Report Structuring Env
emoji: 🐛
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false

Bug Report Structuring Environment

An OpenEnv environment that challenges LLM agents to convert messy, unstructured bug reports into well-organized, structured formats.

Overview

Bug reports in the wild are often poorly written: they omit steps, describe symptoms ambiguously, carry the wrong severity label, and scatter technical details throughout the text. This environment tests an LLM agent's ability to:

Extract key information from noisy text
Classify severity accurately based on impact
Structure reproduction steps in a clear, actionable format
Identify environment details (OS, browser, versions)
Handle compound reports with multiple distinct issues

Tasks

Task	Difficulty	Max Steps	Description
`easy`	🟢 Easy	3	Single clear bug, all info present but messy
`medium`	🟡 Medium	4	Multiple symptoms, ambiguity, partial info
`hard`	🔴 Hard	5	Multiple distinct bugs, technical details
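
The step budgets in the table can be read as an upper bound on an episode loop. The sketch below is illustrative only: `run_episode`, `reset`, `step`, and `policy` are hypothetical names, and the callables stand in for whatever client and agent are actually used.

```python
from typing import Callable

# Max steps per task, copied from the table above.
MAX_STEPS = {"easy": 3, "medium": 4, "hard": 5}


def run_episode(
    task_id: str,
    reset: Callable[[str], dict],
    step: Callable[[dict], dict],
    policy: Callable[[dict], dict],
) -> dict:
    """Run one episode, stopping when the environment reports done
    or the task's step budget is used up. Returns the last observation."""
    observation = reset(task_id)
    for _ in range(MAX_STEPS[task_id]):
        # The policy turns the latest observation into a structured report.
        observation = step(policy(observation))
        if observation.get("done"):
            break
    return observation
```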

API Endpoints

Method	Endpoint	Description
`POST`	`/reset`	Start a new episode with `{"task_id": "easy|medium|hard"}`
`POST`	`/step`	Submit structured report, get score + feedback
`GET`	`/state`	Get current episode metadata
`GET`	`/health`	Health check
`GET`	`/docs`	Interactive API documentation
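
A minimal client sketch for the two `POST` endpoints, using only the Python standard library. The helper names (`reset_payload`, `step_payload`, `post_json`) and the default base URL are assumptions for illustration. The request bodies follow the shapes documented in this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: server started locally


def reset_payload(task_id: str) -> dict:
    """Body for POST /reset; task_id must be one of the documented tasks."""
    if task_id not in ("easy", "medium", "hard"):
        raise ValueError(f"unknown task_id: {task_id!r}")
    return {"task_id": task_id}


def step_payload(report: dict) -> dict:
    """Body for POST /step; the report is wrapped under the "action" key."""
    return {"action": report}


def post_json(path: str, payload: dict, base_url: str = BASE_URL) -> dict:
    """POST a JSON body to the server and decode the JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `post_json("/reset", reset_payload("easy"))` would start an episode and `post_json("/step", step_payload(report))` would submit a report.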

Action Space

The agent submits a structured bug report as a JSON object via POST /step:

{
  "action": {
    "title": "Clear, concise bug title",
    "steps_to_reproduce": "1. Step one\n2. Step two\n...",
    "expected_behavior": "What should happen",
    "actual_behavior": "What actually happens",
    "severity": "low|medium|high|critical",
    "environment": "OS, browser, version info",
    "additional_notes": "Any other relevant details"
  }
}

Field	Type	Description
`title`	string	Clear, concise summary of the bug
`steps_to_reproduce`	string	Numbered step-by-step reproduction instructions
`expected_behavior`	string	What the correct behavior should be
`actual_behavior`	string	What actually happens (the bug)
`severity`	string	One of: `low`, `medium`, `high`, `critical`
`environment`	string	OS, browser, version, platform details
`additional_notes`	string	Any other relevant information
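
On the agent side, the action can be assembled and checked before it is sent. The sketch below uses a standard-library dataclass. The class name `BugReportAction` is hypothetical, and the project's own `models.py` (described as Pydantic models) may validate differently.

```python
from dataclasses import asdict, dataclass

SEVERITIES = ("low", "medium", "high", "critical")


@dataclass
class BugReportAction:
    title: str
    steps_to_reproduce: str
    expected_behavior: str
    actual_behavior: str
    severity: str
    environment: str
    additional_notes: str

    def __post_init__(self) -> None:
        # Normalise case and whitespace, then check the four documented values.
        self.severity = self.severity.strip().lower()
        if self.severity not in SEVERITIES:
            raise ValueError(f"severity must be one of {SEVERITIES}")

    def to_payload(self) -> dict:
        """Wrap the fields under "action", the shape shown for POST /step."""
        return {"action": asdict(self)}
```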

Observation Space

After each reset() or step(), the environment returns an observation:

{
  "raw_report": "The messy, unstructured bug report text...",
  "feedback": "Grading feedback explaining the score",
  "score": 0.85,
  "field_scores": {
    "title": 1.0,
    "steps_to_reproduce": 0.75,
    "expected_behavior": 0.5,
    "actual_behavior": 0.8,
    "severity": 1.0,
    "environment": 1.0,
    "format": 0.83
  },
  "done": false,
  "reward": 0.85,
  "step_count": 1,
  "task_id": "easy",
  "max_steps": 3
}

Field	Type	Description
`raw_report`	string	The original messy bug report to structure
`feedback`	string	Human-readable grading feedback
`score`	float	Overall score from 0.0 to 1.0
`field_scores`	dict	Per-field scores (0.0–1.0 each)
`done`	bool	Whether the episode is complete
`reward`	float	Reward signal for this step
`step_count`	int	Current step number
`task_id`	string	Current task identifier
`max_steps`	int	Maximum steps allowed
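
An agent can use the observation to decide whether to keep going and which fields to revise next. The helpers below are a sketch under assumptions: the function names are hypothetical, and `field_scores` is treated as possibly empty or missing (for example right after a reset), which the README does not specify.

```python
def weakest_fields(observation: dict, limit: int = 2) -> list[str]:
    """Return the names of the lowest-scoring fields, worst first."""
    scores = observation.get("field_scores") or {}
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [name for name, _ in ranked[:limit]]


def should_continue(observation: dict) -> bool:
    """True while the episode is still open and steps remain."""
    return (
        not observation["done"]
        and observation["step_count"] < observation["max_steps"]
    )


# The sample observation from above, reduced to the keys used here.
sample = {
    "field_scores": {
        "title": 1.0,
        "steps_to_reproduce": 0.75,
        "expected_behavior": 0.5,
        "actual_behavior": 0.8,
        "severity": 1.0,
        "environment": 1.0,
        "format": 0.83,
    },
    "done": False,
    "step_count": 1,
    "max_steps": 3,
}
```

For the sample, `weakest_fields(sample)` picks out `expected_behavior` and `steps_to_reproduce`, which is where a second attempt would most plausibly gain points.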

Scoring

Reports are graded on 7 dimensions (each 0.0–1.0):

Dimension	Weight	What's Evaluated
Title	15%	Clarity and descriptiveness
Steps to Reproduce	25%	Completeness and specificity
Expected Behavior	15%	Accuracy of expected state
Actual Behavior	15%	Accuracy of reported symptoms
Severity	15%	Correct classification
Environment	10%	Platform/version extraction
Format	5%	Structural completeness
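
One plausible reading of the weights is that the overall score is a weighted sum of the per-dimension scores. The sketch below assumes that formula, which this README does not state explicitly, and uses the key names from the `field_scores` example. `overall_score` is a hypothetical name, not necessarily what `graders.py` defines.

```python
# Weights from the table above, keyed by the field_scores names.
WEIGHTS = {
    "title": 0.15,
    "steps_to_reproduce": 0.25,
    "expected_behavior": 0.15,
    "actual_behavior": 0.15,
    "severity": 0.15,
    "environment": 0.10,
    "format": 0.05,
}


def overall_score(field_scores: dict) -> float:
    """Weighted sum of per-dimension scores; missing dimensions count as 0."""
    return sum(
        weight * field_scores.get(name, 0.0)
        for name, weight in WEIGHTS.items()
    )
```

Applied to the `field_scores` in the sample observation, this formula gives about 0.82, so the sample's `score` of 0.85 is best read as illustrative.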

Partial credit is awarded based on keyword coverage, so a report does not need to match the reference exactly to earn points.
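
To make keyword coverage concrete, here is a minimal sketch of one way such a score could be computed. It is not the logic in `graders.py`, which is not shown here. It assumes single-word keywords and case-insensitive whole-word matching, and `keyword_coverage` is a hypothetical name.

```python
import re


def keyword_coverage(text: str, keywords: list[str]) -> float:
    """Fraction of reference keywords that occur in the text as whole words."""
    if not keywords:
        return 0.0
    # Lowercase the text and split it into alphanumeric tokens.
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    hits = sum(1 for keyword in keywords if keyword.lower() in tokens)
    return hits / len(keywords)
```

Under this scheme a field that mentions three of four reference keywords scores 0.75 rather than 0.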

Quick Start

Run Locally

pip install -r requirements.txt
python app.py
# Server runs at http://localhost:7860

Docker

docker build -t bug-report-env .
docker run -p 7860:7860 bug-report-env

Run Inference

export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token_here"
export ENV_URL="https://your-space.hf.space"

python inference.py

Project Structure

├── app.py              # FastAPI server with all endpoints
├── environment.py      # Core environment logic (reset/step/state)
├── models.py           # Pydantic request/response models
├── tasks.py            # Task definitions with ground truth
├── graders.py          # Deterministic grading logic
├── inference.py        # LLM agent inference script
├── openenv.yaml        # OpenEnv environment manifest
├── Dockerfile          # Container definition for HF Spaces
├── requirements.txt    # Python dependencies
└── README.md           # This file

Environment Variables

Variable	Description	Required
`API_BASE_URL`	LLM API base URL	For inference
`MODEL_NAME`	LLM model identifier	For inference
`HF_TOKEN`	Hugging Face token	For inference
`ENV_URL`	Deployed environment URL	For inference
`PORT`	Server port (default: 7860)	Optional
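
A small sketch of how a script might read these variables and fail early when one is missing. The function names are hypothetical, and `inference.py` and `app.py` may handle configuration differently.

```python
import os
from typing import Mapping, Optional

INFERENCE_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN", "ENV_URL")


def inference_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the variables needed for inference, raising if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in INFERENCE_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in INFERENCE_VARS}


def server_port(env: Optional[Mapping[str, str]] = None) -> int:
    """Read PORT, falling back to the documented default of 7860."""
    env = os.environ if env is None else env
    return int(env.get("PORT", "7860"))
```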

Deployment

This environment is designed for deployment on Hugging Face Spaces using the Docker SDK:

1. Create a new Space on Hugging Face (Docker SDK)
2. Push the project files
3. The Space will build and serve automatically on port 7860

Technical Details

No external grading services: Grading is fully deterministic and based on keyword matching, so no LLM is needed server-side
Concurrent sessions: Supports multiple simultaneous agents
Reward shaping: The first step receives its full score as the reward; later steps are rewarded only for improvement
Runtime: Well under the 20-minute limit on 2 vCPU / 8 GB RAM
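
The reward shaping rule can be sketched as follows. Two details are assumptions, since the text above does not specify them: improvement is measured against the best score seen so far in the episode, and a step that scores lower earns zero rather than a negative reward. `shaped_rewards` is a hypothetical name.

```python
def shaped_rewards(scores: list[float]) -> list[float]:
    """Turn a sequence of per-step scores into per-step rewards.

    The first step is rewarded with its full score. Each later step is
    rewarded only for the amount by which it beats the best earlier score.
    """
    rewards = []
    best = 0.0
    for score in scores:
        # With best starting at 0.0, the first reward equals the first score.
        rewards.append(max(0.0, score - best))
        best = max(best, score)
    return rewards
```

Under these assumptions the rewards in an episode sum to the best score reached, so resubmitting the same report earns nothing extra.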