Spaces:

savetrees
/

bug-triage-openenv

Running

App Files Files Community

bug-triage-openenv / docs /NOT_TO_DO.md

savetrees

Upload folder using huggingface_hub

0135a17 verified 2 days ago

preview code

raw

history blame contribute delete

3.12 kB

NOT_TO_DO.md — Bug Triage Environment Anti-Patterns

Violations here = disqualification or severe score penalty.

Disqualification Risks

Hardcoded grader scores — Grader must evaluate actual agent output against ground truth. If every episode returns the same score, DQ.
Missing endpoints — /tasks, /grader, /baseline are mandatory hackathon endpoints. Missing any = DQ.
Non-running baseline — baseline.py must execute without error. If it crashes on import or at runtime = DQ.
Docker build failure — Dockerfile must build cleanly. Always test docker build before submission.
/reset returns error — This is the first thing evaluators test.

Technical Anti-Patterns

Don't	Why	Do Instead
Expose `ground_truth` in observation	Agent sees answers → invalid training	Only return bug report fields
Use hardcoded bug selection	Same bug every episode → DQ	`random.choice(self._bugs)`
Return reward outside [-0.5, 1.0]	Breaks GRPO training stability	Clamp: `max(-0.5, min(1.0, reward))`
Grader score outside [0.0, 1.0]	Violates spec	`max(0.0, min(1.0, score))`
Skip Pydantic validation	Runtime crashes in production	All models use `BaseModel`
Use mutable default args	Shared state bugs	`Field(default_factory=...)`
Forget `model_config = {"use_enum_values": True}`	Enums serialize as objects instead of strings	Set on BugTriageAction

Bug Triage Specific Don'ts

Don't	Why
Accept any string for `bug_type`	Use `BugType` enum — only 6 valid values
Accept any developer name	Validate against `DEVELOPERS` list
Mix up reward vs grader	Reward = training signal. Grader = eval score. Different ranges.
Build a classifier model	We're building an RL environment, not a model
Train during hackathon eval	Environment must serve episodes, not train

Pre-Submission Validation (Run Before Submitting)

# 1. Health check
curl http://localhost:8000/health
# → {"status":"healthy"}

# 2. Reset all 3 tasks
curl -X POST http://localhost:8000/reset -d '{"task_id":"task_1"}'
curl -X POST http://localhost:8000/reset -d '{"task_id":"task_2"}'
curl -X POST http://localhost:8000/reset -d '{"task_id":"task_3"}'

# 3. Tasks endpoint
curl http://localhost:8000/tasks
# → 3 tasks with action schemas

# 4. Full episode (reset → step → grader)
EP=$(curl -s -X POST http://localhost:8000/reset -d '{"task_id":"task_1"}' | python3 -c "import sys,json; print(json.load(sys.stdin)['episode_id'])")
curl -X POST http://localhost:8000/step -d "{\"episode_id\":\"$EP\",\"action\":{\"task_id\":\"task_1\",\"bug_type\":\"crash\"}}"
curl -X POST http://localhost:8000/grader -d "{\"episode_id\":\"$EP\",\"task_id\":\"task_1\"}"

# 5. Baseline
GEMINI_API_KEY="..." python -m bug_triage_env.baseline --all-tasks --episodes 3

# 6. Docker build
docker build -f bug_triage_env/server/Dockerfile -t bug-triage-env .
docker run -d -p 8000:8000 bug-triage-env
curl http://localhost:8000/health