Bug Hunting Loop
Objective
Find, validate, and report likely bugs with reproducible evidence instead of filing speculative agent-generated issues.
Trigger
- Schedule: weekly on active modules.
- Event: error logs spike, flaky tests cluster, user reports mention the same behavior, or a release branch opens.
- Manual bootstrap/debug command: "hunt for reproducible bugs in this module."
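The event triggers above can be checked mechanically before a hunt starts. The following is a minimal sketch, not a prescribed implementation: the `ModuleSignals` fields and every threshold (`spike_ratio`, `flaky_min`, `reports_min`) are illustrative assumptions, since the runbook does not define what counts as a spike or a cluster.

```python
from dataclasses import dataclass


@dataclass
class ModuleSignals:
    # All fields are hypothetical names for data the Intake step would collect.
    errors_this_week: int
    errors_weekly_baseline: float
    flaky_tests: int
    similar_user_reports: int
    release_branch_opened: bool


def trigger_reasons(s, spike_ratio=2.0, flaky_min=3, reports_min=2):
    """Return the list of event triggers that fired (empty list: no hunt)."""
    reasons = []
    # Assumed definition of a spike: errors at least spike_ratio times baseline.
    if s.errors_weekly_baseline > 0 and (
        s.errors_this_week >= spike_ratio * s.errors_weekly_baseline
    ):
        reasons.append("error_spike")
    if s.flaky_tests >= flaky_min:
        reasons.append("flaky_cluster")
    if s.similar_user_reports >= reports_min:
        reasons.append("repeated_user_reports")
    if s.release_branch_opened:
        reasons.append("release_branch")
    return reasons
```

Returning the reasons rather than a bare boolean lets the Scout record which signal started the hunt.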
Intake
- Recent errors, flaky tests, issue labels, support snippets, changed files, code ownership, logs, traces, and module documentation.
- Existing bug reports and duplicate issue search.
- Safe reproduction commands and test fixtures.
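One way to approximate the duplicate issue search is fuzzy title matching. This sketch uses the standard-library `difflib.SequenceMatcher`; the function name `find_duplicates` and the `0.75` threshold are assumptions for illustration, and a real tracker's own search would usually be consulted as well.

```python
from difflib import SequenceMatcher


def find_duplicates(candidate_title, existing_titles, threshold=0.75):
    """Return (similarity, title) pairs at or above threshold, best first."""

    def norm(text):
        # Lowercase and collapse whitespace so trivial differences do not matter.
        return " ".join(text.lower().split())

    wanted = norm(candidate_title)
    matches = []
    for title in existing_titles:
        ratio = SequenceMatcher(None, wanted, norm(title)).ratio()
        if ratio >= threshold:
            matches.append((round(ratio, 2), title))
    return sorted(matches, reverse=True)
```

Any non-empty result is a prompt for a closer look, not proof of duplication; the matched titles can be stored as the recorded duplicate search.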
Agents
- Scout: discovers suspicious signals and likely affected code paths.
- Reproducer: attempts minimal reproduction in a safe environment.
- Minimizer: reduces the reproduction to the smallest failing case.
- Fix suggester: proposes a patch only when the cause is clear.
- Reporter: files evidence-backed issues or PRs.
Workspace And Permissions
- Use a branch, worktree, sandbox, or read-only mode depending on the target.
- Allow tests, local fixtures, logs, static search, and non-production reproduction.
- Disallow production data access, destructive fuzzing, speculative mass issue creation, and broad refactors.
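The allow/disallow split above could be enforced with a command allowlist in front of the sandbox. This is a hedged sketch only: the allowed program names and the set of read-only `git` subcommands are assumptions chosen for illustration, not a policy taken from this runbook.

```python
import shlex

# Hypothetical allowlist: test runner and static search tools.
ALLOWED_PROGRAMS = {"pytest", "grep", "rg"}
# git is allowed only for subcommands that do not modify the repository.
GIT_READ_ONLY = {"log", "diff", "show", "blame", "status"}


def is_command_allowed(command):
    """Return True if the command line starts with an allowlisted program."""
    try:
        argv = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar: refuse rather than guess.
        return False
    if not argv:
        return False
    if argv[0] == "git":
        return len(argv) > 1 and argv[1] in GIT_READ_ONLY
    return argv[0] in ALLOWED_PROGRAMS
```

An allowlist fails closed: anything not explicitly listed, including destructive commands, is rejected by default.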
Durable State
- Record checked modules, signals inspected, duplicate-search results, reproduction steps, commands run, expected and actual behavior, traces, screenshots, and the final disposition of each candidate.
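The durable state can be kept as one serializable record per candidate. The sketch below assumes JSON storage and uses hypothetical field names that mirror the items listed above; the actual schema is left to the implementer.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HuntRecord:
    # Field names are illustrative; they mirror the Durable State list.
    module: str
    signals_inspected: list = field(default_factory=list)
    duplicate_search: list = field(default_factory=list)
    reproduction_steps: list = field(default_factory=list)
    commands: list = field(default_factory=list)
    expected_behavior: str = ""
    actual_behavior: str = ""
    artifacts: list = field(default_factory=list)  # paths to traces, screenshots
    disposition: str = "open"

    def to_json(self):
        """Serialize the record so a later run can resume from it."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        """Rebuild a record from the output of to_json."""
        return cls(**json.loads(text))
```

Storing false positives in the same format keeps later runs from re-checking areas that were already cleared.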
Loop Steps
- Discover candidate bug signals from tests, logs, issues, traces, or recent diffs.
- Load ownership docs, existing issues, and prior bug-hunt state.
- Delegate signal discovery, reproduction, minimization, patch proposal, and reporting to the Scout, Reproducer, Minimizer, Fix suggester, and Reporter, respectively.
- Search for duplicates before filing anything.
- Reproduce in the smallest safe environment.
- If root cause is obvious and patch is small, propose a PR with tests.
- Otherwise file a precise issue with evidence and stop.
- Persist false positives and checked areas.
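The decision flow in these steps can be sketched as a small driver. The agents are passed in as plain callables so the sketch stays self-contained; the function name `hunt`, the callable names, and the disposition strings are all assumptions made for illustration.

```python
def hunt(candidates, find_duplicates, reproduce, cause_is_clear, patch_is_small):
    """Return a list of (candidate, disposition) pairs, one per candidate."""
    log = []
    for candidate in candidates:
        # Search for duplicates before filing anything.
        if find_duplicates(candidate):
            disposition = "duplicate"
        # Reproduce in the smallest safe environment.
        elif not reproduce(candidate):
            disposition = "not_reproduced"
        # Propose a patch only for an obvious cause and a small fix.
        elif cause_is_clear(candidate) and patch_is_small(candidate):
            disposition = "patch_proposed"
        # Otherwise file a precise issue with evidence and stop.
        else:
            disposition = "issue_filed"
        # Every outcome is logged, including non-bugs and checked areas.
        log.append((candidate, disposition))
    return log
```

Because every branch appends to the log, false positives and duplicates are persisted alongside real findings.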
Verification Gates
- A bug report includes reproducible steps or a clear trace/log link.
- A patch includes a test that fails without the fix, or a deterministic reproduction, when feasible.
- Duplicate issue search is recorded.
- Expected vs actual behavior is grounded in docs, tests, or product requirements.
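These gates can be run as a checklist before the Reporter files anything. The sketch below assumes a report is a plain dictionary with hypothetical keys; the `test_infeasible_reason` key is an invented way to represent the "when feasible" allowance and is not part of the runbook.

```python
# Sources the runbook accepts as grounding for expected behavior.
GROUNDED_SOURCES = {"docs", "tests", "product requirements"}


def gate_failures(report):
    """Return human-readable gate failures; an empty list means all gates pass."""
    failures = []
    if not (report.get("reproduction_steps") or report.get("trace_link")):
        failures.append("missing reproducible steps or trace/log link")
    if report.get("kind") == "patch" and not (
        report.get("failing_test")
        or report.get("deterministic_repro")
        or report.get("test_infeasible_reason")  # assumed escape hatch
    ):
        failures.append("patch lacks failing test or deterministic reproduction")
    if not report.get("duplicate_search_recorded"):
        failures.append("duplicate issue search not recorded")
    if report.get("expected_behavior_source") not in GROUNDED_SOURCES:
        failures.append("expected behavior not grounded in docs, tests, or requirements")
    return failures
```

Returning all failures at once, rather than stopping at the first, gives the agent a complete list to fix before retrying.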
Budget And Exit
- Max retries: 3 reproduction attempts per candidate.
- Max runtime: 90 minutes per module or signal cluster.
- Stop when a bug is reproduced and reported, a small verified patch is opened, the signal is classified as non-bug, or owner judgment is needed.
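The retry and runtime limits can be enforced with a small budget object. This is a minimal sketch: the class and function names are hypothetical, and the injectable `clock` parameter is an assumption added so the limits can be tested without waiting.

```python
import time

MAX_ATTEMPTS = 3           # reproduction attempts per candidate
MAX_SECONDS = 90 * 60      # runtime per module or signal cluster


def module_deadline(clock=time.monotonic, max_seconds=MAX_SECONDS):
    """Compute the shared deadline once, when work on the module starts."""
    return clock() + max_seconds


class CandidateBudget:
    """Tracks attempts for one candidate against the module-wide deadline."""

    def __init__(self, deadline, max_attempts=MAX_ATTEMPTS, clock=time.monotonic):
        self.deadline = deadline
        self.max_attempts = max_attempts
        self.clock = clock
        self.attempts = 0

    def allow_attempt(self):
        """Return True and count an attempt if both limits still allow one."""
        if self.attempts >= self.max_attempts or self.clock() >= self.deadline:
            return False
        self.attempts += 1
        return True
```

A typical use is `while budget.allow_attempt(): ...` around each reproduction attempt; when it returns False, the candidate is recorded with its current disposition or escalated.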
Escalation
Escalate for production-only bugs, privacy-sensitive logs, ambiguous product behavior, security-sensitive findings, data loss, or broad architectural fixes.
Loop Instruction
Hunt for reproducible bugs in <module, release, or signal cluster>.
Start from concrete signals: failing tests, logs, traces, issues, or recent changes.
Search for duplicates before filing.
Reproduce safely, minimize the case, and report expected vs actual behavior.
Open a patch only when the cause is clear and verification is available.
Example automation: run weekly against modules with recent churn, flaky tests, or repeated user reports.
Failure Modes
- Filing issues from code smell without reproduction.
- Creating duplicate bug reports.
- Using private logs or customer data as public evidence.
- Patching symptoms while leaving the reproduced cause unexplained.
References
- Run long horizon tasks with Codex - Practical plan-edit-test-observe-repair-document-repeat runbook.
- SWE-bench - Benchmark framing around real repository issues and tests.
- Terminal-Bench - Evaluation context for hard terminal tasks and reproducibility.