Bug Hunting Loop

Objective

Find, validate, and report likely bugs with reproducible evidence instead of filing speculative agent-generated issues.

Trigger

Schedule: weekly on active modules.
Event: error logs spike, flaky tests cluster, user reports mention the same behavior, or a release branch opens.
Manual bootstrap/debug command: "hunt for reproducible bugs in this module."
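
The trigger conditions above can be sketched as a small predicate. This is an illustration only: the `Signals` fields and every threshold (a doubled error rate, three flaky tests, two matching reports) are assumptions, since the runbook does not define what counts as a spike or a cluster.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical snapshot of one module's signals; all fields are assumptions."""
    days_since_last_hunt: int
    error_rate: float            # errors per hour in the current window
    baseline_error_rate: float   # errors per hour over a trailing baseline
    flaky_tests: int             # flaky tests clustered in this module
    similar_user_reports: int    # user reports describing the same behavior
    release_branch_opened: bool


def trigger_reasons(s: Signals) -> list[str]:
    """Return the reasons a hunt should start; an empty list means do not run."""
    reasons = []
    if s.days_since_last_hunt >= 7:
        reasons.append("weekly schedule")
    # The thresholds below are placeholders, not values from the runbook.
    if s.baseline_error_rate > 0 and s.error_rate >= 2 * s.baseline_error_rate:
        reasons.append("error log spike")
    if s.flaky_tests >= 3:
        reasons.append("flaky test cluster")
    if s.similar_user_reports >= 2:
        reasons.append("repeated user reports")
    if s.release_branch_opened:
        reasons.append("release branch opened")
    return reasons
```

Returning the reasons rather than a bare boolean lets the run record why it started.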

Intake

Recent errors, flaky tests, issue labels, support snippets, changed files, code ownership, logs, traces, and module documentation.
Existing bug reports and the results of a duplicate-issue search.
Safe reproduction commands and test fixtures.
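
One possible way to approximate the duplicate issue search is fuzzy matching on issue titles. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The function name and the 0.6 threshold are assumptions, and a real tracker's search API would likely do a better job.

```python
from difflib import SequenceMatcher


def likely_duplicates(title: str, existing_titles: list[str], threshold: float = 0.6) -> list[str]:
    """Return existing titles whose similarity ratio to `title` meets the threshold."""
    wanted = title.lower()
    return [
        t for t in existing_titles
        # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
        if SequenceMatcher(None, wanted, t.lower()).ratio() >= threshold
    ]
```

Any match should be reviewed before it is treated as a duplicate. Title similarity only narrows the search.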

Agents

Scout: discovers suspicious signals and likely affected code paths.
Reproducer: attempts minimal reproduction in a safe environment.
Minimizer: reduces the reproduction to the smallest failing case.
Fix suggester: proposes a patch only when the cause is clear.
Reporter: files evidence-backed issues or PRs.

Workspace And Permissions

Use a branch, worktree, sandbox, or read-only mode depending on the target.
Allow tests, local fixtures, logs, static search, and non-production reproduction.
Disallow production data access, destructive fuzzing, speculative mass issue creation, and broad refactors.
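
The allow and disallow lists above could be enforced with a default-deny check, sketched here. The action names are hypothetical labels for the categories listed above, not identifiers from any real tool.

```python
# Hypothetical action labels mirroring the allowed and disallowed categories.
ALLOWED = {
    "run_tests",
    "load_fixture",
    "read_logs",
    "static_search",
    "reproduce_non_production",
}
DISALLOWED = {
    "read_production_data",
    "destructive_fuzzing",
    "mass_issue_creation",
    "broad_refactor",
}


def is_permitted(action: str) -> bool:
    """Default-deny: an action passes only if it is explicitly allowed."""
    if action in DISALLOWED:
        return False
    return action in ALLOWED
```

Unknown actions are rejected, so a new capability has to be added to the allow list deliberately.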

Durable State

Checked modules, signals inspected, duplicate issue search, reproduction steps, commands, expected/actual behavior, traces, screenshots, and final disposition.
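
A minimal sketch of how this state might be persisted, assuming one JSON record per module or signal cluster. The `HuntRecord` name, its field names, and the disposition strings are illustrative assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HuntRecord:
    """Hypothetical durable record for one checked module or signal cluster."""
    module: str
    signals_inspected: list[str] = field(default_factory=list)
    duplicate_search: str = ""       # query used and what it matched
    reproduction_steps: list[str] = field(default_factory=list)
    commands: list[str] = field(default_factory=list)
    expected: str = ""
    actual: str = ""
    artifacts: list[str] = field(default_factory=list)  # paths to traces or screenshots
    disposition: str = "open"        # e.g. "reported", "patched", "non-bug", "escalated"


def dumps(record: HuntRecord) -> str:
    """Serialize a record to stable, human-readable JSON."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def loads(text: str) -> HuntRecord:
    """Rebuild a record from JSON produced by dumps()."""
    return HuntRecord(**json.loads(text))
```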

Loop Steps

Discover candidate bug signals from tests, logs, issues, traces, or recent diffs.
Load ownership docs, existing issues, and prior bug-hunt state.
Delegate signal discovery, reproduction, minimization, patch proposal, and reporting to the Scout, Reproducer, Minimizer, Fix suggester, and Reporter respectively.
Search for duplicates before filing anything.
Reproduce in the smallest safe environment.
If the root cause is obvious and the patch is small, propose a PR with tests.
Otherwise file a precise issue with evidence and stop.
Persist false positives and checked areas.
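
The decision flow in the steps above can be sketched as a single function. It is a simplification under stated assumptions: the three callables are hypothetical stand-ins for the agents, `cause_is_clear` folds together "root cause is obvious" and "patch is small", and minimization and persistence are left out.

```python
from typing import Callable, Optional


def hunt(
    candidates: list[str],
    find_duplicate: Callable[[str], Optional[str]],
    reproduce: Callable[[str], bool],
    cause_is_clear: Callable[[str], bool],
) -> dict[str, str]:
    """Return one disposition per candidate signal."""
    dispositions = {}
    for candidate in candidates:
        # Search for duplicates before doing any other work on the candidate.
        existing = find_duplicate(candidate)
        if existing is not None:
            dispositions[candidate] = f"duplicate of {existing}"
        elif not reproduce(candidate):
            dispositions[candidate] = "not reproduced, recorded as checked"
        elif cause_is_clear(candidate):
            dispositions[candidate] = "propose PR with tests"
        else:
            dispositions[candidate] = "file issue with evidence"
    return dispositions
```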

Verification Gates

A bug report includes reproducible steps or a clear trace/log link.
A patch includes a failing test or deterministic reproduction when feasible.
Duplicate issue search is recorded.
Expected vs actual behavior is grounded in docs, tests, or product requirements.
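
These gates can be checked mechanically before anything is filed. The sketch below assumes a report is a plain dict. Every key name is hypothetical, and the "when feasible" allowance for patches is not modeled.

```python
def failed_gates(report: dict) -> list[str]:
    """Return the gates a draft report or patch fails; empty means it may be filed."""
    failures = []
    if not (report.get("reproduction_steps") or report.get("trace_link")):
        failures.append("no reproducible steps or trace/log link")
    # The patch gate applies only when the draft is a patch.
    if report.get("is_patch") and not (
        report.get("failing_test") or report.get("deterministic_reproduction")
    ):
        failures.append("patch lacks a failing test or deterministic reproduction")
    if not report.get("duplicate_search"):
        failures.append("duplicate issue search not recorded")
    if not report.get("expected_source"):
        failures.append("expected behavior not grounded in docs, tests, or requirements")
    return failures
```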

Budget And Exit

Max retries: 3 reproduction attempts per candidate.
Max runtime: 90 minutes per module or signal cluster.
Stop when a bug is reproduced and reported, a small verified patch is opened, the signal is classified as a non-bug, the retry or runtime budget is exhausted, or owner judgment is needed.
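
A minimal sketch of enforcing both limits around reproduction attempts. The function name and return strings are assumptions. The clock is injectable so the budget logic can be tested without waiting.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 3                 # reproduction attempts per candidate
MAX_RUNTIME_SECONDS = 90 * 60    # per module or signal cluster


def reproduce_within_budget(
    attempt: Callable[[], bool],
    started_at: float,
    now: Callable[[], float] = time.monotonic,
) -> str:
    """Retry `attempt` until it succeeds or a budget runs out."""
    for _ in range(MAX_ATTEMPTS):
        # Check the runtime budget before each attempt, not after.
        if now() - started_at >= MAX_RUNTIME_SECONDS:
            return "runtime budget exhausted"
        if attempt():
            return "reproduced"
    return "retries exhausted"
```

`started_at` is expected to come from the same clock as `now`, taken once when work on the module or cluster begins, so the 90 minutes span all candidates in it.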

Escalation

Escalate for production-only bugs, privacy-sensitive logs, ambiguous product behavior, security-sensitive findings, data loss, or broad architectural fixes.

Loop Instruction

Hunt for reproducible bugs in <module, release, or signal cluster>.
Start from concrete signals: failing tests, logs, traces, issues, or recent changes.
Search for duplicates before filing.
Reproduce safely, minimize the case, and report expected vs actual behavior.
Open a patch only when the cause is clear and verification is available.

Example automation: run weekly against modules with recent churn, flaky tests, or repeated user reports.

Failure Modes

Filing issues from code smell without reproduction.
Creating duplicate bug reports.
Using private logs or customer data as public evidence.
Patching symptoms while leaving the reproduced cause unexplained.
