
scripts/

Utility scripts bundled with the Headroom repo. Most are one-off operator tools; a few are runnable as part of development workflows.

Reproducing the reconnect storm

repro_codex_replay.py reproduces the multi-agent Codex reconnect/retry storm against a local Headroom proxy (default http://127.0.0.1:8787), as described in wiki/plans/2026-04-17-codex-proxy-runtime-analysis.md under "Latest Correction". Use it to:

  • Regression-check that /livez stays responsive under a cold-start storm.
  • Empirically tune the Unit 4 pre-upstream semaphore default (HEADROOM_ANTHROPIC_PRE_UPSTREAM_CONCURRENCY).
  • Exercise the Codex WS lifecycle + Anthropic HTTP path simultaneously without needing to replay captured production traffic.
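
For readers unfamiliar with the idea behind a pre-upstream semaphore, the following is a minimal sketch of how a bounded semaphore caps the number of requests doing pre-upstream work at once. It is not Headroom's implementation: run_storm, handler, and the sleep standing in for real work are hypothetical names chosen for illustration, and only the general asyncio.Semaphore pattern is shown.

```python
import asyncio


async def run_storm(limit: int, requests: int) -> int:
    """Fire `requests` concurrent handlers and return the peak number
    that were inside the semaphore-guarded section at the same time."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def handler() -> None:
        nonlocal in_flight, peak
        async with sem:  # waits here once `limit` handlers are inside
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for pre-upstream work
            in_flight -= 1

    await asyncio.gather(*(handler() for _ in range(requests)))
    return peak


# A small limit caps concurrency; a very large one is effectively unbounded.
peak_bounded = asyncio.run(run_storm(limit=4, requests=32))
peak_unbounded = asyncio.run(run_storm(limit=10000, requests=32))
print(peak_bounded, peak_unbounded)
```

Tuning the real default amounts to finding the smallest limit at which /livez stays responsive without throttling normal traffic more than necessary.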

Run

# Default: 8 WS + 4 HTTP clients, 30s storm, p99 /livez must stay <= 500ms.
python scripts/repro_codex_replay.py

# Tighter budget, heavier and longer run:
python scripts/repro_codex_replay.py \
    --url http://127.0.0.1:8787 \
    --ws-clients 16 \
    --anthropic-clients 8 \
    --duration 60 \
    --livez-threshold-ms 100

# Dump the full summary as JSON for downstream tooling:
python scripts/repro_codex_replay.py --json

Exit code:

  • 0 — warmup succeeded (or was skipped), storm ran for the requested duration, and /livez p99 stayed under --livez-threshold-ms.
  • 1 — soft assertion failed, proxy unreachable, or unhandled exception. Proxy-unreachable is detected and reported within ~5 seconds.
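
The latency part of the exit-code rule can be sketched as below. This is an illustration under assumptions, not the script's code: the nearest-rank percentile method and the helper names nearest_rank_percentile and exit_code_for are hypothetical, and the inclusive comparison follows the "<= 500ms" comment in the Run examples.

```python
def nearest_rank_percentile(samples_ms, pct):
    """Return the pct-th percentile (integer pct, 1..100) by nearest rank."""
    if not samples_ms:
        raise ValueError("no latency samples collected")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def exit_code_for(samples_ms, threshold_ms=500.0):
    """Map /livez latency samples to 0 (pass) or 1 (soft assertion failed)."""
    p99 = nearest_rank_percentile(samples_ms, 99)
    return 0 if p99 <= threshold_ms else 1


# One slow probe out of 100 does not move p99; two do.
one_outlier = [10.0] * 99 + [900.0]
two_outliers = [10.0] * 98 + [900.0, 900.0]
print(exit_code_for(one_outlier), exit_code_for(two_outliers))
```

With 100 samples, p99 is the 99th-smallest value, so a single outlier is tolerated while a second one trips the threshold.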

Fixtures

The script loads two hand-crafted, fully synthetic JSON fixtures:

  • scripts/fixtures/anthropic_replay_body.json — shape of a large agent reconnect replay /v1/messages?beta=true POST body.
  • scripts/fixtures/codex_response_create_frame.json — first Codex WS frame with the {"type": "response.create", "response": {...}} envelope.

Override via --ws-frame-fixture / --anthropic-body-fixture if you have captured traffic to replay instead.
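
Before passing captured traffic via --ws-frame-fixture, it can help to confirm that the file has the same outer envelope as the bundled fixture. The sketch below checks only the two top-level keys named above. The function name check_response_create_frame is hypothetical, the contents of "response" are left unvalidated because this README does not specify them, and the "model" key in the sample is a placeholder.

```python
import json


def check_response_create_frame(raw: str) -> dict:
    """Parse a candidate first WS frame and verify its outer envelope."""
    frame = json.loads(raw)
    if not isinstance(frame, dict):
        raise ValueError("frame must be a JSON object")
    if frame.get("type") != "response.create":
        raise ValueError(f"unexpected frame type: {frame.get('type')!r}")
    if not isinstance(frame.get("response"), dict):
        raise ValueError('"response" must be a JSON object')
    return frame


# Inline sample; in practice read the text from the fixture file instead.
sample = '{"type": "response.create", "response": {"model": "placeholder"}}'
frame = check_response_create_frame(sample)
print(frame["type"])
```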

Interpretation

  • /livez p99 under threshold means the event loop is not starved during the storm. If p99 rises above the threshold with the semaphore effectively unbounded (HEADROOM_ANTHROPIC_PRE_UPSTREAM_CONCURRENCY=10000) and falls back below it with the default value, Unit 4's backpressure is working.
  • Codex WS: opened should equal --ws-clients. response.completed typically stays low when upstream auth isn't configured locally — the goal is handshake + relay wiring, not real upstream traffic.
  • Anthropic HTTP: ok_2xx + non_2xx + timed_out + errors should roughly equal attempted. Sustained non-zero timed_out during the storm is the failure signal the plan targets.
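
The Anthropic HTTP accounting check above can be sketched as a small helper. The counter names come from the bullet, but the flat dictionary layout, the function names, and the tolerance parameter are assumptions: the shape of the --json summary is not documented here, and "roughly equal" is read as allowing a few requests that were still in flight when the storm ended.

```python
SETTLED_KEYS = ("ok_2xx", "non_2xx", "timed_out", "errors")


def accounting_gap(counters: dict) -> int:
    """Return attempted minus the sum of all settled outcomes."""
    settled = sum(counters[key] for key in SETTLED_KEYS)
    return counters["attempted"] - settled


def interpret_http(counters: dict, tolerance: int = 4) -> list:
    """Return human-readable findings; an empty list means nothing stood out."""
    findings = []
    gap = accounting_gap(counters)
    if abs(gap) > tolerance:
        findings.append(f"accounting gap of {gap} exceeds tolerance {tolerance}")
    if counters["timed_out"] > 0:
        findings.append(f"{counters['timed_out']} request(s) timed out")
    return findings


# Hypothetical counters for illustration only, not real measurements.
example = {"attempted": 120, "ok_2xx": 100, "non_2xx": 10,
           "timed_out": 6, "errors": 2}
print(accounting_gap(example), interpret_http(example))
```

A single timeout is not necessarily a failure; the signal to watch for is timed_out staying non-zero across the storm.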

A smoke test at tests/test_scripts/test_repro_codex_replay_smoke.py exercises the script against a mock FastAPI server on every PR.

Install scripts

  • install.sh — POSIX installer.
  • install.ps1 — Windows PowerShell installer.

These are generated by the release pipeline; edit with care.