Masters Telecom Toolkit Runbook

This runbook is the operator surface for local verification, Hugging Face deployment, hosted validation, rollback, and Auth0/runtime drift handling.

Source of truth

  • Architecture: docs/architecture.md
  • HF/Auth0 operator checklist: docs/hf_auth0_operator_checklist.md
  • Repo rules for Codex and contributors: AGENTS.md
  • Canonical eval guidance: docs/evals/README.md
  • Release log: docs/dev/release_log.md

Local verify bar

Run these before merge for meaningful app, auth, retrieval, or deployment changes.

cd backend && bash scripts/test_backend.sh
cd frontend && npx tsc -p tsconfig.json --noEmit
cd frontend && npm run build

Run the stronger bar for backend contracts, auth, retrieval logic, model/env changes, or deploy workflow edits.

cd backend && bash scripts/test_backend.sh --full
cd backend && bash scripts/release_gate.sh

Local canary bootstrap

Use this when working from the canary stabilization worktree instead of the main worktree.

Minimum local prerequisites:

  1. Root env for backend gates:
ln -s /path/to/main/.env.codex /path/to/canary/.env.codex

backend/scripts/release_gate.sh auto-sources .env.codex from the canary worktree root. If the file is missing, the gate fails early on the missing OPENAI_API_KEY.

  2. Shared Router RAG corpus path:
ln -s /path/to/main/_RAG_Ready_KB_Organized /path/to/canary/_RAG_Ready_KB_Organized

Without a local corpus path, backend verify and router RAG smoke can fail on missing rag_ingestion_chunks.jsonl.

  3. Hosted smoke env for the canary frontend:

Create frontend/.env.e2e with at least:

E2E_DISABLE_WEBSERVER=true
E2E_BASE_URL=https://crazycrazypete-masters-four-tab-openai-canary.hf.space
E2E_AUTH0_DOMAIN=<your-auth0-domain>
E2E_AUTH_TEST_EMAIL=<test-user-email>
E2E_AUTH_TEST_PASSWORD=<test-user-password>

frontend/scripts/run-hosted-smoke.sh checks for .env.e2e unless E2E_ENV_FILE is explicitly set.
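
The three prerequisites above can be checked in one pass before running any gate. This is a minimal sketch, not something that exists in the repo: missing_canary_prereqs is a hypothetical helper. It only confirms that the paths exist and that frontend/.env.e2e names the five keys listed above; it does not validate their values or look inside the corpus directory.

```python
from pathlib import Path

# Keys this runbook lists for frontend/.env.e2e.
REQUIRED_E2E_KEYS = (
    "E2E_DISABLE_WEBSERVER",
    "E2E_BASE_URL",
    "E2E_AUTH0_DOMAIN",
    "E2E_AUTH_TEST_EMAIL",
    "E2E_AUTH_TEST_PASSWORD",
)


def missing_canary_prereqs(worktree: Path) -> list[str]:
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for illustration. Path names come from the runbook.
    """
    problems = []
    # exists()/is_dir() follow symlinks, so a dangling link counts as missing.
    if not (worktree / ".env.codex").exists():
        problems.append("missing .env.codex")
    if not (worktree / "_RAG_Ready_KB_Organized").is_dir():
        problems.append("missing _RAG_Ready_KB_Organized")

    e2e = worktree / "frontend" / ".env.e2e"
    if not e2e.is_file():
        problems.append("missing frontend/.env.e2e")
        return problems

    # Collect the KEY part of every KEY=value line, skipping comments.
    defined = set()
    for line in e2e.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            defined.add(line.split("=", 1)[0].strip())
    for key in REQUIRED_E2E_KEYS:
        if key not in defined:
            problems.append(f"frontend/.env.e2e lacks {key}")
    return problems
```

Calling missing_canary_prereqs(Path("/path/to/canary")) from any directory returns the outstanding items, which makes it easy to run before bash scripts/release_gate.sh.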

Deploy path

Use .github/workflows/deploy-hf-gated.yml as the supported deploy path.

Pre-deploy expectations:

  • main contains the intended release.
  • Repo/workflow model defaults still align on gpt-5-mini.
  • Auth0 audience placeholders are absent.
  • Required GitHub repository secrets are present:
    • HF_TOKEN
    • HF_SPACE_ID
    • HF_SPACE_ID_CANARY if using canary
    • HF_USERNAME if needed
    • OPENAI_API_KEY
  • The gated deploy workflow actively clears forbidden legacy audience envs (AUTH0_AUDIENCE, VITE_AUTH0_AUDIENCE) from HF targets before stamping build metadata.
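
For a manual pre-deploy review, the secret and legacy-audience expectations above reduce to a small check. The sketch below is illustrative only: predeploy_problems is a hypothetical helper, and it assumes the operator has already gathered the configured secret names and the target Space's variables by some other means. HF_USERNAME is left out because the list above marks it as needed only in some setups.

```python
# Names taken from the pre-deploy list in this runbook.
REQUIRED_SECRETS = ("HF_TOKEN", "HF_SPACE_ID", "OPENAI_API_KEY")
FORBIDDEN_SPACE_VARS = ("AUTH0_AUDIENCE", "VITE_AUTH0_AUDIENCE")


def predeploy_problems(secret_names, space_vars, use_canary=True):
    """Return pre-deploy problems; an empty list means nothing was found.

    secret_names: iterable of GitHub repository secret names that are set.
    space_vars:   mapping of HF Space variable name -> current value.
    Hypothetical helper; how the inputs are collected is up to the operator.
    """
    required = list(REQUIRED_SECRETS)
    if use_canary:
        required.append("HF_SPACE_ID_CANARY")

    present = set(secret_names)
    problems = [f"secret not set: {name}" for name in required if name not in present]

    # A blank value is treated as absent, matching the "blank unless real" rule.
    for name in FORBIDDEN_SPACE_VARS:
        if space_vars.get(name, "").strip():
            problems.append(f"forbidden legacy audience set: {name}")
    return problems
```

The gated workflow already clears the forbidden variables itself, so this check matters mostly when a Space was edited by hand.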

Deploy sequence:

  1. GitHub Actions runs auth/security tests, router regressions, and the RAG quality gate.
  2. Canary deploy pushes to the HF canary Space and stamps build metadata.
  3. Hosted validator checks /build-info, plus /api/health when it is accessible, for the expected build, startup integrity, and auth posture.
  4. Production deploy pushes the same ref to the production Space.
  5. Hosted validator repeats against production.
  6. Operator re-runs the minimal hosted smoke bundle on both production and canary:
cd frontend && ./scripts/run-hosted-smoke.sh
  7. Operator completes the HF/Auth0 checklist and signs off.
  8. Operator appends the deploy result to docs/dev/release_log.md.

If you do a manual HF git push for canary-only testing instead of the gated workflow, you must also restamp the canary Space variables:

  • MASTERS_TOOLKIT_BUILD_VERSION
  • MASTERS_TOOLKIT_GIT_SHA

Otherwise /build-info can report stale metadata even when the runtime SHA is current.
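
One way to restamp after a manual push is through the huggingface_hub client, whose HfApi.add_space_variable method sets a Space variable. The sketch below is an assumption about how an operator might do this by hand, not a description of what the gated workflow does internally. restamp_canary is a hypothetical helper that takes the client as a parameter, so it can be exercised without network access.

```python
def restamp_canary(api, space_id, build_version, git_sha):
    """Set the two build-metadata variables on a Space and return them.

    api is expected to be a huggingface_hub.HfApi instance (third-party
    package), or any object with a compatible add_space_variable method.
    Hypothetical helper for illustration.
    """
    stamps = {
        "MASTERS_TOOLKIT_BUILD_VERSION": build_version,
        "MASTERS_TOOLKIT_GIT_SHA": git_sha,
    }
    for key, value in stamps.items():
        # add_space_variable creates the variable or overwrites its value.
        api.add_space_variable(repo_id=space_id, key=key, value=value)
    return stamps
```

With huggingface_hub installed and a write token available, the call would look like restamp_canary(HfApi(token=...), "<owner>/<canary-space>", "release-...", "<sha>"). Re-run hosted validation afterwards to confirm /build-info reports the new values.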

Hosted validation

Use this after any deploy, manual Space config change, or suspected HF/Auth0 drift.

python backend/scripts/validate_hosted_runtime.py \
  --base-url https://your-space.hf.space \
  --expected-build-version release-... \
  --expected-git-sha <sha> \
  --expect-auth-required true \
  --expect-auth-enabled true

What it validates:

  • hosted build version and git SHA
  • startup integrity status
  • /api/health overall status when the endpoint is publicly reachable
  • auth-required and auth-enabled posture
  • no effective use of forbidden legacy audiences
  • no forbidden Auth0 placeholder warning/detail text
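
Because the same validation runs against canary and production, it can help to build the command once and vary only the base URL. A minimal sketch, using only the flags shown above: validator_argv is a hypothetical wrapper, and passing "false" to the two auth flags is an assumption, since this runbook only shows them set to true.

```python
import sys


def validator_argv(base_url, build_version, git_sha,
                   auth_required=True, auth_enabled=True):
    """Build the argv list for validate_hosted_runtime.py.

    Hypothetical wrapper; run the result from the repo root. The "false"
    spelling for the auth flags is assumed, not confirmed by the runbook.
    """
    def flag(value):
        return "true" if value else "false"

    return [
        sys.executable,
        "backend/scripts/validate_hosted_runtime.py",
        "--base-url", base_url,
        "--expected-build-version", build_version,
        "--expected-git-sha", git_sha,
        "--expect-auth-required", flag(auth_required),
        "--expect-auth-enabled", flag(auth_enabled),
    ]
```

An operator could then loop over the lane URLs and call subprocess.run(validator_argv(url, version, sha), check=True) for each, so that a failed lane stops the loop with a non-zero exit.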

Release log

After every successful deploy or rollback, add one entry to docs/dev/release_log.md.

Each entry should include:

  • date
  • GitHub Actions run id
  • deployed git SHA
  • build version
  • target lanes used: canary, production, or both
  • hosted validator result
  • hosted smoke result
  • any manual HF/Auth0 config changes made during the release
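
To keep entries uniform, the fields above can be captured in a small record and rendered as text. This is a sketch under stated assumptions: the ReleaseEntry class and its layout are hypothetical, and the actual format used in docs/dev/release_log.md may differ, so adapt the rendering to match the existing entries.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ReleaseEntry:
    """One release-log entry with the fields this runbook asks for.

    Hypothetical structure; the rendered layout is an assumption.
    """
    day: date
    run_id: str
    git_sha: str
    build_version: str
    lanes: str               # "canary", "production", or "both"
    validator_result: str
    smoke_result: str
    manual_changes: str = "none"

    def render(self) -> str:
        # One heading line with the date, then one bullet per field.
        return "\n".join([
            self.day.isoformat(),
            f"  • GitHub Actions run id: {self.run_id}",
            f"  • deployed git SHA: {self.git_sha}",
            f"  • build version: {self.build_version}",
            f"  • target lanes: {self.lanes}",
            f"  • hosted validator result: {self.validator_result}",
            f"  • hosted smoke result: {self.smoke_result}",
            f"  • manual HF/Auth0 config changes: {self.manual_changes}",
        ])
```

Defaulting manual_changes to "none" makes the field explicit in every entry, so a missing note cannot be confused with an unrecorded change.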

Rollback

Use the same deploy workflow with workflow_dispatch and rollback_ref.

Rollback sequence:

  1. Choose the last known-good commit or tag.
  2. Run Deploy HF (Gated Canary + Rollback) with rollback_ref=<known-good-ref>.
  3. Let canary validate first unless production recovery requires skip_canary=true.
  4. Re-run hosted validation against the rollback target.
  5. Re-run the HF/Auth0 operator checklist before declaring recovery complete.
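
Step 2 can be triggered from the GitHub UI or from the GitHub CLI, whose gh workflow run command accepts workflow_dispatch inputs through -f key=value. The sketch below only builds the command. rollback_dispatch_argv is a hypothetical helper, it assumes gh is installed and authenticated for the repo, and it assumes skip_canary is passed as the string true.

```python
def rollback_dispatch_argv(rollback_ref, skip_canary=False):
    """Build a `gh workflow run` command for a rollback dispatch.

    Hypothetical helper. The workflow file and input names come from this
    runbook; the value format for skip_canary is an assumption.
    """
    if not rollback_ref:
        raise ValueError("rollback_ref must name a known-good commit or tag")
    argv = [
        "gh", "workflow", "run", "deploy-hf-gated.yml",
        "-f", f"rollback_ref={rollback_ref}",
    ]
    # Only skip canary when production recovery cannot wait for it.
    if skip_canary:
        argv += ["-f", "skip_canary=true"]
    return argv
```

Printing the list before executing it with subprocess.run gives the operator a last chance to confirm the ref during an incident.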

Auth0 drift response

Use this when login fails, callback behavior changes, or hosted auth diagnostics report config issues.

Check in order:

  1. APP_BASE_URL matches the intended canonical host.
  2. VITE_APP_BASE_URL matches the same canonical host.
  3. AUTH0_DOMAIN, AUTH0_CLIENT_ID, and VITE_AUTH0_CLIENT_ID are correct for the SPA app.
  4. AUTH0_AUDIENCE and VITE_AUTH0_AUDIENCE are blank unless a real API Identifier exists.
  5. Auth0 allowed callback, logout, and web origin lists include the deployed host.
  6. Hosted Space was rebuilt after any frontend auth env change.
  7. Credentialed auth smoke still succeeds on the intended target.
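
Checks 1 to 4 are mechanical and can be run against a copy of the Space variables. The following is a minimal sketch with a hypothetical helper, auth_env_problems. It compares URL origins only, and it treats a mismatch between AUTH0_CLIENT_ID and VITE_AUTH0_CLIENT_ID as a problem, which is an assumption that both should point at the same SPA app. Checks 5 to 7 still need the Auth0 dashboard and a live smoke run.

```python
from urllib.parse import urlsplit


def origin(url):
    """Reduce a URL to scheme://host[:port], lowercased."""
    parts = urlsplit(url.strip())
    return f"{parts.scheme}://{parts.netloc}".lower()


def auth_env_problems(env, canonical_host):
    """Return drift problems found in env for the given canonical host.

    env is a mapping of variable name -> value. Hypothetical helper.
    """
    problems = []
    expected = origin(canonical_host)
    for name in ("APP_BASE_URL", "VITE_APP_BASE_URL"):
        if origin(env.get(name, "")) != expected:
            problems.append(f"{name} does not match {expected}")

    # Assumption: backend and frontend should use the same SPA client ID.
    if env.get("AUTH0_CLIENT_ID", "") != env.get("VITE_AUTH0_CLIENT_ID", ""):
        problems.append("AUTH0_CLIENT_ID and VITE_AUTH0_CLIENT_ID differ")

    for name in ("AUTH0_AUDIENCE", "VITE_AUTH0_AUDIENCE"):
        if env.get(name, "").strip():
            problems.append(f"{name} is set; confirm a real API Identifier exists")
    return problems
```

Comparing origins rather than raw strings avoids false alarms from a trailing slash or a difference in letter case.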

Model or env drift response

Use this when hosted behavior does not match repo decisions.

Check in order:

  1. GitHub workflow env pins still use gpt-5-mini.
  2. Hugging Face Space variables do not override repo defaults with stale model values.
  3. Build metadata on /build-info matches the expected deploy ref.
  4. /build-info shows clean startup and auth posture, and the reported APP_BASE_URL / VITE_APP_BASE_URL origins match the deployed host.
  5. /api/health reports the same clean startup and auth posture when accessed with credentials.
  6. If hosted envs changed outside git, capture the change in docs/dev/decisions.md or docs/dev/open_tasks.md.
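
Checks 1 and 2 amount to looking for model names other than gpt-5-mini in workflow files or exported Space variables. The sketch below is a rough heuristic, not a repo tool: stale_model_pins is a hypothetical helper that only recognizes identifiers starting with gpt-, so other model families or indirect references would go unnoticed.

```python
import re

# Matches identifiers such as gpt-5-mini or gpt-4.1-mini, without trailing
# punctuation. Only the gpt- prefix is recognized (a deliberate limitation).
MODEL_PATTERN = re.compile(r"gpt-[A-Za-z0-9]+(?:[.\-][A-Za-z0-9]+)*")


def stale_model_pins(text, expected="gpt-5-mini"):
    """Return the sorted, de-duplicated model names in text that differ
    from the expected pin. Hypothetical helper for illustration.
    """
    found = set(MODEL_PATTERN.findall(text))
    return sorted(name for name in found if name != expected)
```

Feeding it the contents of .github/workflows/deploy-hf-gated.yml, or a dump of the Space variables, gives a quick first answer before a line-by-line review.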

Incidents

Use these symptoms as the first split:

  • Login/callback failure: Auth0 drift or stale frontend env bundle.
  • Wrong model behavior: hosted env override or stale workflow pin.
  • Startup green but answers wrong: retrieval/model regression; run backend verify plus targeted evals.
  • Startup not green: check /build-info, authenticated /api/health when available, startup integrity, and required assets.

When to stop and escalate

  • Any Auth0 client/app change not already documented.
  • Any deploy workflow or secret-management change that could affect both canary and production.
  • Any model default change away from gpt-5-mini.
  • Any proposal to weaken guardrails, provenance, or timeout behavior.