Masters Telecom Toolkit Runbook
This runbook is the operator surface for local verification, Hugging Face deployment, hosted validation, rollback, and Auth0/runtime drift handling.
Source of truth
- Architecture: docs/architecture.md
- HF/Auth0 operator checklist: docs/hf_auth0_operator_checklist.md
- Repo rules for Codex and contributors: AGENTS.md
- Canonical eval guidance: docs/evals/README.md
- Release log: docs/dev/release_log.md
Local verify bar
Run these before merge for meaningful app, auth, retrieval, or deployment changes.
cd backend && bash scripts/test_backend.sh
cd frontend && npx tsc -p tsconfig.json --noEmit
cd frontend && npm run build
Run the stronger bar for backend contracts, auth, retrieval logic, model/env changes, or deploy workflow edits.
cd backend && bash scripts/test_backend.sh --full
cd backend && bash scripts/release_gate.sh
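The two bars above can be sketched as an ordered command list. This is a minimal illustration, not part of the repo: the helper names `verify_commands` and `run_bar` are hypothetical, and it assumes the stronger bar runs in addition to the base bar rather than replacing it.

```python
import subprocess
from typing import List, Tuple

# (working directory relative to the repo root, argv)
Command = Tuple[str, List[str]]

# Base bar, in the order listed in this runbook.
BASE_BAR: List[Command] = [
    ("backend", ["bash", "scripts/test_backend.sh"]),
    ("frontend", ["npx", "tsc", "-p", "tsconfig.json", "--noEmit"]),
    ("frontend", ["npm", "run", "build"]),
]

# Stronger bar for backend contracts, auth, retrieval, model/env, or deploy edits.
STRONGER_BAR: List[Command] = [
    ("backend", ["bash", "scripts/test_backend.sh", "--full"]),
    ("backend", ["bash", "scripts/release_gate.sh"]),
]


def verify_commands(stronger: bool = False) -> List[Command]:
    """Return the commands for the chosen bar (assumption: stronger = base + extra)."""
    return BASE_BAR + STRONGER_BAR if stronger else list(BASE_BAR)


def run_bar(stronger: bool = False) -> None:
    """Run each command in order and stop at the first failure."""
    for cwd, argv in verify_commands(stronger):
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `run_bar(stronger=True)` from the repo root would run all five commands and raise `subprocess.CalledProcessError` on the first non-zero exit.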
Local canary bootstrap
Use this when working from the canary stabilization worktree instead of the main worktree.
Minimum local prerequisites:
- Root env for backend gates:
ln -s /path/to/main/.env.codex /path/to/canary/.env.codex
backend/scripts/release_gate.sh auto-sources .env.codex from the canary worktree root. If this file is missing, the gate fails early at the OPENAI_API_KEY check.
- Shared Router RAG corpus path:
ln -s /path/to/main/_RAG_Ready_KB_Organized /path/to/canary/_RAG_Ready_KB_Organized
Without a local corpus path, backend verify and router RAG smoke can fail on missing rag_ingestion_chunks.jsonl.
- Hosted smoke env for the canary frontend:
Create frontend/.env.e2e with at least:
E2E_DISABLE_WEBSERVER=true
E2E_BASE_URL=https://crazycrazypete-masters-four-tab-openai-canary.hf.space
E2E_AUTH0_DOMAIN=<your-auth0-domain>
E2E_AUTH_TEST_EMAIL=<test-user-email>
E2E_AUTH_TEST_PASSWORD=<test-user-password>
frontend/scripts/run-hosted-smoke.sh checks for .env.e2e unless E2E_ENV_FILE is explicitly set.
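Before running the hosted smoke, it can help to confirm that the env file carries every key listed above. The following is a hedged sketch: `parse_env_text` and `missing_e2e_keys` are hypothetical helpers, and the parsing handles only simple `KEY=value` lines, which may be narrower than what the real script accepts.

```python
from typing import Dict, List

# Keys listed in this runbook as the minimum for frontend/.env.e2e.
REQUIRED_E2E_KEYS = [
    "E2E_DISABLE_WEBSERVER",
    "E2E_BASE_URL",
    "E2E_AUTH0_DOMAIN",
    "E2E_AUTH_TEST_EMAIL",
    "E2E_AUTH_TEST_PASSWORD",
]


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_e2e_keys(text: str) -> List[str]:
    """Return required keys that are absent or empty, in runbook order."""
    values = parse_env_text(text)
    return [key for key in REQUIRED_E2E_KEYS if not values.get(key)]
```

For example, `missing_e2e_keys(open("frontend/.env.e2e").read())` returns an empty list when all five keys have values.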
Deploy path
Use .github/workflows/deploy-hf-gated.yml as the supported deploy path.
Pre-deploy expectations:
- main contains the intended release.
- Repo/workflow model defaults still align on gpt-5-mini.
- Auth0 audience placeholders are absent.
- Required GitHub repository secrets are present:
  - HF_TOKEN
  - HF_SPACE_ID
  - HF_SPACE_ID_CANARY if using canary
  - HF_USERNAME if needed
  - OPENAI_API_KEY
- The gated deploy workflow actively clears forbidden legacy audience envs (AUTH0_AUDIENCE, VITE_AUTH0_AUDIENCE) from HF targets before stamping build metadata.
Deploy sequence:
- GitHub Actions runs auth/security tests, router regressions, and the RAG quality gate.
- Canary deploy pushes to the HF canary Space and stamps build metadata.
- Hosted validator checks /build-info plus /api/health when accessible for the expected build, startup integrity, and auth posture.
- Production deploy pushes the same ref to the production Space.
- Hosted validator repeats against production.
- Operator re-runs the minimal hosted smoke bundle on both production and canary:
cd frontend && ./scripts/run-hosted-smoke.sh
- Operator completes the HF/Auth0 checklist and signs off.
- Operator appends the deploy result to docs/dev/release_log.md.
If you do a manual HF git push for canary-only testing instead of the gated workflow, you must also restamp the canary Space variables:
- MASTERS_TOOLKIT_BUILD_VERSION
- MASTERS_TOOLKIT_GIT_SHA
Otherwise /build-info can report stale metadata even when the runtime SHA is current.
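A manual restamp could be sketched as follows. The function names and the injected `set_variable` callable are hypothetical; only the two variable names come from this runbook. How the variables are actually written to the Space is left to the caller.

```python
from typing import Callable, Dict


def stamp_values(build_version: str, git_sha: str) -> Dict[str, str]:
    """Build the two Space variables that /build-info metadata depends on."""
    if not build_version or not git_sha:
        raise ValueError("build_version and git_sha must both be non-empty")
    return {
        "MASTERS_TOOLKIT_BUILD_VERSION": build_version,
        "MASTERS_TOOLKIT_GIT_SHA": git_sha,
    }


def restamp(
    set_variable: Callable[[str, str], None],
    build_version: str,
    git_sha: str,
) -> None:
    """Write both variables through a caller-supplied setter (hypothetical hook)."""
    for key, value in stamp_values(build_version, git_sha).items():
        set_variable(key, value)
```

One candidate setter, assuming a recent `huggingface_hub` release is installed and a token is configured, is `lambda k, v: HfApi().add_space_variable(repo_id=space_id, key=k, value=v)`, where `space_id` is the canary Space identifier.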
Hosted validation
Use this after any deploy, manual Space config change, or suspected HF/Auth0 drift.
python backend/scripts/validate_hosted_runtime.py \
--base-url https://your-space.hf.space \
--expected-build-version release-... \
--expected-git-sha <sha> \
--expect-auth-required true \
--expect-auth-enabled true
What it validates:
- hosted build version and git SHA
- startup integrity status
- /api/health overall status when the endpoint is publicly reachable
- auth-required and auth-enabled posture
- no effective use of forbidden legacy audiences
- no forbidden Auth0 placeholder warning/detail text
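The comparison at the core of these checks can be illustrated with a small function. This is a sketch under stated assumptions, not the logic of validate_hosted_runtime.py: the payload field names (`build_version`, `git_sha`, `auth_required`, `auth_enabled`) are hypothetical and the real /build-info response may be shaped differently.

```python
from typing import Any, Dict, List


def build_info_mismatches(
    payload: Dict[str, Any],
    expected_build_version: str,
    expected_git_sha: str,
    expect_auth_required: bool,
    expect_auth_enabled: bool,
) -> List[str]:
    """Compare a decoded /build-info payload with the expected deploy values."""
    # Field names below are assumptions made for illustration only.
    expected = {
        "build_version": expected_build_version,
        "git_sha": expected_git_sha,
        "auth_required": expect_auth_required,
        "auth_enabled": expect_auth_enabled,
    }
    problems: List[str] = []
    for field, want in expected.items():
        got = payload.get(field)
        if got != want:
            problems.append(f"{field}: expected {want!r}, got {got!r}")
    return problems
```

An empty list means the payload matched on all four fields; any entry names the field that drifted.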
Release log
After every successful deploy or rollback, add one entry to docs/dev/release_log.md.
Each entry should include:
- date
- GitHub Actions run id
- deployed git SHA
- build version
- target lanes used: canary, production, or both
- hosted validator result
- hosted smoke result
- any manual HF/Auth0 config changes made during the release
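To keep entries consistent, the fields above can be captured in a small record. The layout rendered here is only an illustration; `ReleaseEntry` is a hypothetical helper and docs/dev/release_log.md may follow a different format.

```python
from dataclasses import dataclass

ALLOWED_LANES = ("canary", "production", "both")


@dataclass
class ReleaseEntry:
    date: str
    run_id: str
    git_sha: str
    build_version: str
    lanes: str
    validator_result: str
    smoke_result: str
    manual_changes: str = "none"

    def render(self) -> str:
        """Render the entry as a bullet block (assumed layout, not the repo's)."""
        if self.lanes not in ALLOWED_LANES:
            raise ValueError(f"lanes must be one of {ALLOWED_LANES}")
        lines = [
            f"- date: {self.date}",
            f"- GitHub Actions run id: {self.run_id}",
            f"- deployed git SHA: {self.git_sha}",
            f"- build version: {self.build_version}",
            f"- target lanes: {self.lanes}",
            f"- hosted validator result: {self.validator_result}",
            f"- hosted smoke result: {self.smoke_result}",
            f"- manual HF/Auth0 config changes: {self.manual_changes}",
        ]
        return "\n".join(lines)
```

Rejecting an unknown lane value at render time keeps a typo from reaching the log unnoticed.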
Rollback
Use the same deploy workflow with workflow_dispatch and rollback_ref.
Rollback sequence:
- Choose the last known-good commit or tag.
- Run Deploy HF (Gated Canary + Rollback) with rollback_ref=<known-good-ref>.
- Let canary validate first unless production recovery requires skip_canary=true.
- Re-run hosted validation against the rollback target.
- Re-run the HF/Auth0 operator checklist before declaring recovery complete.
Auth0 drift response
Use this when login fails, callback behavior changes, or hosted auth diagnostics report config issues.
Check in order:
- APP_BASE_URL matches the intended canonical host.
- VITE_APP_BASE_URL matches the same canonical host.
- AUTH0_DOMAIN, AUTH0_CLIENT_ID, and VITE_AUTH0_CLIENT_ID are correct for the SPA app.
- AUTH0_AUDIENCE and VITE_AUTH0_AUDIENCE are blank unless a real API Identifier exists.
- Auth0 allowed callback, logout, and web origin lists include the deployed host.
- Hosted Space was rebuilt after any frontend auth env change.
- Credentialed auth smoke still succeeds on the intended target.
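The env-level checks in this list can be sketched as one ordered pass over a variable mapping. This is illustrative only: `auth0_drift_findings` is a hypothetical helper, it assumes the backend and frontend client IDs are meant to be identical, and it does not cover the Auth0 dashboard lists, the rebuild, or the credentialed smoke.

```python
from typing import Dict, List


def auth0_drift_findings(env: Dict[str, str], canonical_host: str) -> List[str]:
    """Return findings in the same order as the runbook checklist."""
    findings: List[str] = []
    want = canonical_host.rstrip("/")

    # Both base URLs should point at the same canonical host.
    for key in ("APP_BASE_URL", "VITE_APP_BASE_URL"):
        if env.get(key, "").rstrip("/") != want:
            findings.append(f"{key} does not match {want}")

    # Domain and client IDs must be present for the SPA app.
    for key in ("AUTH0_DOMAIN", "AUTH0_CLIENT_ID", "VITE_AUTH0_CLIENT_ID"):
        if not env.get(key):
            findings.append(f"{key} is not set")
    # Assumption: both client ID variables refer to the same SPA client.
    if env.get("AUTH0_CLIENT_ID") != env.get("VITE_AUTH0_CLIENT_ID"):
        findings.append("AUTH0_CLIENT_ID and VITE_AUTH0_CLIENT_ID differ")

    # Audiences should stay blank unless a real API Identifier exists.
    for key in ("AUTH0_AUDIENCE", "VITE_AUTH0_AUDIENCE"):
        if env.get(key, "").strip():
            findings.append(f"{key} is set; confirm a real API Identifier exists")
    return findings
```

An empty result covers only the variable checks; the remaining items in the list still need to be verified by hand.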
Model or env drift response
Use this when hosted behavior does not match repo decisions.
Check in order:
- GitHub workflow env pins still use gpt-5-mini.
- Hugging Face Space variables do not override repo defaults with stale model values.
- Build metadata on /build-info matches the expected deploy ref.
- /build-info shows clean startup and auth posture, and the reported APP_BASE_URL/VITE_APP_BASE_URL origins match the deployed host. /api/health does too when accessed with credentials.
- If hosted envs changed outside git, capture the change in docs/dev/decisions.md or docs/dev/open_tasks.md.
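The stale-override check can be sketched as a filter over the Space variables. The helper name is hypothetical, and the caller supplies the model-related variable names because this runbook does not list them.

```python
from typing import Dict, Iterable

EXPECTED_MODEL = "gpt-5-mini"  # repo/workflow default named in this runbook


def stale_model_overrides(
    space_vars: Dict[str, str],
    model_keys: Iterable[str],
    expected_model: str = EXPECTED_MODEL,
) -> Dict[str, str]:
    """Return model variables whose hosted value differs from the expected pin."""
    return {
        key: space_vars[key]
        for key in model_keys
        if key in space_vars and space_vars[key] != expected_model
    }
```

Variables that are absent from the Space are ignored, since an unset variable cannot override the repo default.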
Incidents
Use these symptoms as the first split:
- Login/callback failure: Auth0 drift or stale frontend env bundle.
- Wrong model behavior: hosted env override or stale workflow pin.
- Startup green but answers wrong: retrieval/model regression; run backend verify plus targeted evals.
- Startup not green: check /build-info, authenticated /api/health when available, startup integrity, and required assets.
When to stop and escalate
- Any Auth0 client/app change not already documented.
- Any deploy workflow or secret-management change that could affect both canary and production.
- Any model default change away from gpt-5-mini.
- Any proposal to weaken guardrails, provenance, or timeout behavior.