# Masters Telecom Toolkit Runbook

This runbook is the operator surface for local verification, Hugging Face deployment, hosted validation, rollback, and Auth0/runtime drift handling.

## Source of truth

- Architecture: `docs/architecture.md`
- HF/Auth0 operator checklist: `docs/hf_auth0_operator_checklist.md`
- Repo rules for Codex and contributors: `AGENTS.md`
- Canonical eval guidance: `docs/evals/README.md`
- Release log: `docs/dev/release_log.md`

## Local verify bar

Run these before merge for meaningful app, auth, retrieval, or deployment changes.

```bash
# Subshells keep each cd local, so the block can be pasted from the repo root as-is.
(cd backend && bash scripts/test_backend.sh)
(cd frontend && npx tsc -p tsconfig.json --noEmit)
(cd frontend && npm run build)
```

Run the stronger bar for backend contracts, auth, retrieval logic, model/env changes, or deploy workflow edits.

```bash
(cd backend && bash scripts/test_backend.sh --full)
(cd backend && bash scripts/release_gate.sh)
```

## Local canary bootstrap

Use this when working from the canary stabilization worktree instead of the main worktree.

Minimum local prerequisites:

1. Root env for backend gates:

   ```bash
   ln -s /path/to/main/.env.codex /path/to/canary/.env.codex
   ```

   `backend/scripts/release_gate.sh` auto-sources the canary worktree root `.env.codex`. If this file is missing, the gate will fail early on `OPENAI_API_KEY`.

2. Shared Router RAG corpus path:

   ```bash
   ln -s /path/to/main/_RAG_Ready_KB_Organized /path/to/canary/_RAG_Ready_KB_Organized
   ```

   Without a local corpus path, backend verify and router RAG smoke can fail on missing `rag_ingestion_chunks.jsonl`.

3. Hosted smoke env for the canary frontend:

   Create `frontend/.env.e2e` with at least:

   ```env
   E2E_DISABLE_WEBSERVER=true
   E2E_BASE_URL=https://crazycrazypete-masters-four-tab-openai-canary.hf.space
   E2E_AUTH0_DOMAIN=<your-auth0-domain>
   E2E_AUTH_TEST_EMAIL=<test-user-email>
   E2E_AUTH_TEST_PASSWORD=<test-user-password>
   ```

   `frontend/scripts/run-hosted-smoke.sh` checks for `.env.e2e` unless `E2E_ENV_FILE` is explicitly set.

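The three prerequisites above can be checked in one pass before running any gate. The sketch below is a hypothetical helper, not a script that ships in the repo: the name `check_canary_prereqs` is illustrative, and it assumes that `E2E_ENV_FILE`, when set, holds a path that can be tested as-is.

```bash
# Hypothetical preflight for a canary worktree.
# Prints one line per missing prerequisite and returns non-zero if any is missing.
# Usage: check_canary_prereqs [worktree-root]
check_canary_prereqs() (
  root="${1:-.}"
  # Assumption: E2E_ENV_FILE, if set, replaces the default frontend/.env.e2e path.
  e2e_file="${E2E_ENV_FILE:-$root/frontend/.env.e2e}"
  status=0
  for path in "$root/.env.codex" "$root/_RAG_Ready_KB_Organized" "$e2e_file"; do
    # -e follows symlinks, so a dangling link to the main worktree counts as missing.
    if [ ! -e "$path" ]; then
      printf 'missing: %s\n' "$path"
      status=1
    fi
  done
  return "$status"
)
```

Run it from the canary worktree root, for example `check_canary_prereqs .`. No output and a zero exit status mean all three paths resolve.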

## Deploy path

Use `.github/workflows/deploy-hf-gated.yml` as the supported deploy path.

Pre-deploy expectations:

- `main` contains the intended release.
- Repo/workflow model defaults still align on `gpt-5-mini`.
- Auth0 audience placeholders are absent.
- Required GitHub repository secrets are present:
  - `HF_TOKEN`
  - `HF_SPACE_ID`
  - `HF_SPACE_ID_CANARY` if using canary
  - `HF_USERNAME` if needed
  - `OPENAI_API_KEY`
- The gated deploy workflow actively clears forbidden legacy audience envs (`AUTH0_AUDIENCE`, `VITE_AUTH0_AUDIENCE`) from HF targets before stamping build metadata.

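The required-secrets expectation above can be checked before triggering the workflow. This is a minimal sketch with a hypothetical function name, `missing_secrets`; it compares names only and never sees secret values. It assumes the secret names arrive one per line on stdin.

```bash
# Hypothetical check: read secret names (one per line) on stdin and report
# which required names are absent. Optional names such as HF_SPACE_ID_CANARY
# or HF_USERNAME can be passed as extra arguments.
missing_secrets() (
  present="$(cat)"
  status=0
  for name in HF_TOKEN HF_SPACE_ID OPENAI_API_KEY "$@"; do
    # -x matches whole lines only, so HF_SPACE_ID_CANARY does not satisfy HF_SPACE_ID.
    if ! printf '%s\n' "$present" | grep -qxF -- "$name"; then
      printf 'missing secret: %s\n' "$name"
      status=1
    fi
  done
  return "$status"
)
```

One possible source for the names is the GitHub CLI, assuming it is installed, authenticated, and prints the secret name in the first column: `gh secret list | awk '{print $1}' | missing_secrets HF_SPACE_ID_CANARY`.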

Deploy sequence:

1. GitHub Actions runs auth/security tests, router regressions, and the RAG quality gate.
2. Canary deploy pushes to the HF canary Space and stamps build metadata.
3. Hosted validator checks `/build-info`, plus `/api/health` when accessible, for the expected build, startup integrity, and auth posture.
4. Production deploy pushes the same ref to the production Space.
5. Hosted validator repeats against production.
6. Operator re-runs the minimal hosted smoke bundle on both production and canary:

   ```bash
   cd frontend && ./scripts/run-hosted-smoke.sh
   ```

7. Operator completes the HF/Auth0 checklist and signs off.
8. Operator appends the deploy result to `docs/dev/release_log.md`.

If you do a manual HF git push for canary-only testing instead of the gated workflow, you must also restamp the canary Space variables:

- `MASTERS_TOOLKIT_BUILD_VERSION`
- `MASTERS_TOOLKIT_GIT_SHA`

Otherwise `/build-info` can report stale metadata even when the runtime SHA is current.

## Hosted validation

Use this after any deploy, manual Space config change, or suspected HF/Auth0 drift.

```bash
python backend/scripts/validate_hosted_runtime.py \
  --base-url https://your-space.hf.space \
  --expected-build-version release-... \
  --expected-git-sha <sha> \
  --expect-auth-required true \
  --expect-auth-enabled true
```

What it validates:

- hosted build version and git SHA
- startup integrity status
- `/api/health` overall status when the endpoint is publicly reachable
- auth-required and auth-enabled posture
- no effective use of forbidden legacy audiences
- no forbidden Auth0 placeholder warning/detail text

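When more than one lane needs checking, the validator call above can be looped over several base URLs. The wrapper below is a sketch: `validate_lanes` is a hypothetical name, the flags are the ones shown above, and it assumes the validator exits non-zero when a check fails.

```bash
# Hypothetical wrapper: run the hosted validator against each base URL in turn.
# Every lane is checked even if an earlier one fails; the return status is
# non-zero if any lane failed.
# Usage: validate_lanes <build-version> <git-sha> <base-url>...
validate_lanes() (
  build_version="$1"
  git_sha="$2"
  shift 2
  failed=0
  for base_url in "$@"; do
    printf '==> validating %s\n' "$base_url" >&2
    python backend/scripts/validate_hosted_runtime.py \
      --base-url "$base_url" \
      --expected-build-version "$build_version" \
      --expected-git-sha "$git_sha" \
      --expect-auth-required true \
      --expect-auth-enabled true || failed=1
  done
  return "$failed"
)
```

A call might look like `validate_lanes "$BUILD_VERSION" "$GIT_SHA" "$CANARY_URL" "$PROD_URL"`, where the four variables are placeholders for the values of the release being checked.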

## Release log

After every successful deploy or rollback, add one entry to `docs/dev/release_log.md`.

Each entry should include:

- date
- GitHub Actions run id
- deployed git SHA
- build version
- target lanes used: canary, production, or both
- hosted validator result
- hosted smoke result
- any manual HF/Auth0 config changes made during the release

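The field list above maps onto a small formatter. The layout below is an assumption for illustration only; if `docs/dev/release_log.md` already has an established entry format, follow that instead. `release_log_entry` is a hypothetical helper name.

```bash
# Hypothetical formatter: print one release-log entry as a bullet list.
# Usage: release_log_entry <date> <run-id> <git-sha> <build-version> <lanes> \
#                          <validator-result> <smoke-result> [manual-changes]
release_log_entry() {
  # '%s\n' keeps the leading "-" out of the format string, which some
  # printf implementations would otherwise try to parse as an option.
  printf '%s\n' "- Date: $1"
  printf '%s\n' "- GitHub Actions run id: $2"
  printf '%s\n' "- Deployed git SHA: $3"
  printf '%s\n' "- Build version: $4"
  printf '%s\n' "- Target lanes: $5"
  printf '%s\n' "- Hosted validator result: $6"
  printf '%s\n' "- Hosted smoke result: $7"
  printf '%s\n' "- Manual HF/Auth0 config changes: ${8:-none}"
}
```

To append an entry, redirect the output into the log, for example `release_log_entry "$(date -u +%Y-%m-%d)" "$RUN_ID" "$GIT_SHA" "$BUILD_VERSION" both pass pass >> docs/dev/release_log.md`, with the variables standing in for the values of the release.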

## Rollback

Use the same deploy workflow with `workflow_dispatch` and `rollback_ref`.

Rollback sequence:

1. Choose the last known-good commit or tag.
2. Run `Deploy HF (Gated Canary + Rollback)` with `rollback_ref=<known-good-ref>`.
3. Let canary validate first unless production recovery requires `skip_canary=true`.
4. Re-run hosted validation against the rollback target.
5. Re-run the HF/Auth0 operator checklist before declaring recovery complete.

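Step 2 can also be triggered from a terminal with the GitHub CLI. This is a sketch that assumes `gh` is installed and authenticated, that the workflow file is `deploy-hf-gated.yml` as named in the deploy section, and that `rollback_ref` and `skip_canary` are its `workflow_dispatch` inputs. `trigger_rollback` is a hypothetical wrapper name.

```bash
# Hypothetical wrapper around `gh workflow run` for the rollback path.
# Usage: trigger_rollback <known-good-ref> [true]
#   Pass "true" as the second argument only when production recovery
#   requires skipping the canary lane.
trigger_rollback() (
  ref="${1:?usage: trigger_rollback <known-good-ref> [true]}"
  skip_canary="${2:-false}"
  # Build the argument list with positional parameters to stay POSIX-compatible.
  set -- workflow run deploy-hf-gated.yml -f "rollback_ref=$ref"
  if [ "$skip_canary" = "true" ]; then
    set -- "$@" -f "skip_canary=true"
  fi
  gh "$@"
)
```

Leaving the second argument off keeps the default path, where canary validates first.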

## Auth0 drift response

Use this when login fails, callback behavior changes, or hosted auth diagnostics report config issues.

Check in order:

1. `APP_BASE_URL` matches the intended canonical host.
2. `VITE_APP_BASE_URL` matches the same canonical host.
3. `AUTH0_DOMAIN`, `AUTH0_CLIENT_ID`, and `VITE_AUTH0_CLIENT_ID` are correct for the SPA app.
4. `AUTH0_AUDIENCE` and `VITE_AUTH0_AUDIENCE` are blank unless a real API Identifier exists.
5. Auth0 allowed callback, logout, and web origin lists include the deployed host.
6. Hosted Space was rebuilt after any frontend auth env change.
7. Credentialed auth smoke still succeeds on the intended target.

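Checks 1 through 4 can be partly automated once the relevant values are loaded into the current shell. The sketch below is hypothetical (`check_auth_env` is not an existing script) and makes two assumptions: the backend and frontend client IDs are expected to be identical because both point at the same SPA app, and no real API Identifier exists, so any non-blank audience counts as drift.

```bash
# Hypothetical consistency check over env values already present in the shell.
# Prints one "drift:" line per problem and returns non-zero if any is found.
check_auth_env() (
  status=0
  # Checks 1-2: both base URLs must be set and point at the same host.
  if [ -z "${APP_BASE_URL:-}" ] || [ "${APP_BASE_URL:-}" != "${VITE_APP_BASE_URL:-}" ]; then
    printf 'drift: APP_BASE_URL and VITE_APP_BASE_URL must be set and identical\n'
    status=1
  fi
  # Check 3 (assumption): backend and frontend use the same SPA client id.
  if [ "${AUTH0_CLIENT_ID:-}" != "${VITE_AUTH0_CLIENT_ID:-}" ]; then
    printf 'drift: AUTH0_CLIENT_ID and VITE_AUTH0_CLIENT_ID differ\n'
    status=1
  fi
  # Check 4 (assumption): no real API Identifier exists, so both must be blank.
  if [ -n "${AUTH0_AUDIENCE:-}${VITE_AUTH0_AUDIENCE:-}" ]; then
    printf 'drift: AUTH0_AUDIENCE or VITE_AUTH0_AUDIENCE is set\n'
    status=1
  fi
  return "$status"
)
```

The remaining checks (Auth0 allowed URL lists, Space rebuild, credentialed smoke) still need the Auth0 dashboard, the Space, and the hosted smoke run.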

## Model or env drift response

Use this when hosted behavior does not match repo decisions.

Check in order:

1. GitHub workflow env pins still use `gpt-5-mini`.
2. Hugging Face Space variables do not override repo defaults with stale model values.
3. Build metadata on `/build-info` matches the expected deploy ref.
4. `/build-info` shows clean startup and auth posture, and the reported `APP_BASE_URL` / `VITE_APP_BASE_URL` origins match the deployed host.
5. `/api/health` shows the same clean startup and auth posture when accessed with credentials.
6. If hosted envs changed outside git, capture the change in `docs/dev/decisions.md` or `docs/dev/open_tasks.md`.

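Check 1 lends itself to a quick grep. The sketch below uses a hypothetical helper, `find_model_pin_drift`, and assumes model names in the workflow files look like `gpt-...` tokens. Anything other than exactly `gpt-5-mini` is reported, including dated or suffixed variants.

```bash
# Hypothetical scan: list every gpt-* token in the given files or directories
# that is not exactly gpt-5-mini. Returns non-zero when drift is found.
# Usage: find_model_pin_drift <path>...
find_model_pin_drift() (
  # -o prints each matching token on its own line; -H and -n prefix it with
  # file and line number, so every output line ends in ":<token>".
  drift="$(grep -rHnoE 'gpt-[A-Za-z0-9._-]+' "$@" | grep -vE ':gpt-5-mini$' || true)"
  if [ -n "$drift" ]; then
    printf '%s\n' "$drift"
    return 1
  fi
  return 0
)
```

Run from the repo root, `find_model_pin_drift .github/workflows` covers the workflow pins. Space variables are set outside git, so check 2 still has to be done against the Space settings.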

## Incidents

Use these symptoms as the first split:

- Login/callback failure: Auth0 drift or stale frontend env bundle.
- Wrong model behavior: hosted env override or stale workflow pin.
- Startup green but answers wrong: retrieval/model regression; run backend verify plus targeted evals.
- Startup not green: check `/build-info`, authenticated `/api/health` when available, startup integrity, and required assets.

## When to stop and escalate

- Any Auth0 client/app change not already documented.
- Any deploy workflow or secret-management change that could affect both canary and production.
- Any model default change away from `gpt-5-mini`.
- Any proposal to weaken guardrails, provenance, or timeout behavior.