# Masters Telecom Toolkit Runbook
This runbook is the operator surface for local verification, Hugging Face deployment, hosted validation, rollback, and Auth0/runtime drift handling.
## Source of truth
- Architecture: `docs/architecture.md`
- HF/Auth0 operator checklist: `docs/hf_auth0_operator_checklist.md`
- Repo rules for Codex and contributors: `AGENTS.md`
- Canonical eval guidance: `docs/evals/README.md`
- Release log: `docs/dev/release_log.md`
## Local verify bar
Run these before merge for meaningful app, auth, retrieval, or deployment changes.
```bash
# Run from the repo root; each subshell leaves the working directory unchanged.
(cd backend && bash scripts/test_backend.sh)
(cd frontend && npx tsc -p tsconfig.json --noEmit)
(cd frontend && npm run build)
```
Run the stronger bar for backend contracts, auth, retrieval logic, model/env changes, or deploy workflow edits.
```bash
# Run from the repo root; each subshell leaves the working directory unchanged.
(cd backend && bash scripts/test_backend.sh --full)
(cd backend && bash scripts/release_gate.sh)
```
## Local canary bootstrap
Use this when working from the canary stabilization worktree instead of the main worktree.
Minimum local prerequisites:
1. Root env for backend gates:
```bash
ln -s /path/to/main/.env.codex /path/to/canary/.env.codex
```
`backend/scripts/release_gate.sh` auto-sources the canary worktree root `.env.codex`. If this file is missing, the gate will fail early on `OPENAI_API_KEY`.
2. Shared Router RAG corpus path:
```bash
ln -s /path/to/main/_RAG_Ready_KB_Organized /path/to/canary/_RAG_Ready_KB_Organized
```
Without a local corpus path, backend verify and router RAG smoke can fail on missing `rag_ingestion_chunks.jsonl`.
3. Hosted smoke env for the canary frontend:
Create `frontend/.env.e2e` with at least:
```env
E2E_DISABLE_WEBSERVER=true
E2E_BASE_URL=https://crazycrazypete-masters-four-tab-openai-canary.hf.space
E2E_AUTH0_DOMAIN=<your-auth0-domain>
E2E_AUTH_TEST_EMAIL=<test-user-email>
E2E_AUTH_TEST_PASSWORD=<test-user-password>
```
`frontend/scripts/run-hosted-smoke.sh` checks for `.env.e2e` unless `E2E_ENV_FILE` is explicitly set.
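The three prerequisites above can be checked in one pass before running any gate. This is a minimal sketch, not a script that exists in the repo: the function name `missing_canary_prereqs` is hypothetical, and it only assumes the paths named in this section, relative to the canary worktree root.

```python
from pathlib import Path

# Paths named in this section, relative to the canary worktree root.
REQUIRED_PATHS = (
    ".env.codex",
    "_RAG_Ready_KB_Organized",
    "frontend/.env.e2e",
)


def missing_canary_prereqs(worktree_root: str) -> list[str]:
    """Return the required paths that do not exist under worktree_root.

    Path.exists() follows symlinks, so a dangling `ln -s` target is
    reported as missing too.
    """
    root = Path(worktree_root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]
```

An empty list means all three prerequisites resolve. If `E2E_ENV_FILE` is set explicitly, the `frontend/.env.e2e` entry does not apply and can be ignored.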
## Deploy path
Use `.github/workflows/deploy-hf-gated.yml` as the supported deploy path.
Pre-deploy expectations:
- `main` contains the intended release.
- Repo/workflow model defaults still align on `gpt-5-mini`.
- Auth0 audience placeholders are absent.
- Required GitHub repository secrets are present:
  - `HF_TOKEN`
  - `HF_SPACE_ID`
  - `HF_SPACE_ID_CANARY` if using canary
  - `HF_USERNAME` if needed
  - `OPENAI_API_KEY`
- The gated deploy workflow actively clears forbidden legacy audience envs (`AUTH0_AUDIENCE`, `VITE_AUTH0_AUDIENCE`) from HF targets before stamping build metadata.
Deploy sequence:
1. GitHub Actions runs auth/security tests, router regressions, and the RAG quality gate.
2. Canary deploy pushes to the HF canary Space and stamps build metadata.
3. Hosted validator checks `/build-info`, plus `/api/health` when accessible, for the expected build, startup integrity, and auth posture.
4. Production deploy pushes the same ref to the production Space.
5. Hosted validator repeats against production.
6. Operator re-runs the minimal hosted smoke bundle on both production and canary:
```bash
cd frontend && ./scripts/run-hosted-smoke.sh
```
7. Operator completes the HF/Auth0 checklist and signs off.
8. Operator appends the deploy result to `docs/dev/release_log.md`.
If you do a manual HF git push for canary-only testing instead of the gated workflow, you must also restamp the canary Space variables:
- `MASTERS_TOOLKIT_BUILD_VERSION`
- `MASTERS_TOOLKIT_GIT_SHA`
Otherwise `/build-info` can report stale metadata even when the runtime SHA is current.
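The restamp can be scripted rather than done through the Space settings UI. The sketch below is an assumption about how one might do it, not the mechanism the gated workflow uses: `restamp_canary` is a hypothetical helper, and `api` is assumed to behave like `huggingface_hub.HfApi`, whose `add_space_variable(repo_id, key, value)` creates or updates a Space variable.

```python
def restamp_canary(api, space_id: str, build_version: str, git_sha: str) -> dict[str, str]:
    """Set the two build-metadata variables on a Space and return what was set.

    `api` is assumed to expose add_space_variable(repo_id=..., key=..., value=...),
    as huggingface_hub.HfApi does. Nothing else about the Space is touched.
    """
    stamped = {
        "MASTERS_TOOLKIT_BUILD_VERSION": build_version,
        "MASTERS_TOOLKIT_GIT_SHA": git_sha,
    }
    for key, value in stamped.items():
        api.add_space_variable(repo_id=space_id, key=key, value=value)
    return stamped
```

With the real client, a call would look roughly like `restamp_canary(HfApi(token=...), "<canary-space-id>", "<build-version>", "<sha>")`. Passing the client in as a parameter keeps the helper testable with a stub.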
## Hosted validation
Use this after any deploy, manual Space config change, or suspected HF/Auth0 drift.
```bash
python backend/scripts/validate_hosted_runtime.py \
  --base-url https://your-space.hf.space \
  --expected-build-version release-... \
  --expected-git-sha <sha> \
  --expect-auth-required true \
  --expect-auth-enabled true
```
What it validates:
- hosted build version and git SHA
- startup integrity status
- `/api/health` overall status when the endpoint is publicly reachable
- auth-required and auth-enabled posture
- no effective use of forbidden legacy audiences
- no forbidden Auth0 placeholder warning/detail text
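To illustrate the first check in the list, here is a rough sketch of comparing a decoded `/build-info` payload against the expected release. It is not the logic in `validate_hosted_runtime.py`: the function name and the payload keys `build_version` and `git_sha` are assumptions and should be adjusted to whatever the endpoint actually returns.

```python
def build_info_mismatches(payload: dict, expected_version: str, expected_sha: str) -> list[str]:
    """Return human-readable mismatches between a /build-info payload and the expected release.

    The keys "build_version" and "git_sha" are hypothetical field names.
    """
    problems = []
    reported_version = payload.get("build_version")
    if reported_version != expected_version:
        problems.append(f"build version: expected {expected_version!r}, got {reported_version!r}")

    reported_sha = str(payload.get("git_sha") or "")
    # Short and full SHAs count as equal when one is a prefix of the other;
    # an empty value on either side never matches.
    sha_matches = bool(reported_sha and expected_sha) and (
        reported_sha.startswith(expected_sha) or expected_sha.startswith(reported_sha)
    )
    if not sha_matches:
        problems.append(f"git SHA: expected {expected_sha!r}, got {reported_sha!r}")
    return problems
```

An empty list means the hosted build matches the expected ref. A SHA mismatch after a manual push usually points at stale stamped variables rather than stale code.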
## Release log
After every successful deploy or rollback, add one entry to `docs/dev/release_log.md`.
Each entry should include:
- date
- GitHub Actions run id
- deployed git SHA
- build version
- target lanes used: canary, production, or both
- hosted validator result
- hosted smoke result
- any manual HF/Auth0 config changes made during the release
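A small helper can keep entries consistent with the field list above. This is only a sketch: the `ReleaseEntry` class and the bullet layout it emits are hypothetical, and the actual format of `docs/dev/release_log.md` takes precedence.

```python
from dataclasses import dataclass


@dataclass
class ReleaseEntry:
    """One release-log entry; fields mirror the list in this section."""

    date: str
    run_id: str
    git_sha: str
    build_version: str
    lanes: str  # "canary", "production", or "both"
    validator_result: str
    smoke_result: str
    manual_changes: str = "none"

    def to_markdown(self) -> str:
        """Render the entry as one bullet per field, newline-terminated."""
        lines = [
            f"- date: {self.date}",
            f"- GitHub Actions run id: {self.run_id}",
            f"- deployed git SHA: {self.git_sha}",
            f"- build version: {self.build_version}",
            f"- target lanes: {self.lanes}",
            f"- hosted validator result: {self.validator_result}",
            f"- hosted smoke result: {self.smoke_result}",
            f"- manual HF/Auth0 config changes: {self.manual_changes}",
        ]
        return "\n".join(lines) + "\n"
```

Making every field except `manual_changes` required means an incomplete entry fails at construction time instead of landing in the log with gaps.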
## Rollback
Use the same deploy workflow with `workflow_dispatch` and `rollback_ref`.
Rollback sequence:
1. Choose the last known-good commit or tag.
2. Run `Deploy HF (Gated Canary + Rollback)` with `rollback_ref=<known-good-ref>`.
3. Let canary validate first unless production recovery requires `skip_canary=true`.
4. Re-run hosted validation against the rollback target.
5. Re-run the HF/Auth0 operator checklist before declaring recovery complete.
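Step 2 can also be triggered through the GitHub REST API's workflow-dispatch endpoint instead of the Actions UI. The sketch below only builds the request and does not send it. The `owner`, `repo`, and `token` values are placeholders, `build_rollback_dispatch` is a hypothetical helper, and it assumes the workflow accepts `rollback_ref` and `skip_canary` as dispatch inputs under exactly those names.

```python
import json
import urllib.request


def build_rollback_dispatch(
    owner: str,
    repo: str,
    token: str,
    rollback_ref: str,
    skip_canary: bool = False,
    workflow_ref: str = "main",
) -> urllib.request.Request:
    """Build (but do not send) a workflow_dispatch request for a rollback.

    workflow_ref is the branch whose copy of the workflow file runs;
    rollback_ref is the known-good commit or tag to deploy.
    """
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        "/actions/workflows/deploy-hf-gated.yml/dispatches"
    )
    inputs = {"rollback_ref": rollback_ref}
    if skip_canary:
        # Dispatch inputs are sent as strings here.
        inputs["skip_canary"] = "true"
    body = json.dumps({"ref": workflow_ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Sending is a separate, deliberate step, for example `urllib.request.urlopen(req)`. Keeping build and send apart lets the operator inspect the ref and inputs before anything reaches production.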
## Auth0 drift response
Use this when login fails, callback behavior changes, or hosted auth diagnostics report config issues.
Check in order:
1. `APP_BASE_URL` matches the intended canonical host.
2. `VITE_APP_BASE_URL` matches the same canonical host.
3. `AUTH0_DOMAIN`, `AUTH0_CLIENT_ID`, and `VITE_AUTH0_CLIENT_ID` are correct for the SPA app.
4. `AUTH0_AUDIENCE` and `VITE_AUTH0_AUDIENCE` are blank unless a real API Identifier exists.
5. Auth0 allowed callback, logout, and web origin lists include the deployed host.
6. Hosted Space was rebuilt after any frontend auth env change.
7. Credentialed auth smoke still succeeds on the intended target.
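Checks 1, 2, and 4 are mechanical and can be run against a dump of the Space variables. The following is a sketch under assumptions: `auth_env_findings` is a hypothetical helper, `env` is a plain mapping of variable names to values however they were collected, and any non-blank audience is only flagged for review because the code cannot know whether a real API Identifier exists.

```python
LEGACY_AUDIENCE_KEYS = ("AUTH0_AUDIENCE", "VITE_AUTH0_AUDIENCE")


def auth_env_findings(env: dict) -> list[str]:
    """Return findings for base-URL mismatch and non-blank audience variables."""
    findings = []

    # Trailing slashes are ignored when comparing the two base URLs.
    app_url = env.get("APP_BASE_URL", "").rstrip("/")
    vite_url = env.get("VITE_APP_BASE_URL", "").rstrip("/")
    if not app_url:
        findings.append("APP_BASE_URL is not set")
    elif app_url != vite_url:
        findings.append(
            f"VITE_APP_BASE_URL ({vite_url or 'unset'}) does not match APP_BASE_URL ({app_url})"
        )

    for key in LEGACY_AUDIENCE_KEYS:
        if env.get(key, "").strip():
            findings.append(f"{key} is set; confirm a real API Identifier exists or clear it")
    return findings
```

The remaining checks (client IDs, Auth0 allowed lists, rebuild status, credentialed smoke) need the Auth0 dashboard or a live login and are left to the operator.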
## Model or env drift response
Use this when hosted behavior does not match repo decisions.
Check in order:
1. GitHub workflow env pins still use `gpt-5-mini`.
2. Hugging Face Space variables do not override repo defaults with stale model values.
3. Build metadata on `/build-info` matches the expected deploy ref.
4. `/build-info` shows clean startup and auth posture, and the reported `APP_BASE_URL` / `VITE_APP_BASE_URL` origins match the deployed host.
5. `/api/health`, when accessed with credentials, shows the same clean startup and auth posture.
6. If hosted envs changed outside git, capture the change in `docs/dev/decisions.md` or `docs/dev/open_tasks.md`.
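For check 2, a quick scan of the Space variables can surface candidate overrides. This is a heuristic sketch only: the runbook does not name the model variables, so the helper matches any variable whose name contains `MODEL`. Both the function name and that matching rule are assumptions, and hits are prompts for review rather than confirmed drift (an embedding-model variable would also be flagged).

```python
EXPECTED_MODEL = "gpt-5-mini"


def stale_model_overrides(space_vars: dict, expected: str = EXPECTED_MODEL) -> dict:
    """Return variables that look like model settings and differ from the expected default.

    A variable is considered model-related if its name contains "MODEL"
    (case-insensitive); blank values are ignored.
    """
    return {
        name: value
        for name, value in space_vars.items()
        if "MODEL" in name.upper() and value and value != expected
    }
```

An empty result means no variable matched by the heuristic disagrees with the repo default. It does not prove that no override exists under a different name.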
## Incidents
Use these symptoms as the first split:
- Login/callback failure: Auth0 drift or stale frontend env bundle.
- Wrong model behavior: hosted env override or stale workflow pin.
- Startup green but answers wrong: retrieval/model regression; run backend verify plus targeted evals.
- Startup not green: check `/build-info`, authenticated `/api/health` when available, startup integrity, and required assets.
## When to stop and escalate
- Any Auth0 client/app change not already documented.
- Any deploy workflow or secret-management change that could affect both canary and production.
- Any model default change away from `gpt-5-mini`.
- Any proposal to weaken guardrails, provenance, or timeout behavior.