math_trainer / PRODUCTION.md
NorthernTribe-Research's picture
Production UI/telemetry upgrade + monochrome theme + safety hardening
f0734c2 verified
# Production Runbook
## 1. Pre-Deploy Checks
Run all checks from `space_trainer/`:
```bash
python scripts/preflight_check.py
python -m unittest discover -s tests -v
```
Optional deeper check (loads tokenizer/model dependencies and runs stage-1 dry-run):
```bash
python scripts/preflight_check.py --run-training-dry-run
```
## 2. Runtime Configuration
Set runtime secrets in Hugging Face Space settings:
- `HF_TOKEN` (or `HUGGINGFACE_HUB_TOKEN`)
Optional safety overrides:
- `CONTINUOUS_RESTART_DELAY_SECONDS` (default `15`)
- `CONTINUOUS_MAX_CONSECUTIVE_FAILURES` (default `3`)
- `APP_LOG_MAX_CHARS` (default `200000`)
- `RUN_HISTORY_LIMIT` (default `80`)
## 3. Release Checklist
1. Ensure pre-deploy checks are green.
2. Ensure `requirements.txt` includes all runtime dependencies.
3. Deploy Space files (exclude `workspace/` artifacts).
4. Wait for Space runtime stage to reach `RUNNING`.
5. Trigger a UI preflight run (`Validation Mode (No Training)`).
6. Trigger one non-autonomous single-stage run before enabling continuous autonomous mode.
7. Confirm `workspace/runtime/run_history.json` is being updated and recent run cards render in telemetry.
## 4. Rollback Strategy
1. Re-deploy the last known good commit to the Space.
2. Disable `Continuous Auto-Restart`.
3. Run preflight mode only until health is restored.
4. Re-enable autonomous/continuous mode after one successful full run.
## 5. Operational Notes
- Full run records are persisted under `workspace/runtime/run_records/`.
- The compact run index at `workspace/runtime/run_history.json` is capped by `RUN_HISTORY_LIMIT`.