Spaces:
Sleeping
Sleeping
| # Production Runbook | |
| ## 1. Pre-Deploy Checks | |
| Run all checks from `space_trainer/`: | |
| ```bash | |
| python scripts/preflight_check.py | |
| python -m unittest discover -s tests -v | |
| ``` | |
| Optional deeper check (loads tokenizer/model dependencies and runs stage-1 dry-run): | |
| ```bash | |
| python scripts/preflight_check.py --run-training-dry-run | |
| ``` | |
| ## 2. Runtime Configuration | |
| Set runtime secrets in Hugging Face Space settings: | |
| - `HF_TOKEN` (or `HUGGINGFACE_HUB_TOKEN`) | |
| Optional safety overrides: | |
| - `CONTINUOUS_RESTART_DELAY_SECONDS` (default `15`) | |
| - `CONTINUOUS_MAX_CONSECUTIVE_FAILURES` (default `3`) | |
| - `APP_LOG_MAX_CHARS` (default `200000`) | |
| - `RUN_HISTORY_LIMIT` (default `80`) | |
| ## 3. Release Checklist | |
| 1. Ensure pre-deploy checks are green. | |
| 2. Ensure `requirements.txt` includes all runtime dependencies. | |
| 3. Deploy Space files (exclude `workspace/` artifacts). | |
| 4. Wait for Space runtime stage to reach `RUNNING`. | |
| 5. Trigger a UI preflight run (`Validation Mode (No Training)`). | |
| 6. Trigger one non-autonomous single-stage run before enabling continuous autonomous mode. | |
| 7. Confirm `workspace/runtime/run_history.json` is being updated and recent run cards render in telemetry. | |
| ## 4. Rollback Strategy | |
| 1. Re-deploy the last known good commit to the Space. | |
| 2. Disable `Continuous Auto-Restart`. | |
| 3. Run preflight mode only until health is restored. | |
| 4. Re-enable autonomous/continuous mode after one successful full run. | |
| ## 5. Operational Notes | |
| - Full run records are persisted under `workspace/runtime/run_records/`. | |
| - The compact run index at `workspace/runtime/run_history.json` is capped by `RUN_HISTORY_LIMIT`. | |