File size: 1,605 Bytes
f0734c2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
# Production Runbook

## 1. Pre-Deploy Checks

Run all checks from `space_trainer/`:

```bash
python scripts/preflight_check.py
python -m unittest discover -s tests -v
```

Optional deeper check (loads tokenizer/model dependencies and runs stage-1 dry-run):

```bash
python scripts/preflight_check.py --run-training-dry-run
```

## 2. Runtime Configuration

Set runtime secrets in Hugging Face Space settings:

- `HF_TOKEN` (or `HUGGINGFACE_HUB_TOKEN`)

Optional safety overrides:

- `CONTINUOUS_RESTART_DELAY_SECONDS` (default `15`)
- `CONTINUOUS_MAX_CONSECUTIVE_FAILURES` (default `3`)
- `APP_LOG_MAX_CHARS` (default `200000`)
- `RUN_HISTORY_LIMIT` (default `80`)

## 3. Release Checklist

1. Ensure pre-deploy checks are green.
2. Ensure `requirements.txt` includes all runtime dependencies.
3. Deploy Space files (exclude `workspace/` artifacts).
4. Wait for Space runtime stage to reach `RUNNING`.
5. Trigger a UI preflight run (`Validation Mode (No Training)`).
6. Trigger one non-autonomous single-stage run before enabling continuous autonomous mode.
7. Confirm `workspace/runtime/run_history.json` is being updated and recent run cards render in telemetry.

## 4. Rollback Strategy

1. Re-deploy the last known good commit to the Space.
2. Disable `Continuous Auto-Restart`.
3. Run preflight mode only until health is restored.
4. Re-enable autonomous/continuous mode after one successful full run.

## 5. Operational Notes

- Full run records are persisted under `workspace/runtime/run_records/`.
- The compact run index at `workspace/runtime/run_history.json` is capped by `RUN_HISTORY_LIMIT`.