# Benchmark Controller FSM (Strict-safe M3 Sweep)
## Component and scope
Component: benchmark orchestration path in `training/kan_benchmark_suite.py` that drives matrix runs, applies toolchain gates, and writes per-cell telemetry.
External dependencies: command-line arguments, toolchain manifest cache, optional training subprocess, filesystem outputs, and user cancellation.
## State list (mutually exclusive)
1. `IDLE`
No run initialized. Invariants: no active case, no snapshot timer.
2. `PRE_FLIGHT`
Manifest/cached toolchain status loaded. Invariants: manifest snapshot hash set.
3. `GATE_WAIT`
Running toolchain gate checks for a case. Invariants: pending case id is current.
4. `RUN_READY`
All non-strict checks passed and base args are frozen for the run. Invariants: `kernel_profile`, `runtime_backend_plan`, and sweep params are consistent.
5. `RUNNING`
One benchmark cell is executing. Invariants: `run_id` and `seed` are assigned; history stream is hot.
6. `RUN_COMPLETED`
Current cell finished and result is buffered. Invariants: final metrics exist or failure row written.
7. `TEARDOWN`
Persisting run row and cleaning per-run artifacts. Invariants: output files open.
8. `ERROR`
Hard gate or runtime failure; may still emit failure row in non-blocking cases.
9. `CANCELLED`
User-initiated cancel/unmount; active run is aborted and best-effort persisted.
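
The state list above can be sketched as a single enum-valued field, which makes the states mutually exclusive by construction. This is only an illustration: `ControllerContext` and `invariant_violations` are hypothetical names that do not come from `training/kan_benchmark_suite.py`, and only three of the listed invariants are checked.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class State(Enum):
    IDLE = auto()
    PRE_FLIGHT = auto()
    GATE_WAIT = auto()
    RUN_READY = auto()
    RUNNING = auto()
    RUN_COMPLETED = auto()
    TEARDOWN = auto()
    ERROR = auto()
    CANCELLED = auto()


@dataclass
class ControllerContext:
    """Hypothetical container; field names are assumptions for this sketch."""
    state: State = State.IDLE          # exactly one state at a time
    active_case: Optional[str] = None
    manifest_hash: Optional[str] = None
    run_id: Optional[int] = None
    seed: Optional[int] = None


def invariant_violations(ctx: ControllerContext) -> List[str]:
    """Return human-readable violations of a few per-state invariants."""
    problems: List[str] = []
    if ctx.state is State.IDLE and ctx.active_case is not None:
        problems.append("IDLE: an active case is set")
    if ctx.state is State.PRE_FLIGHT and ctx.manifest_hash is None:
        problems.append("PRE_FLIGHT: manifest snapshot hash is missing")
    if ctx.state is State.RUNNING and (ctx.run_id is None or ctx.seed is None):
        problems.append("RUNNING: run_id or seed is unassigned")
    return problems
```

Calling `invariant_violations` after every transition would be one way to catch a controller that enters `RUNNING` before `run_id` and `seed` are assigned.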
## Events
- `E_INIT`
- `E_CONFIG_PARSED`
- `E_PRE_FLIGHT_OK`
- `E_PRE_FLIGHT_FAIL`
- `E_GATE_CHECK_OK`
- `E_GATE_CHECK_FAIL_WARN`
- `E_GATE_CHECK_FAIL_ERROR`
- `E_CASE_START`
- `E_STEP_DONE`
- `E_RUN_OK`
- `E_RUN_FAIL`
- `E_RETRY`
- `E_CANCEL`
- `E_TIMEOUT`
- `E_UNMOUNT`
- `E_STALE_EVENT(older_run_id)`
- `E_NEW_INPUTS`
## Guards
- `G_strict_mode`: strict CoreML mode is active.
- `G_gate_requires_coreml`: case path requires strict CoreML visibility.
- `G_case_requires_ane`: current case runtime plan is ANE/HYBRID.
- `G_retry_budget`: remaining retries > 0.
- `G_fresh`: event run id matches current `run_id`.
- `G_output_ok`: output directory writable.
- `G_cancel_requested`: cancellation flag set.
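
As a rough sketch, the guards above can be written as side-effect-free predicates over a small context object. `GuardContext` and its fields are assumptions made for this example; the actual suite may keep these values in its parsed arguments or elsewhere.

```python
import os
from dataclasses import dataclass


@dataclass
class GuardContext:
    """Hypothetical snapshot of the values the guards read."""
    strict_mode: bool
    gate_requires_coreml: bool
    runtime_plan: str          # e.g. "ANE", "HYBRID", or another backend name
    retries_left: int
    run_id: int
    output_dir: str
    cancel_requested: bool = False


def g_strict_mode(ctx: GuardContext) -> bool:
    return ctx.strict_mode


def g_gate_requires_coreml(ctx: GuardContext) -> bool:
    return ctx.gate_requires_coreml


def g_case_requires_ane(ctx: GuardContext) -> bool:
    # Case-insensitive match is an assumption of this sketch.
    return ctx.runtime_plan.upper() in {"ANE", "HYBRID"}


def g_retry_budget(ctx: GuardContext) -> bool:
    return ctx.retries_left > 0


def g_fresh(ctx: GuardContext, event_run_id: int) -> bool:
    # An event is fresh only if it was emitted by the current run.
    return event_run_id == ctx.run_id


def g_output_ok(ctx: GuardContext) -> bool:
    return os.path.isdir(ctx.output_dir) and os.access(ctx.output_dir, os.W_OK)


def g_cancel_requested(ctx: GuardContext) -> bool:
    return ctx.cancel_requested
```

Keeping guards pure means the same predicate can be evaluated both when a transition is chosen and in tests, without touching subprocesses or output files.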
## Side effects
- Build environment manifest + cache lookup (`_collect_toolchain_manifest`).
- Evaluate gate (`_evaluate_toolchain_gate`).
- Create per-case output directory.
- Instantiate training args (`_set_args_from_base`) and invoke `run_training`.
- Write per-cell run JSON + summary CSV.
- Emit console warning/error lines.
- On cancel/unmount: clear in-flight worker handles and skip remaining scheduled cases.
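
The "per-cell run JSON + summary CSV" side effect could look roughly like the following. `persist_cell`, the `run.json` and `summary.csv` file names, and the row keys are all assumptions for illustration; the suite's real layout is not specified here.

```python
import csv
import json
from pathlib import Path
from typing import Any, Dict


def persist_cell(out_dir: Path, row: Dict[str, Any]) -> Path:
    """Write one cell's JSON file and append the same row to a summary CSV."""
    case_dir = out_dir / str(row["run_id"])
    case_dir.mkdir(parents=True, exist_ok=True)   # per-case output directory

    json_path = case_dir / "run.json"
    json_path.write_text(json.dumps(row, indent=2, sort_keys=True))

    csv_path = out_dir / "summary.csv"
    write_header = not csv_path.exists()
    with csv_path.open("a", newline="") as fh:
        # Sorted keys keep the column order stable across rows with equal keys.
        writer = csv.DictWriter(fh, fieldnames=sorted(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return json_path
```

Appending to the CSV after each cell, rather than once at the end, means a cancelled sweep still leaves a usable partial summary.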
## Transition table
| state | event | guard | next state | actions |
|---|---|---|---|---|
| `IDLE` | `E_CONFIG_PARSED` | `G_output_ok` | `PRE_FLIGHT` | capture manifest and persist suite manifest |
| `IDLE` | `E_CONFIG_PARSED` | `~G_output_ok` | `ERROR` | fail fast, emit manifest I/O error |
| `PRE_FLIGHT` | `E_PRE_FLIGHT_OK` | `True` | `GATE_WAIT` | compute suite defaults and base args |
| `PRE_FLIGHT` | `E_PRE_FLIGHT_FAIL` | `True` | `ERROR` | add gate diagnostics row, continue if warn |
| `GATE_WAIT` | `E_GATE_CHECK_OK` | `~G_gate_requires_coreml OR ~G_strict_mode` | `RUN_READY` | record toolchain_gate_issues (empty) |
| `GATE_WAIT` | `E_GATE_CHECK_FAIL_WARN` | `~G_strict_mode` | `RUN_READY` | record issues; mark warning metadata |
| `GATE_WAIT` | `E_GATE_CHECK_FAIL_ERROR` | `G_strict_mode` | `ERROR` | throw/fail row with `coreml` reason |
| `RUN_READY` | `E_CASE_START` | `G_fresh AND ~G_cancel_requested` | `RUNNING` | set `run_id`, `seed`, case overrides |
| `RUN_READY` | `E_NEW_INPUTS` | `G_fresh` | `RUN_READY` | update next-case policy and rebuild base args |
| `RUNNING` | `E_STEP_DONE` | `G_fresh` | `RUNNING` | append telemetry step from history stream |
| `RUNNING` | `E_RUN_OK` | `G_fresh` | `RUN_COMPLETED` | finalize metrics and compute row-level ratios |
| `RUNNING` | `E_RUN_FAIL` | `G_fresh` | `RUN_COMPLETED` | persist failure row with `toolchain_gate_ok=False` |
| `RUNNING` | `E_RUN_FAIL` | `~G_fresh` | `RUNNING` | drop stale result, retain active run |
| `RUNNING` | `E_TIMEOUT` | `G_retry_budget` | `ERROR` | cancel active run; retry with backoff follows via `E_RETRY` |
| `RUN_COMPLETED` | `E_CASE_START` | `G_fresh` | `TEARDOWN` | collect manifest + append run_result |
| `RUN_COMPLETED` | `E_CANCEL` | `~G_cancel_requested` | `TEARDOWN` | mark incomplete row and break loops |
| `TEARDOWN` | `E_STEP_DONE` | `True` | `TEARDOWN` | continue writing CSV artifact updates |
| `TEARDOWN` | `E_RUN_OK` | cases remain | `GATE_WAIT` | schedule next case |
| `TEARDOWN` | `E_RUN_OK` | no cases remain | `IDLE` | emit final report paths |
| `ERROR` | `E_RETRY` | `G_retry_budget` | `GATE_WAIT` | re-run last case with updated seed/backoff |
| `ERROR` | `E_CANCEL` | `True` | `CANCELLED` | stop scheduling, persist partial report |
| `CANCELLED` | `E_UNMOUNT` | `True` | `IDLE` | flush pending writes, close handles |
| any | `E_CANCEL` | `G_cancel_requested` | `CANCELLED` | set abort flag and stop future case launches |
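
A table like the one above maps naturally onto a lookup keyed by `(state, event)`. The sketch below encodes only a subset of the rows and treats guards as precomputed booleans. `TRANSITIONS`, `DEFAULT_GUARDS`, and `step` are hypothetical names, and leaving the state unchanged for unmatched events is an assumption, not documented behaviour.

```python
from typing import Callable, Dict, List, Tuple

Guards = Dict[str, bool]
Rule = Tuple[Callable[[Guards], bool], str]   # (guard predicate, next state)

# Guard values assumed when the caller does not supply them.
DEFAULT_GUARDS: Guards = {
    "output_ok": True, "strict_mode": False, "gate_requires_coreml": False,
    "fresh": True, "cancel_requested": False, "retry_budget": True,
}

TRANSITIONS: Dict[Tuple[str, str], List[Rule]] = {
    ("IDLE", "E_CONFIG_PARSED"): [
        (lambda g: g["output_ok"], "PRE_FLIGHT"),
        (lambda g: not g["output_ok"], "ERROR"),
    ],
    ("PRE_FLIGHT", "E_PRE_FLIGHT_OK"): [(lambda g: True, "GATE_WAIT")],
    ("PRE_FLIGHT", "E_PRE_FLIGHT_FAIL"): [(lambda g: True, "ERROR")],
    ("GATE_WAIT", "E_GATE_CHECK_OK"): [
        (lambda g: not g["gate_requires_coreml"] or not g["strict_mode"], "RUN_READY"),
    ],
    ("GATE_WAIT", "E_GATE_CHECK_FAIL_WARN"): [(lambda g: not g["strict_mode"], "RUN_READY")],
    ("GATE_WAIT", "E_GATE_CHECK_FAIL_ERROR"): [(lambda g: g["strict_mode"], "ERROR")],
    ("RUN_READY", "E_CASE_START"): [
        (lambda g: g["fresh"] and not g["cancel_requested"], "RUNNING"),
    ],
    ("RUNNING", "E_RUN_OK"): [(lambda g: g["fresh"], "RUN_COMPLETED")],
    ("RUNNING", "E_RUN_FAIL"): [
        (lambda g: g["fresh"], "RUN_COMPLETED"),
        (lambda g: not g["fresh"], "RUNNING"),   # stale result is dropped
    ],
    ("ERROR", "E_RETRY"): [(lambda g: g["retry_budget"], "GATE_WAIT")],
}


def step(state: str, event: str, guards: Guards) -> str:
    """Return the next state for one event; unmatched events keep the state."""
    g = {**DEFAULT_GUARDS, **guards}
    # Wildcard row: E_CANCEL with the cancel flag set wins from any state.
    if event == "E_CANCEL" and g["cancel_requested"]:
        return "CANCELLED"
    for guard, next_state in TRANSITIONS.get((state, event), []):
        if guard(g):
            return next_state
    return state
```

For example, `step("GATE_WAIT", "E_GATE_CHECK_FAIL_ERROR", {"strict_mode": True})` yields `"ERROR"`, while the same event with `strict_mode` unset matches no row and leaves the controller in `GATE_WAIT`.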
## Mermaid
```mermaid
stateDiagram-v2
[*] --> IDLE
IDLE --> PRE_FLIGHT : E_CONFIG_PARSED / capture_manifest
IDLE --> ERROR : E_CONFIG_PARSED / fail_fast
PRE_FLIGHT --> GATE_WAIT : E_PRE_FLIGHT_OK
PRE_FLIGHT --> ERROR : E_PRE_FLIGHT_FAIL
GATE_WAIT --> RUN_READY : E_GATE_CHECK_OK
GATE_WAIT --> RUN_READY : E_GATE_CHECK_FAIL_WARN
GATE_WAIT --> ERROR : E_GATE_CHECK_FAIL_ERROR
RUN_READY --> RUNNING : E_CASE_START
RUNNING --> RUNNING : E_STEP_DONE
RUNNING --> RUN_COMPLETED : E_RUN_OK
RUNNING --> RUN_COMPLETED : E_RUN_FAIL
RUNNING --> ERROR : E_TIMEOUT
RUN_COMPLETED --> TEARDOWN : E_CASE_START
TEARDOWN --> GATE_WAIT : E_RUN_OK / next_case
TEARDOWN --> IDLE : E_RUN_OK / all_cases_done
ERROR --> GATE_WAIT : E_RETRY
ERROR --> CANCELLED : E_CANCEL
CANCELLED --> IDLE : E_UNMOUNT
RUN_READY --> CANCELLED : E_CANCEL
RUNNING --> CANCELLED : E_CANCEL
RUN_COMPLETED --> CANCELLED : E_CANCEL
IDLE --> [*] : process_end
```
## Race and stale-event handling
- Older in-flight run events are ignored using `run_id` guard (`G_fresh`).
- If `E_NEW_INPUTS` arrives while `RUNNING`, the latest override is applied only after the current run enters `TEARDOWN`.
- `E_CANCEL` always has priority over `E_STEP_DONE` and transitions directly to `CANCELLED`.
- On unmount, only the latest active run ID is allowed to persist output; stale completions are dropped.
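
The rules above can be illustrated with a small tracker that drops stale events by `run_id`, defers new inputs until teardown, and lets `E_CANCEL` jump the event queue. `RunTracker` and `pick_next` are hypothetical helpers written for this sketch, not part of the benchmark suite.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class RunTracker:
    run_id: int
    state: str = "RUNNING"
    telemetry: List[Dict[str, Any]] = field(default_factory=list)
    active_inputs: Dict[str, Any] = field(default_factory=dict)
    pending_inputs: Optional[Dict[str, Any]] = None

    def on_step_done(self, event_run_id: int, payload: Dict[str, Any]) -> bool:
        """Append telemetry only for the current run (the G_fresh check)."""
        if event_run_id != self.run_id:
            return False               # stale event from an older run
        self.telemetry.append(payload)
        return True

    def on_new_inputs(self, overrides: Dict[str, Any]) -> None:
        """While RUNNING, park the override; the latest one wins."""
        if self.state == "RUNNING":
            self.pending_inputs = overrides
        else:
            self.active_inputs = overrides

    def enter_teardown(self) -> None:
        """Apply any parked override once the run reaches TEARDOWN."""
        self.state = "TEARDOWN"
        if self.pending_inputs is not None:
            self.active_inputs = self.pending_inputs
            self.pending_inputs = None


def pick_next(queue: List[str]) -> str:
    """Pop the next event name, giving E_CANCEL priority over everything else."""
    if "E_CANCEL" in queue:
        queue.remove("E_CANCEL")
        return "E_CANCEL"
    return queue.pop(0)
```

Parking only the most recent override keeps the running cell's frozen args untouched while still honouring the user's latest request for the next case.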
## Edge-coverage tests
1. Start in `IDLE`; with the strict gate in `error` mode and CoreML missing => `GATE_WAIT -> ERROR`.
2. In warn mode with CoreML missing => `GATE_WAIT -> RUN_READY` with warning metadata.
3. A stale result arrives in `RUNNING` after the next run has started => stale event dropped, active run continues.
4. `E_CANCEL` during `RUNNING` => no additional cases are launched after the current step.
5. Retry path from `ERROR` executes when retry budget remains and clears last failed case cache.
6. `toolchain_gate_coreml_issue` populated only when gate failure string contains coreml keywords.