ane-kan-runtime / docs /benchmark_controller_fsm.md

JohnGenetica

Deploy ANE KAN runtime Space

201cf4d verified 16 days ago

	# Benchmark Controller FSM (Strict-safe M3 Sweep)

	## Component and scope
	Component: benchmark orchestration path in `training/kan_benchmark_suite.py` that drives matrix runs, applies toolchain gates, and writes per-cell telemetry.

	External dependencies: command-line arguments, toolchain manifest cache, optional training subprocess, filesystem outputs, and user cancellation.

	## State list (mutually exclusive)
	1. `IDLE`
	No run initialized. Invariants: no active case, no snapshot timer.
	2. `PRE_FLIGHT`
	Manifest/cached toolchain status loaded. Invariants: manifest snapshot hash set.
	3. `GATE_WAIT`
	Running toolchain gate checks for a case. Invariants: pending case id is current.
	4. `RUN_READY`
	All non-strict checks have passed and the base args are frozen for the run. Invariants: `kernel_profile`, `runtime_backend_plan`, and sweep params are consistent.
	5. `RUNNING`
	One benchmark cell is executing. Invariants: `run_id` and `seed` are assigned; history stream is hot.
	6. `RUN_COMPLETED`
	Current cell finished and result is buffered. Invariants: final metrics exist or failure row written.
	7. `TEARDOWN`
	Persisting run row and cleaning per-run artifacts. Invariants: output files open.
	8. `ERROR`
	Hard gate or runtime failure; may still emit failure row in non-blocking cases.
	9. `CANCELLED`
	User-initiated cancel/unmount; active run is aborted and best-effort persisted.
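
The state list above maps naturally onto an enum plus a per-state invariant check. The following is a minimal sketch, not the code in `training/kan_benchmark_suite.py`: the `Context` dataclass and its field names are assumptions made for illustration, and only three of the listed invariants are encoded.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class State(Enum):
    """Mutually exclusive controller states, in the order listed above."""
    IDLE = auto()
    PRE_FLIGHT = auto()
    GATE_WAIT = auto()
    RUN_READY = auto()
    RUNNING = auto()
    RUN_COMPLETED = auto()
    TEARDOWN = auto()
    ERROR = auto()
    CANCELLED = auto()


@dataclass
class Context:
    # Hypothetical field names; the real controller may store these differently.
    active_case: Optional[str] = None
    snapshot_timer_active: bool = False
    manifest_hash: Optional[str] = None
    run_id: Optional[int] = None
    seed: Optional[int] = None


def invariant_holds(state: State, ctx: Context) -> bool:
    """Check the invariants of three states; all other states pass trivially."""
    if state is State.IDLE:
        # No active case, no snapshot timer.
        return ctx.active_case is None and not ctx.snapshot_timer_active
    if state is State.PRE_FLIGHT:
        # Manifest snapshot hash must be set.
        return ctx.manifest_hash is not None
    if state is State.RUNNING:
        # run_id and seed must both be assigned (0 is a valid seed).
        return ctx.run_id is not None and ctx.seed is not None
    return True
```

Calling `invariant_holds` on every state entry is one cheap way to catch a transition that leaves the controller in an inconsistent state.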

	## Events
	- `E_INIT`
	- `E_CONFIG_PARSED`
	- `E_PRE_FLIGHT_OK`
	- `E_PRE_FLIGHT_FAIL`
	- `E_GATE_CHECK_OK`
	- `E_GATE_CHECK_FAIL_WARN`
	- `E_GATE_CHECK_FAIL_ERROR`
	- `E_CASE_START`
	- `E_STEP_DONE`
	- `E_RUN_OK`
	- `E_RUN_FAIL`
	- `E_RETRY`
	- `E_CANCEL`
	- `E_TIMEOUT`
	- `E_UNMOUNT`
	- `E_STALE_EVENT(older_run_id)`
	- `E_NEW_INPUTS`

	## Guards
	- `G_strict_mode`: strict CoreML mode is active.
	- `G_gate_requires_coreml`: case path requires strict CoreML visibility.
	- `G_case_requires_ane`: current case runtime plan is ANE/HYBRID.
	- `G_retry_budget`: remaining retries > 0.
	- `G_fresh`: event run id matches current `run_id`.
	- `G_output_ok`: output directory writable.
	- `G_cancel_requested`: cancellation flag set.
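
Each guard above is a side-effect-free predicate over controller state. A sketch of how they might look as plain functions follows; `GuardContext`, its field names, and the `"CPU"` default plan are hypothetical, and only the `ANE`/`HYBRID` plan names come from the guard list.

```python
import os
from dataclasses import dataclass


@dataclass
class GuardContext:
    # All field names are hypothetical stand-ins for controller state.
    strict_mode: bool = False
    case_requires_coreml: bool = False
    runtime_plan: str = "CPU"  # "CPU" is an assumed non-ANE plan name
    retries_left: int = 0
    run_id: int = 0
    output_dir: str = "."
    cancel_requested: bool = False


def g_strict_mode(ctx: GuardContext) -> bool:
    return ctx.strict_mode


def g_gate_requires_coreml(ctx: GuardContext) -> bool:
    return ctx.case_requires_coreml


def g_case_requires_ane(ctx: GuardContext) -> bool:
    return ctx.runtime_plan in {"ANE", "HYBRID"}


def g_retry_budget(ctx: GuardContext) -> bool:
    return ctx.retries_left > 0


def g_fresh(ctx: GuardContext, event_run_id: int) -> bool:
    # An event is fresh only if it was issued for the current run.
    return event_run_id == ctx.run_id


def g_output_ok(ctx: GuardContext) -> bool:
    # Directory must exist and be writable by the current process.
    return os.path.isdir(ctx.output_dir) and os.access(ctx.output_dir, os.W_OK)


def g_cancel_requested(ctx: GuardContext) -> bool:
    return ctx.cancel_requested
```

Keeping guards free of side effects means the transition table can evaluate them in any order without changing the outcome.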

	## Side effects
	- Build environment manifest + cache lookup (`_collect_toolchain_manifest`).
	- Evaluate gate (`_evaluate_toolchain_gate`).
	- Create per-case output directory.
	- Instantiate training args (`_set_args_from_base`) and invoke `run_training`.
	- Write per-cell run JSON + summary CSV.
	- Emit console warning/error lines.
	- On cancel/unmount: clear in-flight worker handles and skip remaining scheduled cases.
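
As an illustration of the "per-cell run JSON + summary CSV" side effect, here is a small sketch using only the standard library. The function name `persist_cell`, the file layout, and the row keys are assumptions; the actual suite may name and arrange its outputs differently. The sketch also assumes every row carries the same set of keys.

```python
import csv
import json
from pathlib import Path


def persist_cell(out_dir: Path, row: dict) -> Path:
    """Write one per-cell JSON file and append the same row to summary.csv."""
    # Per-case output directory (created on demand).
    case_dir = out_dir / str(row["case_id"])
    case_dir.mkdir(parents=True, exist_ok=True)

    # Per-cell JSON, one file per run.
    json_path = case_dir / f"run_{row['run_id']}.json"
    json_path.write_text(json.dumps(row, indent=2, sort_keys=True))

    # Summary CSV: write the header only when the file is first created.
    summary = out_dir / "summary.csv"
    new_file = not summary.exists()
    with summary.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=sorted(row))
        if new_file:
            writer.writeheader()
        writer.writerow(row)
    return json_path
```

Failure rows can go through the same path, for example with `toolchain_gate_ok` set to `False`, so that the summary has one line per attempted cell.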

	## Transition table
	| state | event | guard | next state | actions |
	|---|---|---|---|---|
	| `IDLE` | `E_CONFIG_PARSED` | `G_output_ok` | `PRE_FLIGHT` | capture manifest and persist suite manifest |
	| `IDLE` | `E_CONFIG_PARSED` | `~G_output_ok` | `ERROR` | fail fast, emit manifest I/O error |
	| `PRE_FLIGHT` | `E_PRE_FLIGHT_OK` | `True` | `GATE_WAIT` | compute suite defaults and base args |
	| `PRE_FLIGHT` | `E_PRE_FLIGHT_FAIL` | `True` | `ERROR` | add gate diagnostics row, continue if warn |
	| `GATE_WAIT` | `E_GATE_CHECK_OK` | `~G_gate_requires_coreml OR ~G_strict_mode` | `RUN_READY` | record toolchain_gate_issues (empty) |
	| `GATE_WAIT` | `E_GATE_CHECK_FAIL_WARN` | `~G_strict_mode` | `RUN_READY` | record issues; mark warning metadata |
	| `GATE_WAIT` | `E_GATE_CHECK_FAIL_ERROR` | `G_strict_mode` | `ERROR` | throw/fail row with `coreml` reason |
	| `RUN_READY` | `E_CASE_START` | `G_fresh AND ~G_cancel_requested` | `RUNNING` | set `run_id`, `seed`, case overrides |
	| `RUN_READY` | `E_NEW_INPUTS` | `G_fresh` | `RUN_READY` | update next-case policy and rebuild base args |
	| `RUNNING` | `E_STEP_DONE` | `G_fresh` | `RUNNING` | append telemetry step from history stream |
	| `RUNNING` | `E_RUN_OK` | `G_fresh` | `RUN_COMPLETED` | finalize metrics and compute row-level ratios |
	| `RUNNING` | `E_RUN_FAIL` | `G_fresh` | `RUN_COMPLETED` | persist failure row with `toolchain_gate_ok=False` |
	| `RUNNING` | `E_RUN_FAIL` | `~G_fresh` | `RUNNING` | drop stale result, retain active run |
	| `RUNNING` | `E_TIMEOUT` | `G_retry_budget` | `ERROR` | cancel/retry with backoff policy |
	| `RUN_COMPLETED` | `E_CASE_START` | `G_fresh` | `TEARDOWN` | collect manifest + append run_result |
	| `RUN_COMPLETED` | `E_CANCEL` | `~G_cancel_requested` | `TEARDOWN` | mark incomplete row and break loops |
	| `TEARDOWN` | `E_STEP_DONE` | `True` | `TEARDOWN` | continue writing CSV artifact updates |
	| `TEARDOWN` | `E_RUN_OK` | `run remaining cases` | `GATE_WAIT` | schedule next case |
	| `TEARDOWN` | `E_RUN_OK` | `~run remaining cases` | `IDLE` | emit final report paths |
	| `ERROR` | `E_RETRY` | `G_retry_budget` | `GATE_WAIT` | re-run last case with updated seed/backoff |
	| `ERROR` | `E_CANCEL` | `True` | `CANCELLED` | stop scheduling, persist partial report |
	| `CANCELLED` | `E_UNMOUNT` | `True` | `IDLE` | flush pending writes, close handles |
	| any | `E_CANCEL` | `G_cancel_requested` | `CANCELLED` | set abort flag and stop future case launches |
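
A transition table like this one is commonly implemented as a lookup keyed by `(state, event)` with an ordered list of guarded rows. The sketch below encodes only a subset of the rows and reads guard values from a plain dict whose keys are hypothetical. Returning the current state when no row matches is also an assumption, since the table does not say what happens to unlisted combinations.

```python
from typing import Callable, Dict, List, Tuple

Guard = Callable[[dict], bool]
Row = Tuple[Guard, str]  # (guard predicate, next state)

# Subset of the transition table; guard flags live in a plain dict.
TABLE: Dict[Tuple[str, str], List[Row]] = {
    ("IDLE", "E_CONFIG_PARSED"): [
        (lambda c: c["output_ok"], "PRE_FLIGHT"),
        (lambda c: not c["output_ok"], "ERROR"),
    ],
    ("PRE_FLIGHT", "E_PRE_FLIGHT_OK"): [(lambda c: True, "GATE_WAIT")],
    ("GATE_WAIT", "E_GATE_CHECK_FAIL_WARN"): [
        (lambda c: not c["strict_mode"], "RUN_READY"),
    ],
    ("GATE_WAIT", "E_GATE_CHECK_FAIL_ERROR"): [
        (lambda c: c["strict_mode"], "ERROR"),
    ],
    ("RUN_READY", "E_CASE_START"): [
        (lambda c: c["fresh"] and not c["cancel_requested"], "RUNNING"),
    ],
    ("RUNNING", "E_RUN_OK"): [(lambda c: c["fresh"], "RUN_COMPLETED")],
    ("RUNNING", "E_RUN_FAIL"): [
        (lambda c: c["fresh"], "RUN_COMPLETED"),
        (lambda c: not c["fresh"], "RUNNING"),  # stale result dropped
    ],
    ("ERROR", "E_RETRY"): [(lambda c: c["retry_budget"], "GATE_WAIT")],
}


def step(state: str, event: str, ctx: dict) -> str:
    """Return the next state for one event."""
    # Wildcard row: E_CANCEL with the cancel flag set wins from any state.
    if event == "E_CANCEL" and ctx["cancel_requested"]:
        return "CANCELLED"
    for guard, next_state in TABLE.get((state, event), []):
        if guard(ctx):
            return next_state
    # No matching row: stay put (assumed behaviour, not stated in the table).
    return state
```

Checking the wildcard `E_CANCEL` row before the per-state rows keeps the cancel path uniform across states.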

	## Mermaid
	```mermaid
	stateDiagram-v2
	[*] --> IDLE
	IDLE --> PRE_FLIGHT : E_CONFIG_PARSED / capture_manifest
	IDLE --> ERROR : E_CONFIG_PARSED / fail_fast
	PRE_FLIGHT --> GATE_WAIT : E_PRE_FLIGHT_OK
	PRE_FLIGHT --> ERROR : E_PRE_FLIGHT_FAIL
	GATE_WAIT --> RUN_READY : E_GATE_CHECK_OK
	GATE_WAIT --> RUN_READY : E_GATE_CHECK_FAIL_WARN
	GATE_WAIT --> ERROR : E_GATE_CHECK_FAIL_ERROR
	RUN_READY --> RUNNING : E_CASE_START
	RUN_READY --> RUN_READY : E_NEW_INPUTS
	RUNNING --> RUNNING : E_STEP_DONE
	RUNNING --> RUN_COMPLETED : E_RUN_OK
	RUNNING --> RUN_COMPLETED : E_RUN_FAIL
	RUNNING --> ERROR : E_TIMEOUT
	RUN_COMPLETED --> TEARDOWN : E_CASE_START
	TEARDOWN --> TEARDOWN : E_STEP_DONE
	TEARDOWN --> GATE_WAIT : E_RUN_OK / schedule_next_case
	TEARDOWN --> IDLE : E_RUN_OK / emit_final_report
	ERROR --> GATE_WAIT : E_RETRY
	ERROR --> CANCELLED : E_CANCEL
	CANCELLED --> IDLE : E_UNMOUNT
	RUN_READY --> CANCELLED : E_CANCEL
	RUNNING --> CANCELLED : E_CANCEL
	RUN_COMPLETED --> CANCELLED : E_CANCEL
	IDLE --> [*] : process_end
	```

	## Race and stale-event handling
	- Older in-flight run events are ignored using `run_id` guard (`G_fresh`).
	- If `E_NEW_INPUTS` arrives while `RUNNING`, it is deferred: the latest override is applied only after the current run enters `TEARDOWN`.
	- `E_CANCEL` always has priority over `E_STEP_DONE` and transitions directly to `CANCELLED`.
	- On unmount, only the latest active run ID is allowed to persist output; stale completions are dropped.
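
The rules above can be sketched as a small filter that sits in front of the state machine. This is an illustrative model only: the `RaceFilter` class, the `(name, run_id)` event shape, and letting `E_CANCEL` win regardless of its `run_id` are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Event = Tuple[str, int]  # (event name, run_id the event was issued for)


@dataclass
class RaceFilter:
    current_run_id: int
    state: str = "RUNNING"
    pending_inputs: Optional[dict] = None  # E_NEW_INPUTS held back while RUNNING
    applied_inputs: Optional[dict] = None

    def pick(self, batch: List[Event]) -> Optional[Event]:
        """Choose the next event from a batch that arrived together."""
        # Cancel outranks everything else, including E_STEP_DONE.
        for event in batch:
            if event[0] == "E_CANCEL":
                return event
        # G_fresh: ignore events issued for an older run.
        fresh = [e for e in batch if e[1] == self.current_run_id]
        return fresh[0] if fresh else None

    def on_new_inputs(self, inputs: dict) -> None:
        """Defer overrides while a run is executing; apply them otherwise."""
        if self.state == "RUNNING":
            self.pending_inputs = inputs  # a later override replaces an earlier one
        else:
            self.applied_inputs = inputs

    def enter_teardown(self) -> None:
        """Apply any deferred override once the current run reaches TEARDOWN."""
        self.state = "TEARDOWN"
        if self.pending_inputs is not None:
            self.applied_inputs = self.pending_inputs
            self.pending_inputs = None
```

Holding only the most recent pending override matches the "latest override" wording: intermediate overrides that arrive during a run are simply superseded.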

	## Edge-coverage tests
	1. Starting from `IDLE`, a strict gate with CoreML missing in `error` mode => `GATE_WAIT -> ERROR`.
	2. Warn mode missing-coreml => `GATE_WAIT -> RUN_READY` with warning metadata.
	3. `RUNNING` stale result while next run started => stale event dropped, active run continues.
	4. `E_CANCEL` during `RUNNING` => no additional case launches after current step.
	5. Retry path from `ERROR` executes when retry budget remains and clears last failed case cache.
	6. `toolchain_gate_coreml_issue` populated only when gate failure string contains coreml keywords.
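
Test 6 lends itself to a plain unit test. The sketch below is a stand-in for the real logic: the helper `coreml_issue` and the keyword tuple are hypothetical, since this document does not list the actual keywords or show how `toolchain_gate_coreml_issue` is computed.

```python
from typing import Optional

# Hypothetical keyword list; the real keywords are not enumerated in this doc.
COREML_KEYWORDS = ("coreml", "coremltools")


def coreml_issue(gate_failure: Optional[str]) -> Optional[str]:
    """Return the failure string if it mentions CoreML, else None."""
    if not gate_failure:
        return None
    lowered = gate_failure.lower()  # case-insensitive keyword match
    if any(keyword in lowered for keyword in COREML_KEYWORDS):
        return gate_failure
    return None


def test_coreml_issue_only_on_keyword_match() -> None:
    # Populated when the failure string contains a CoreML keyword.
    assert coreml_issue("missing coremltools package") is not None
    assert coreml_issue("CoreML compiler not visible") is not None
    # Left empty for unrelated failures and for no failure at all.
    assert coreml_issue("xcode toolchain too old") is None
    assert coreml_issue(None) is None
```

The test function follows the pytest naming convention, so it can be collected as written once the helper is swapped for the real implementation.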