Benchmark Controller FSM (Strict-safe M3 Sweep)

Component and scope

Component: benchmark orchestration path in training/kan_benchmark_suite.py that drives matrix runs, applies toolchain gates, and writes per-cell telemetry.

External dependencies: command-line arguments, toolchain manifest cache, optional training subprocess, filesystem outputs, and user cancellation.

State list (mutually exclusive)

IDLE: No run initialized. Invariants: no active case, no snapshot timer.
PRE_FLIGHT: Manifest/cached toolchain status loaded. Invariants: manifest snapshot hash set.
GATE_WAIT: Running toolchain gate checks for a case. Invariants: pending case id is current.
RUN_READY: All non-strict checks passed and base args are frozen for the run. Invariants: kernel_profile, runtime_backend_plan, and sweep params are consistent.
RUNNING: One benchmark cell is executing. Invariants: run_id and seed are assigned; history stream is hot.
RUN_COMPLETED: Current cell finished and result is buffered. Invariants: final metrics exist or failure row written.
TEARDOWN: Persisting run row and cleaning per-run artifacts. Invariants: output files open.
ERROR: Hard gate or runtime failure; may still emit failure row in non-blocking cases.
CANCELLED: User-initiated cancel/unmount; active run is aborted and best-effort persisted.
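
The state list above maps naturally onto an enumeration held in a single field, which is what makes the states mutually exclusive. The following is a minimal sketch only: `BenchState`, `Controller`, and `check_invariants` are hypothetical names for illustration and are not taken from `training/kan_benchmark_suite.py`.

```python
from enum import Enum, auto


class BenchState(Enum):
    """Hypothetical enum mirroring the nine states listed above."""
    IDLE = auto()
    PRE_FLIGHT = auto()
    GATE_WAIT = auto()
    RUN_READY = auto()
    RUNNING = auto()
    RUN_COMPLETED = auto()
    TEARDOWN = auto()
    ERROR = auto()
    CANCELLED = auto()


class Controller:
    """Holds exactly one state at a time (illustrative, not the real class)."""

    def __init__(self) -> None:
        self.state = BenchState.IDLE
        self.active_case = None     # IDLE invariant: no active case
        self.snapshot_timer = None  # IDLE invariant: no snapshot timer

    def check_invariants(self) -> None:
        # Only the IDLE invariants are sketched here; other states would
        # add their own checks (run_id assigned while RUNNING, and so on).
        if self.state is BenchState.IDLE:
            assert self.active_case is None
            assert self.snapshot_timer is None
```

Calling `check_invariants()` after every transition is one way to catch a state whose documented invariants no longer hold.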

Events

E_INIT
E_CONFIG_PARSED
E_PRE_FLIGHT_OK
E_PRE_FLIGHT_FAIL
E_GATE_CHECK_OK
E_GATE_CHECK_FAIL_WARN
E_GATE_CHECK_FAIL_ERROR
E_CASE_START
E_STEP_DONE
E_RUN_OK
E_RUN_FAIL
E_RETRY
E_CANCEL
E_TIMEOUT
E_UNMOUNT
E_STALE_EVENT(older_run_id)
E_NEW_INPUTS

Guards

G_strict_mode: strict CoreML mode is active.
G_gate_requires_coreml: case path requires strict CoreML visibility.
G_case_requires_ane: current case runtime plan is ANE/HYBRID.
G_retry_budget: remaining retries > 0.
G_fresh: event run id matches current run_id.
G_output_ok: output directory writable.
G_cancel_requested: cancellation flag set.
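
The guards above can be read as pure predicates over a snapshot of controller state. The sketch below assumes a hypothetical `GuardContext` dataclass; its field names, and the plan labels other than ANE and HYBRID, are illustrative assumptions rather than the suite's actual attributes.

```python
import os
from dataclasses import dataclass


@dataclass
class GuardContext:
    """Hypothetical snapshot of the values the guards read."""
    strict_coreml: bool
    case_requires_coreml: bool
    runtime_plan: str        # "ANE" and "HYBRID" come from the doc; others assumed
    retries_left: int
    current_run_id: str
    cancel_requested: bool
    output_dir: str


def g_strict_mode(ctx: GuardContext) -> bool:
    return ctx.strict_coreml


def g_gate_requires_coreml(ctx: GuardContext) -> bool:
    return ctx.case_requires_coreml


def g_case_requires_ane(ctx: GuardContext) -> bool:
    return ctx.runtime_plan.upper() in {"ANE", "HYBRID"}


def g_retry_budget(ctx: GuardContext) -> bool:
    return ctx.retries_left > 0


def g_fresh(ctx: GuardContext, event_run_id: str) -> bool:
    # An event is fresh only if it belongs to the run currently in flight.
    return event_run_id == ctx.current_run_id


def g_output_ok(ctx: GuardContext) -> bool:
    # Directory must exist and be writable by the current process.
    return os.path.isdir(ctx.output_dir) and os.access(ctx.output_dir, os.W_OK)


def g_cancel_requested(ctx: GuardContext) -> bool:
    return ctx.cancel_requested
```

Keeping guards free of side effects means the same predicate can be evaluated in the transition logic and in tests without touching the filesystem, with `g_output_ok` as the one exception.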

Side effects

Build environment manifest + cache lookup (_collect_toolchain_manifest).
Evaluate gate (_evaluate_toolchain_gate).
Create per-case output directory.
Instantiate training args (_set_args_from_base) and invoke run_training.
Write per-cell run JSON + summary CSV.
Emit console warning/error lines.
On cancel/unmount: clear in-flight worker handles and skip remaining scheduled cases.
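
As an illustration of the "per-cell run JSON + summary CSV" side effect, here is a minimal sketch using only the standard library. The function name `persist_cell`, the file names, and the assumption that every row carries the same keys are all hypothetical; the real suite may lay out its outputs differently.

```python
import csv
import json
from pathlib import Path


def persist_cell(out_dir, row):
    """Write one JSON file per cell and append the same row to a summary CSV.

    Assumes `row` is a flat dict containing a "run_id" key and that all rows
    share the same keys (csv.DictWriter raises ValueError otherwise).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Per-cell JSON, named after the run id (naming scheme is an assumption).
    json_path = out / f"{row['run_id']}.json"
    json_path.write_text(json.dumps(row, indent=2, sort_keys=True))

    # Summary CSV: write the header only when the file is first created.
    csv_path = out / "summary.csv"
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=sorted(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return json_path, csv_path
```

Appending to the CSV after each cell, rather than once at the end, means a cancelled sweep still leaves a usable partial summary.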

Transition table

| state | event | guard | next state | actions |
| --- | --- | --- | --- | --- |
| `IDLE` | `E_CONFIG_PARSED` | `G_output_ok` | `PRE_FLIGHT` | capture manifest and persist suite manifest |
| `IDLE` | `E_CONFIG_PARSED` | `~G_output_ok` | `ERROR` | fail fast, emit manifest I/O error |
| `PRE_FLIGHT` | `E_PRE_FLIGHT_OK` | `True` | `GATE_WAIT` | compute suite defaults and base args |
| `PRE_FLIGHT` | `E_PRE_FLIGHT_FAIL` | `True` | `ERROR` | add gate diagnostics row, continue if warn |
| `GATE_WAIT` | `E_GATE_CHECK_OK` | `~G_gate_requires_coreml OR ~G_strict_mode` | `RUN_READY` | record toolchain_gate_issues (empty) |
| `GATE_WAIT` | `E_GATE_CHECK_FAIL_WARN` | `~G_strict_mode` | `RUN_READY` | record issues; mark warning metadata |
| `GATE_WAIT` | `E_GATE_CHECK_FAIL_ERROR` | `G_strict_mode` | `ERROR` | throw/fail row with `coreml` reason |
| `RUN_READY` | `E_CASE_START` | `G_fresh AND ~G_cancel_requested` | `RUNNING` | set `run_id`, `seed`, case overrides |
| `RUN_READY` | `E_NEW_INPUTS` | `G_fresh` | `RUN_READY` | update next-case policy and rebuild base args |
| `RUNNING` | `E_STEP_DONE` | `G_fresh` | `RUNNING` | append telemetry step from history stream |
| `RUNNING` | `E_RUN_OK` | `G_fresh` | `RUN_COMPLETED` | finalize metrics and compute row-level ratios |
| `RUNNING` | `E_RUN_FAIL` | `G_fresh` | `RUN_COMPLETED` | persist failure row with `toolchain_gate_ok=False` |
| `RUNNING` | `E_RUN_FAIL` | `~G_fresh` | `RUNNING` | drop stale result, retain active run |
| `RUNNING` | `E_TIMEOUT` | `G_retry_budget` | `ERROR` | cancel/retry with backoff policy |
| `RUN_COMPLETED` | `E_CASE_START` | `G_fresh` | `TEARDOWN` | collect manifest + append run_result |
| `RUN_COMPLETED` | `E_CANCEL` | `~G_cancel_requested` | `TEARDOWN` | mark incomplete row and break loops |
| `TEARDOWN` | `E_STEP_DONE` | `True` | `TEARDOWN` | continue writing CSV artifact updates |
| `TEARDOWN` | `E_RUN_OK` | `run remaining cases` | `GATE_WAIT` | schedule next case |
| `TEARDOWN` | `E_RUN_OK` | `~run remaining cases` | `IDLE` | emit final report paths |
| `ERROR` | `E_RETRY` | `G_retry_budget` | `GATE_WAIT` | re-run last case with updated seed/backoff |
| `ERROR` | `E_CANCEL` | `True` | `CANCELLED` | stop scheduling, persist partial report |
| `CANCELLED` | `E_UNMOUNT` | `True` | `IDLE` | flush pending writes, close handles |
| any | `E_CANCEL` | `G_cancel_requested` | `CANCELLED` | set abort flag and stop future case launches |
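
A table like the one above lends itself to a data-driven dispatcher: each `(state, event)` pair maps to an ordered list of `(guard, next state)` candidates. The sketch below encodes only a handful of rows and uses a plain dict of flags as the guard context. The dict shape, the `step` function, and the choice to ignore events that match no row are assumptions for illustration, not behavior confirmed by the benchmark suite.

```python
from typing import Callable, Dict, List, Tuple

Guard = Callable[[dict], bool]

# Subset of the transition table; guards read a plain dict of flags.
TRANSITIONS: Dict[Tuple[str, str], List[Tuple[Guard, str]]] = {
    ("IDLE", "E_CONFIG_PARSED"): [
        (lambda c: c["output_ok"], "PRE_FLIGHT"),
        (lambda c: not c["output_ok"], "ERROR"),
    ],
    ("PRE_FLIGHT", "E_PRE_FLIGHT_OK"): [(lambda c: True, "GATE_WAIT")],
    ("GATE_WAIT", "E_GATE_CHECK_FAIL_WARN"): [
        (lambda c: not c["strict"], "RUN_READY"),
    ],
    ("GATE_WAIT", "E_GATE_CHECK_FAIL_ERROR"): [
        (lambda c: c["strict"], "ERROR"),
    ],
    ("ERROR", "E_RETRY"): [(lambda c: c["retries_left"] > 0, "GATE_WAIT")],
}


def step(state: str, event: str, ctx: dict) -> str:
    """Return the next state for (state, event) under the given context."""
    # Wildcard row: "any" + E_CANCEL with G_cancel_requested.
    if event == "E_CANCEL" and ctx.get("cancel_requested"):
        return "CANCELLED"
    for guard, next_state in TRANSITIONS.get((state, event), []):
        if guard(ctx):
            return next_state
    # Assumption: an event with no matching row leaves the state unchanged.
    return state
```

For example, `step("ERROR", "E_RETRY", ctx)` stays in `ERROR` once `ctx["retries_left"]` reaches zero, which mirrors the `G_retry_budget` guard on that row.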

Mermaid

stateDiagram-v2
    [*] --> IDLE
    IDLE --> PRE_FLIGHT : E_CONFIG_PARSED / capture_manifest
    PRE_FLIGHT --> GATE_WAIT : E_PRE_FLIGHT_OK
    PRE_FLIGHT --> ERROR : E_PRE_FLIGHT_FAIL
    GATE_WAIT --> RUN_READY : E_GATE_CHECK_OK
    GATE_WAIT --> RUN_READY : E_GATE_CHECK_FAIL_WARN
    GATE_WAIT --> ERROR : E_GATE_CHECK_FAIL_ERROR
    RUN_READY --> RUNNING : E_CASE_START
    RUNNING --> RUNNING : E_STEP_DONE
    RUNNING --> RUN_COMPLETED : E_RUN_OK
    RUNNING --> RUN_COMPLETED : E_RUN_FAIL
    RUN_COMPLETED --> TEARDOWN : E_CASE_START
    TEARDOWN --> GATE_WAIT : next_case
    TEARDOWN --> IDLE : all_cases_done
    ERROR --> GATE_WAIT : E_RETRY
    ERROR --> CANCELLED : E_CANCEL
    CANCELLED --> IDLE : E_UNMOUNT
    RUN_READY --> CANCELLED : E_CANCEL
    RUNNING --> CANCELLED : E_CANCEL
    RUN_COMPLETED --> CANCELLED : E_CANCEL
    IDLE --> [*] : process_end
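
One quick sanity check on a diagram like this is graph reachability: every state should be reachable from `IDLE`, and every state should be able to get back to `IDLE` so the process can end. The sketch below hard-codes the diagram's edges (event labels dropped) and runs a breadth-first search; the helper name `reachable` is illustrative.

```python
from collections import deque

# (source, destination) pairs copied from the diagram, labels omitted.
EDGES = [
    ("IDLE", "PRE_FLIGHT"),
    ("PRE_FLIGHT", "GATE_WAIT"), ("PRE_FLIGHT", "ERROR"),
    ("GATE_WAIT", "RUN_READY"), ("GATE_WAIT", "ERROR"),
    ("RUN_READY", "RUNNING"), ("RUN_READY", "CANCELLED"),
    ("RUNNING", "RUNNING"), ("RUNNING", "RUN_COMPLETED"),
    ("RUNNING", "CANCELLED"),
    ("RUN_COMPLETED", "TEARDOWN"), ("RUN_COMPLETED", "CANCELLED"),
    ("TEARDOWN", "GATE_WAIT"), ("TEARDOWN", "IDLE"),
    ("ERROR", "GATE_WAIT"), ("ERROR", "CANCELLED"),
    ("CANCELLED", "IDLE"),
]


def reachable(start, edges):
    """Return the set of states reachable from `start` (inclusive) via BFS."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


ALL_STATES = {s for edge in EDGES for s in edge}
# Every state is reachable from IDLE, and IDLE is reachable from every state.
assert reachable("IDLE", EDGES) == ALL_STATES
assert all("IDLE" in reachable(s, EDGES) for s in ALL_STATES)
```

The same check can be rerun whenever an edge is added or removed, to catch states that become dead ends.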

Race and stale-event handling

Events from older in-flight runs are ignored via the run_id guard (G_fresh).
If E_NEW_INPUTS arrives while RUNNING, the latest override is accepted only after the current run enters TEARDOWN.
E_CANCEL always takes priority over E_STEP_DONE and transitions directly to CANCELLED.
On unmount, only the latest active run ID is allowed to persist output; stale completions are dropped.
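
Two of the rules above, dropping stale events and serving `E_CANCEL` first, can be sketched as a small event-selection helper. The `Event` dataclass and `next_event` function are hypothetical, and treating `run_id=None` as "not tied to any run" is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    name: str
    run_id: Optional[str] = None  # None: event is not tied to a specific run


def next_event(pending: List[Event], current_run_id: str) -> Optional[Event]:
    """Pop the next event to process from `pending` (mutated in place).

    - Events carrying a run_id other than the current one are dropped.
    - E_CANCEL is served before any other pending event.
    """
    # Drop stale events (the G_fresh check).
    pending[:] = [e for e in pending if e.run_id in (None, current_run_id)]

    # Cancel takes priority over everything else still queued.
    for index, event in enumerate(pending):
        if event.name == "E_CANCEL":
            return pending.pop(index)

    return pending.pop(0) if pending else None
```

With a queue of `E_STEP_DONE` for the current run, `E_RUN_OK` for an older run, and an `E_CANCEL`, the helper returns the cancel first, then the step event, and never surfaces the stale `E_RUN_OK`.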

Edge-coverage tests

Start in IDLE; strict gate missing-coreml in error mode => GATE_WAIT -> ERROR.
Warn mode missing-coreml => GATE_WAIT -> RUN_READY with warning metadata.
RUNNING stale result while next run started => stale event dropped, active run continues.
E_CANCEL during RUNNING => no additional case launches after current step.
Retry path from ERROR executes when retry budget remains and clears last failed case cache.
toolchain_gate_coreml_issue populated only when gate failure string contains coreml keywords.
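
Several of these cases can be written as plain assert-style tests against a small gate helper. The sketch below is a stand-in, not the suite's `_evaluate_toolchain_gate`: the helper names, the single-entry keyword list, and the example issue strings are all invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

COREML_KEYWORDS = ("coreml",)  # assumption: the real keyword list may differ


def coreml_issue(issues: List[str]) -> Optional[str]:
    """Return the first issue mentioning a CoreML keyword, else None."""
    for message in issues:
        if any(key in message.lower() for key in COREML_KEYWORDS):
            return message
    return None


def gate_outcome(issues: List[str], strict: bool) -> Tuple[str, Dict[str, object]]:
    """Return (next state from GATE_WAIT, metadata) for the given gate issues."""
    meta: Dict[str, object] = {
        "toolchain_gate_issues": list(issues),
        "toolchain_gate_coreml_issue": coreml_issue(issues),
    }
    if issues and strict:
        return "ERROR", meta
    meta["warning"] = bool(issues)
    return "RUN_READY", meta


def test_strict_missing_coreml_goes_to_error():
    state, meta = gate_outcome(["coremltools not importable"], strict=True)
    assert state == "ERROR"
    assert meta["toolchain_gate_coreml_issue"] == "coremltools not importable"


def test_warn_mode_continues_with_warning():
    state, meta = gate_outcome(["CoreML runtime missing"], strict=False)
    assert state == "RUN_READY"
    assert meta["warning"] is True


def test_non_coreml_issue_leaves_field_empty():
    _, meta = gate_outcome(["xcrun not found"], strict=False)
    assert meta["toolchain_gate_coreml_issue"] is None
```

These functions follow the pytest naming convention, so a runner such as pytest would collect them without extra wiring.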